2025-06-04

Title: Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN

Authors: Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01977
Pdf URL: https://arxiv.org/pdf/2506.01977
Copy Paste: [[2506.01977]] Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN(https://arxiv.org/abs/2506.01977)
Keywords: diffusion, generative
Abstract: Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. A recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict the node matching between two graphs, from which both the GED and its corresponding edit path can be extracted using a traditional algorithm. However, such methods typically rely heavily on ground-truth supervision, where the ground-truth labels are often costly to obtain in real-world scenarios. In this paper, we propose GEDRanker, a novel unsupervised GAN-based framework for GED computation. Specifically, GEDRanker consists of a matching-based GED solver and introduces an interpretable preference-aware discriminator with an effective training strategy to guide the matching-based GED solver toward generating high-quality node matchings without the need for ground-truth labels. Extensive experiments on benchmark datasets demonstrate that our GEDRanker enables the matching-based GED solver to achieve near-optimal solution quality without any ground-truth supervision.
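
Note: as a rough illustration of how a node matching determines an edit cost, the sketch below scores a given matching between two small graphs. It is not GEDRanker or the solver from the paper. The function name `edit_cost_from_matching`, unit edit costs, equal node counts, and undirected simple graphs are all assumptions made for this example. Under these assumptions, the cost of any matching is an upper bound on the GED, and the minimum over all matchings equals it.

```python
def edit_cost_from_matching(labels1, edges1, labels2, edges2, matching):
    """Edit cost induced by a node matching (hypothetical helper).

    Assumes both graphs have the same number of nodes, unit costs,
    and undirected edges without self-loops.
    matching[i] is the node of graph 2 matched to node i of graph 1.
    """
    # One relabel operation per matched pair whose labels differ.
    node_cost = sum(
        1 for i, j in enumerate(matching) if labels1[i] != labels2[j]
    )
    # Map the edges of graph 1 into graph 2's node ids.
    mapped = {frozenset((matching[u], matching[v])) for u, v in edges1}
    target = {frozenset(e) for e in edges2}
    # Edges in only one of the two sets must be deleted or inserted.
    edge_cost = len(mapped ^ target)
    return node_cost + edge_cost


# Path A-B-C versus triangle A-B-D.
labels1, edges1 = ["A", "B", "C"], [(0, 1), (1, 2)]
labels2, edges2 = ["A", "B", "D"], [(0, 1), (1, 2), (0, 2)]
identity_cost = edit_cost_from_matching(
    labels1, edges1, labels2, edges2, [0, 1, 2]
)
swapped_cost = edit_cost_from_matching(
    labels1, edges1, labels2, edges2, [1, 0, 2]
)
```

In this toy case the identity matching needs one relabel and one edge insertion, while swapping the first two nodes costs more. This is the sense in which a better matching yields a tighter GED estimate.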

Title: Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification

Authors: Reyhaneh Keshavarzpour, Eghbal Mansoori
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01983
Pdf URL: https://arxiv.org/pdf/2506.01983
Copy Paste: [[2506.01983]] Improvement of AMPs Identification with Generative Adversarial Network and Ensemble Classification(https://arxiv.org/abs/2506.01983)
Keywords: generative
Abstract: Identification of antimicrobial peptides is an important and necessary issue in today's era. Antimicrobial peptides are essential as an alternative to antibiotics for biomedical applications and many other practical applications. These oligopeptides are useful in drug design and confer innate immunity against microorganisms. Artificial intelligence algorithms have played a significant role in the ease of identifying these peptides. This research improves on the proposed method in the field of antimicrobial peptide prediction. The suggested method is improved by combining the best coding methods from different perspectives, followed by a deep neural network to balance the imbalanced combined datasets. The results of this research show that the proposed method yields a significant improvement in the accuracy and efficiency of antimicrobial peptide prediction and provides the best results compared to existing methods. These developments in the prediction and classification of antimicrobial peptides are highly effective and applicable, particularly in medicine and the pharmaceutical industry.

Title: EWGN: Elastic Weight Generation and Context Switching in Deep Learning

Authors: Shriraj P. Sawant, Krishna P. Miyapuram
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02065
Pdf URL: https://arxiv.org/pdf/2506.02065
Copy Paste: [[2506.02065]] EWGN: Elastic Weight Generation and Context Switching in Deep Learning(https://arxiv.org/abs/2506.02065)
Keywords: generative
Abstract: The ability to learn and retain a wide variety of tasks is a hallmark of human intelligence that has inspired research in artificial general intelligence. Continual learning approaches provide a significant step towards achieving this goal. It has been known that task variability and context switching are challenging for learning in neural networks. Catastrophic forgetting refers to the poor performance on retention of a previously learned task when a new task is being learned. Switching between different task contexts can be a useful approach to mitigate the same by preventing the interference between the varying task weights of the network. This paper introduces Elastic Weight Generative Networks (EWGN) as an idea for context switching between two different tasks. The proposed EWGN architecture uses an additional network that generates the weights of the primary network dynamically while consolidating the weights learned. The weight generation is input-dependent and thus enables context switching. Using standard computer vision datasets, namely MNIST and fashion-MNIST, we analyse the retention of previously learned task representations in Fully Connected Networks, Convolutional Neural Networks, and EWGN architectures with Stochastic Gradient Descent and Elastic Weight Consolidation learning algorithms. Understanding dynamic weight generation and context-switching ability can be useful in enabling continual learning for improved performance.

Title: Developing a Risk Identification Framework for Foundation Model Uses

Authors: David Piorkowski, Michael Hind, John Richards, Jacquelyn Martino
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.02066
Pdf URL: https://arxiv.org/pdf/2506.02066
Copy Paste: [[2506.02066]] Developing a Risk Identification Framework for Foundation Model Uses(https://arxiv.org/abs/2506.02066)
Keywords: foundation model
Abstract: As foundation models grow in both popularity and capability, researchers have uncovered a variety of ways that the models can pose a risk to the model's owner, user, or others. Despite the efforts of measuring these risks via benchmarks and cataloging them in AI risk taxonomies, there is little guidance for practitioners on how to determine which risks are relevant for a given foundation model use. In this paper, we address this gap and develop requirements and an initial design for a risk identification framework. To do so, we look to prior literature to identify challenges for building a foundation model risk identification framework and adapt ideas from usage governance to synthesize four design requirements. We then demonstrate how a candidate framework can address these design requirements and provide a foundation model use example to show how the framework works in practice for a small subset of risks.

Title: An Introduction to Flow Matching and Diffusion Models

Authors: Peter Holderrieth, Ezra Erives
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02070
Pdf URL: https://arxiv.org/pdf/2506.02070
Copy Paste: [[2506.02070]] An Introduction to Flow Matching and Diffusion Models(https://arxiv.org/abs/2506.02070)
Keywords: diffusion, generative
Abstract: Diffusion and flow-based models have become the state of the art for generative AI across a wide range of data modalities, including images, videos, shapes, molecules, music, and more! These notes are originally from this https URL, as taught at MIT over the 2025 IAP (winter) term, and are intended to accompany other course content, including lectures and labs. Overall, they function as a self-contained introduction to both flow matching and diffusion models, starting with ordinary and stochastic differential equations, and culminating in flow matching, score matching, classifier-free guidance, and the inner workings of modern, state-of-the-art models for image and video. These notes, and the accompanying course, are ideal for students and practitioners alike who want to develop a principled understanding of the theory and practice of generative AI.

Title: RATFM: Retrieval-augmented Time Series Foundation Model for Anomaly Detection

Authors: Chihiro Maru, Shoetsu Sato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02081
Pdf URL: https://arxiv.org/pdf/2506.02081
Copy Paste: [[2506.02081]] RATFM: Retrieval-augmented Time Series Foundation Model for Anomaly Detection(https://arxiv.org/abs/2506.02081)
Keywords: foundation model, anomaly
Abstract: Inspired by the success of large language models (LLMs) in natural language processing, recent research has explored the building of time series foundation models and applied them to tasks such as forecasting, classification, and anomaly detection. However, their performance varies between different domains and tasks. In LLM-based approaches, test-time adaptation using example-based prompting has become common, owing to the high cost of retraining. In the context of anomaly detection, which is the focus of this study, providing normal examples from the target domain can also be effective. However, time series foundation models do not naturally acquire the ability to interpret or utilize examples or instructions, because the nature of time series data used during training does not encourage such capabilities. To address this limitation, we propose a retrieval augmented time series foundation model (RATFM), which enables pretrained time series foundation models to incorporate examples of test-time adaptation. We show that RATFM achieves a performance comparable to that of in-domain fine-tuning while avoiding domain-dependent fine-tuning. Experiments on the UCR Anomaly Archive, a multi-domain dataset including nine domains, confirm the effectiveness of the proposed approach.

Title: Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Authors: Hyojin Bahng, Caroline Chan, Fredo Durand, Phillip Isola
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02095
Pdf URL: https://arxiv.org/pdf/2506.02095
Copy Paste: [[2506.02095]] Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences(https://arxiv.org/abs/2506.02095)
Keywords: diffusion
Abstract: Learning alignment between language and vision is a fundamental challenge, especially as multimodal data becomes increasingly detailed and complex. Existing methods often rely on collecting human or AI preferences, which can be costly and time-intensive. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image and generated text, we map the text back to image space using a text-to-image model and compute the similarity between the original image and its reconstruction. Analogously, for text-to-image generation, we measure the textual similarity between an input caption and its reconstruction through the cycle. We use the cycle consistency score to rank candidates and construct a preference dataset of 866K comparison pairs. The reward model trained on our dataset outperforms state-of-the-art alignment metrics on detailed captioning, with superior inference-time scalability when used as a verifier for Best-of-N sampling. Furthermore, performing DPO and Diffusion DPO using our dataset enhances performance across a wide range of vision-language tasks and text-to-image generation. Our dataset, model, and code are at this https URL
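
Note: the ranking step described in the abstract (score each candidate by cycle consistency, then turn the ranking into comparison pairs) can be sketched roughly as follows. This is not the authors' pipeline. `build_preference_pairs`, `text_to_image`, and `image_similarity` are hypothetical names, and the toy stand-ins at the bottom (an "image" as a set of words, Jaccard overlap as similarity) exist only to make the example runnable.

```python
from itertools import combinations


def build_preference_pairs(image, captions, text_to_image, image_similarity):
    """Rank candidate captions by cycle consistency and emit pairs.

    text_to_image and image_similarity are caller-supplied callables
    (hypothetical stand-ins for a generator and a similarity metric).
    """
    # Cycle: caption -> reconstructed image -> similarity to the original.
    scored = [
        (image_similarity(image, text_to_image(c)), c) for c in captions
    ]
    pairs = []
    for (score_a, cap_a), (score_b, cap_b) in combinations(scored, 2):
        if score_a == score_b:
            continue  # a tie carries no preference signal
        if score_a > score_b:
            pairs.append({"chosen": cap_a, "rejected": cap_b})
        else:
            pairs.append({"chosen": cap_b, "rejected": cap_a})
    return pairs


# Toy stand-ins: an "image" is a set of words, similarity is Jaccard overlap.
def toy_text_to_image(caption):
    return set(caption.split())


def toy_similarity(a, b):
    return len(a & b) / len(a | b)


pairs = build_preference_pairs(
    {"red", "car", "street"},
    ["red car street", "red car", "blue dog"],
    toy_text_to_image,
    toy_similarity,
)
```

Pairs of this chosen/rejected shape are the usual input to preference-based training such as DPO. The real score would come from an image similarity model, not word overlap.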

Title: Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability

Authors: Yarden Bakish, Itamar Zimerman, Hila Chefer, Lior Wolf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02138
Pdf URL: https://arxiv.org/pdf/2506.02138
Copy Paste: [[2506.02138]] Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability(https://arxiv.org/abs/2506.02138)
Keywords: foundation model
Abstract: The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values based on predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE), resulting in violation of the conservation property, and the loss of an important and unique type of relevance, which is also associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks. Our code is publicly available.

Title: Constrained Sliced Wasserstein Embedding

Authors: Navid NaderiAlizadeh, Darian Salehi, Xinran Liu, Soheil Kolouri
Subjects: cs.LG, cs.AI, math.OC, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02203
Pdf URL: https://arxiv.org/pdf/2506.02203
Copy Paste: [[2506.02203]] Constrained Sliced Wasserstein Embedding(https://arxiv.org/abs/2506.02203)
Keywords: foundation model
Abstract: Sliced Wasserstein (SW) distances offer an efficient method for comparing high-dimensional probability measures by projecting them onto multiple 1-dimensional probability distributions. However, identifying informative slicing directions has proven challenging, often necessitating a large number of slices to achieve desirable performance and thereby increasing computational complexity. We introduce a constrained learning approach to optimize the slicing directions for SW distances. Specifically, we constrain the 1D transport plans to approximate the optimal plan in the original space, ensuring meaningful slicing directions. By leveraging continuous relaxations of these transport plans, we enable a gradient-based primal-dual approach to train the slicer parameters, alongside the remaining model parameters. We demonstrate how this constrained slicing approach can be applied to pool high-dimensional embeddings into fixed-length permutation-invariant representations. Numerical results on foundation models trained on images, point clouds, and protein sequences showcase the efficacy of the proposed constrained learning approach in learning more informative slicing directions. Our implementation code can be found at this https URL.
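
Note: for readers unfamiliar with the baseline that the paper builds on, here is a minimal sketch of the standard Monte Carlo sliced Wasserstein estimate with random slicing directions. It does not implement the constrained, learned slicer that the paper proposes. The function name, the equal-sample-size restriction, and the default parameters are assumptions of this example.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_slices=64, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two equal-size point sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_slices):
        # Draw a random direction on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both point sets onto the direction and sort.
        px = sorted(sum(t * v for t, v in zip(theta, x)) for x in xs)
        py = sorted(sum(t * v for t, v in zip(theta, y)) for y in ys)
        # In 1D the optimal transport plan matches sorted samples.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_slices) ** (1.0 / p)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(x + 2.0, y + 2.0) for x, y in points]
d_same = sliced_wasserstein(points, points)
d_shift = sliced_wasserstein(points, shifted)
```

Because each slice only sorts the projected samples, the result does not depend on the order of the input points. That property is what makes sliced embeddings usable for permutation-invariant pooling.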

Title: Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

Authors: Johannes Schusterbauer, Ming Gui, Frank Fundel, Björn Ommer
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02221
Pdf URL: https://arxiv.org/pdf/2506.02221
Copy Paste: [[2506.02221]] Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment(https://arxiv.org/abs/2506.02221)
Keywords: diffusion, generative
Abstract: Diffusion models have revolutionized generative tasks through high-fidelity outputs, yet flow matching (FM) offers faster inference and empirical performance gains. However, current foundation FM models are computationally prohibitive for finetuning, while diffusion models like Stable Diffusion benefit from efficient architectures and ecosystem support. This work addresses the critical challenge of efficiently transferring knowledge from pre-trained diffusion models to flow matching. We propose Diff2Flow, a novel framework that systematically bridges diffusion and FM paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields from diffusion predictions. This alignment enables direct and efficient FM finetuning of diffusion priors with no extra computation overhead. Our experiments demonstrate that Diff2Flow outperforms naïve FM and diffusion finetuning particularly under parameter-efficient constraints, while achieving superior or competitive performance across diverse downstream tasks compared to state-of-the-art methods. We will release our code at this https URL.

Title: Investigating the Impact of Word Informativeness on Speech Emotion Recognition

Authors: Sofoklis Kakouros
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02239
Pdf URL: https://arxiv.org/pdf/2506.02239
Copy Paste: [[2506.02239]] Investigating the Impact of Word Informativeness on Speech Emotion Recognition(https://arxiv.org/abs/2506.02239)
Keywords: self-supervised
Abstract: In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.

Title: Motion aware video generative model

Authors: Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02244
Pdf URL: https://arxiv.org/pdf/2506.02244
Copy Paste: [[2506.02244]] Motion aware video generative model(https://arxiv.org/abs/2506.02244)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion-based video generation have yielded unprecedented quality in visual content and semantic coherence. However, current approaches predominantly rely on statistical learning from vast datasets without explicitly modeling the underlying physics of motion, resulting in subtle yet perceptible non-physical artifacts that diminish the realism of generated videos. This paper introduces a physics-informed frequency domain approach to enhance the physical plausibility of generated videos. We first conduct a systematic analysis of the frequency-domain characteristics of diverse physical motions (translation, rotation, scaling), revealing that each motion type exhibits distinctive and identifiable spectral signatures. Building on this theoretical foundation, we propose two complementary components: (1) a physical motion loss function that quantifies and optimizes the conformity of generated videos to ideal frequency-domain motion patterns, and (2) a frequency domain enhancement module that progressively learns to adjust video features to conform to physical motion constraints while preserving original network functionality through a zero-initialization strategy. Experiments across multiple video diffusion architectures demonstrate that our approach significantly enhances motion quality and physical plausibility without compromising visual quality or semantic alignment. Our frequency-domain physical motion framework generalizes effectively across different video generation architectures, offering a principled approach to incorporating physical constraints into deep learning-based video synthesis pipelines. This work seeks to establish connections between data-driven models and physics-based motion models.

Title: Latent Stochastic Interpolants

Authors: Saurabh Singh, Dmitry Lagun
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02276
Pdf URL: https://arxiv.org/pdf/2506.02276
Copy Paste: [[2506.02276]] Latent Stochastic Interpolants(https://arxiv.org/abs/2506.02276)
Keywords: diffusion, generative
Abstract: Stochastic Interpolants (SI) are a powerful framework for generative modeling, capable of flexibly transforming between two probability distributions. However, their use in jointly optimized latent variable models remains unexplored as they require direct access to the samples from the two distributions. This work presents Latent Stochastic Interpolants (LSI) enabling joint learning in a latent space with end-to-end optimized encoder, decoder and latent SI models. We achieve this by developing a principled Evidence Lower Bound (ELBO) objective derived directly in continuous time. The joint optimization allows LSI to learn effective latent representations along with a generative process that transforms an arbitrary prior distribution into the encoder-defined aggregated posterior. LSI sidesteps the simple priors of the normal diffusion models and mitigates the computational demands of applying SI directly in high-dimensional observation spaces, while preserving the generative flexibility of the SI framework. We demonstrate the efficacy of LSI through comprehensive experiments on the standard large scale ImageNet generation benchmark.

Title: Sounding Like a Winner? Prosodic Differences in Post-Match Interviews

Authors: Sofoklis Kakouros, Haoyu Chen
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02283
Pdf URL: https://arxiv.org/pdf/2506.02283
Copy Paste: [[2506.02283]] Sounding Like a Winner? Prosodic Differences in Post-Match Interviews(https://arxiv.org/abs/2506.02283)
Keywords: self-supervised
Abstract: This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues -- such as pitch variability -- remain strong indicators of victory.

Title: Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation

Authors: Niclas Popp, Kevin Alexander Laube, Matthias Hein, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02294
Pdf URL: https://arxiv.org/pdf/2506.02294
Copy Paste: [[2506.02294]] Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation(https://arxiv.org/abs/2506.02294)
Keywords: diffusion, foundation model
Abstract: Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst group and mean group accuracy on CelebA and SpuCo Birds as well as the spurious mAUC on spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines.

Title: MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

Authors: Xiaojun Shan, Qi Cao, Xing Han, Haofei Yu, Paul Pu Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02308
Pdf URL: https://arxiv.org/pdf/2506.02308
Copy Paste: [[2506.02308]] MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping(https://arxiv.org/abs/2506.02308)
Keywords: foundation model
Abstract: Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to ever-larger datasets in both quantity and scale, our findings reveal that simply increasing the number of instruction-tuning tasks does not consistently yield better performance. Instead, we observe that grouping tasks by the common interactions across modalities, such as discovering redundant shared information, prioritizing modality selection with unique information, or requiring synergistic fusion to discover new information from both modalities, encourages the models to learn transferrable skills within a group while suppressing interference from mismatched tasks. To this end, we introduce MINT, a simple yet surprisingly effective task-grouping strategy based on the type of multimodal interaction. We demonstrate that the proposed method greatly outperforms existing task grouping baselines for multimodal instruction tuning, striking an effective balance between generalization and specialization.

Title: Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models

Authors: Yuchen Liang, Renxiang Huang, Lifeng Lai, Ness Shroff, Yingbin Liang
Subjects: cs.LG, eess.SP, math.ST
Abstract URL: https://arxiv.org/abs/2506.02318
Pdf URL: https://arxiv.org/pdf/2506.02318
Copy Paste: [[2506.02318]] Absorb and Converge: Provable Convergence Guarantee for Absorbing Discrete Diffusion Models(https://arxiv.org/abs/2506.02318)
Keywords: diffusion
Abstract: Discrete state space diffusion models have shown significant advantages in applications involving discrete data, such as text and image generation. It has also been observed that their performance is highly sensitive to the choice of rate matrices, particularly between uniform and absorbing rate matrices. While empirical results suggest that absorbing rate matrices often yield better generation quality compared to uniform rate matrices, existing theoretical works have largely focused on the uniform rate matrices case. Notably, convergence guarantees and error analyses for absorbing diffusion models are still missing. In this work, we provide the first finite-time error bounds and convergence rate analysis for discrete diffusion models using absorbing rate matrices. We begin by deriving an upper bound on the KL divergence of the forward process, introducing a surrogate initialization distribution to address the challenge posed by the absorbing stationary distribution, which is a singleton and causes the KL divergence to be ill-defined. We then establish the first convergence guarantees for both the $\tau$-leaping and uniformization samplers under absorbing rate matrices, demonstrating improved rates over their counterparts using uniform rate matrices. Furthermore, under suitable assumptions, we provide convergence guarantees without early stopping. Our analysis introduces several new technical tools to address challenges unique to absorbing rate matrices. These include a Jensen-type argument for bounding forward process convergence, novel techniques for bounding absorbing score functions, and a non-divergent upper bound on the score near initialization that removes the need for early stopping.
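
Note: to make the uniform versus absorbing distinction concrete, the sketch below builds both kinds of rate matrix for a small state space and samples the absorbing forward process. The row convention (entry [i][j] is the jump rate from state i to state j), the unit rate, the 1/n scaling, and all function names are assumptions of this example. The paper's normalization may differ.

```python
import math
import random


def uniform_rate_matrix(n):
    """Every state jumps to each other state at rate 1/n."""
    return [
        [-(n - 1) / n if i == j else 1.0 / n for j in range(n)]
        for i in range(n)
    ]


def absorbing_rate_matrix(n, mask):
    """Every non-mask state jumps to the mask state at rate 1."""
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i != mask:
            q[i][i] = -1.0
            q[i][mask] = 1.0
    return q  # the mask row stays zero: once absorbed, always absorbed


def absorb_forward(tokens, t, mask, rng):
    """Sample the absorbing forward process at time t (unit rate).

    Each token independently survives with probability exp(-t).
    """
    keep = math.exp(-t)
    return [tok if rng.random() < keep else mask for tok in tokens]


MASK = 3
q_uniform = uniform_rate_matrix(4)
q_absorb = absorbing_rate_matrix(4, MASK)
rng = random.Random(0)
at_start = absorb_forward([0, 1, 2], 0.0, MASK, rng)
at_late = absorb_forward([0, 1, 2], 50.0, MASK, rng)
```

As t grows, every token ends up in the mask state, so the stationary distribution is the singleton mentioned in the abstract. The uniform process instead mixes toward the uniform distribution over all states.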

Title: Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning

Authors: Yijun Yang, Zhao-Yang Wang, Qiuping Liu, Shuwen Sun, Kang Wang, Rama Chellappa, Zongwei Zhou, Alan Yuille, Lei Zhu, Yu-Dong Zhang, Jieneng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02327
Pdf URL: https://arxiv.org/pdf/2506.02327
Copy Paste: [[2506.02327]] Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning(https://arxiv.org/abs/2506.02327)
Keywords: generative
Abstract: Providing effective treatment and making informed clinical decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that visually predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting F1-score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as the second readers.

Title: Auto-Labeling Data for Object Detection

Authors: Brent A. Griffin, Manushree Gangwar, Jacob Sela, Jason J. Corso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02359
Pdf URL: https://arxiv.org/pdf/2506.02359
Copy Paste: [[2506.02359]] Auto-Labeling Data for Object Detection(https://arxiv.org/abs/2506.02359)
Keywords: foundation model
Abstract: Great labels make great models. However, traditional labeling approaches for tasks like object detection have substantial costs at scale. Furthermore, alternatives to fully-supervised object detection either lose functionality or require larger models with prohibitive computational costs for inference at scale. To that end, this paper addresses the problem of training standard object detection models without any ground truth labels. Instead, we configure previously-trained vision-language foundation models to generate application-specific pseudo "ground truth" labels. These auto-generated labels directly integrate with existing model training frameworks, and we subsequently train lightweight detection models that are computationally efficient. In this way, we avoid the costs of traditional labeling, leverage the knowledge of vision-language models, and keep the efficiency of lightweight models for practical application. We perform exhaustive experiments across multiple labeling configurations, downstream inference models, and datasets to establish best practices and set an extensive auto-labeling benchmark. From our results, we find that our approach is a viable alternative to standard labeling in that it maintains competitive performance on multiple datasets and substantially reduces labeling time and costs.

Title: Approximate Borderline Sampling using Granular-Ball for Classification Tasks

Authors: Qin Xie, Qinghua Zhang, Shuyin Xia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02366
Pdf URL: https://arxiv.org/pdf/2506.02366
Copy Paste: [[2506.02366]] Approximate Borderline Sampling using Granular-Ball for Classification Tasks(https://arxiv.org/abs/2506.02366)
Keywords: diffusion
Abstract: Data sampling enhances classifier efficiency and robustness through data compression and quality improvement. Recently, the sampling method based on granular-ball (GB) has shown promising performance in generality and noisy classification tasks. However, some limitations remain, including the absence of borderline sampling strategies and issues with class boundary blurring or shrinking due to overlap between GBs. In this paper, an approximate borderline sampling method using GBs is proposed for classification tasks. First, a restricted diffusion-based GB generation (RD-GBG) method is proposed, which prevents GB overlaps by constrained expansion, preserving precise geometric representation of GBs via redefined ones. Second, based on the concept of heterogeneous nearest neighbor, a GB-based approximate borderline sampling (GBABS) method is proposed, which is the first general sampling method capable of both borderline sampling and improving the quality of class noise datasets. Additionally, since RD-GBG incorporates noise detection and GBABS focuses on borderline samples, GBABS performs outstandingly on class noise datasets without the need for an optimal purity threshold. Experimental results demonstrate that the proposed methods outperform the GB-based sampling method and several representative sampling methods. Our source code is publicly available at this https URL.

Title: SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples

Authors: Haoye Lu, Darren Lo, Yaoliang Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02371
Pdf URL: https://arxiv.org/pdf/2506.02371
Copy Paste: [[2506.02371]] SFBD Flow: A Continuous-Optimization Framework for Training Diffusion Models with Noisy Samples(https://arxiv.org/abs/2506.02371)
Keywords: diffusion, generative
Abstract: Diffusion models achieve strong generative performance but often rely on large datasets that may include sensitive content. This challenge is compounded by the models' tendency to memorize training data, raising privacy concerns. SFBD (Lu et al., 2025) addresses this by training on corrupted data and using limited clean samples to capture local structure and improve convergence. However, its iterative denoising and fine-tuning loop requires manual coordination, making it burdensome to implement. We reinterpret SFBD as an alternating projection algorithm and introduce a continuous variant, SFBD flow, that removes the need for alternating steps. We further show its connection to consistency constraint-based methods, and demonstrate that its practical instantiation, Online SFBD, consistently outperforms strong baselines across benchmarks.

Title: Exploring Explanations Improves the Robustness of In-Context Learning

Authors: Ukyo Honda, Tatsushi Oka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02378
Pdf URL: https://arxiv.org/pdf/2506.02378
Copy Paste: [[2506.02378]] Exploring Explanations Improves the Robustness of In-Context Learning(https://arxiv.org/abs/2506.02378)
Keywords: in-context
Abstract: In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X$^2$-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X$^2$-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.

Title: The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

Authors: Xiaofeng Cong, Yu-Xin Zhang, Haoran Wei, Yeying Jin, Junming Hou, Jie Gui, Jing Zhang, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02395
Pdf URL: https://arxiv.org/pdf/2506.02395
Copy Paste: [[2506.02395]] The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception(https://arxiv.org/abs/2506.02395)
Keywords: diffusion, generative
Abstract: While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model's generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset's utility and the model's superior performance in joint haze removal and brightness mapping.

Title: Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models

Authors: Zhiya Tan, Xin Zhang, Joey Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02405
Pdf URL: https://arxiv.org/pdf/2506.02405
Copy Paste: [[2506.02405]] Modelship Attribution: Tracing Multi-Stage Manipulations Across Generative Models(https://arxiv.org/abs/2506.02405)
Keywords: generative
Abstract: As generative techniques become increasingly accessible, authentic visuals are frequently subjected to iterative alterations by various individuals employing a variety of tools. Currently, to avoid misinformation and ensure accountability, a lot of research on detection and attribution is emerging. Although these methods demonstrate promise in single-stage manipulation scenarios, they fall short when addressing complex real-world iterative manipulation. In this paper, we are the first, to the best of our knowledge, to systematically model this real-world challenge and introduce a novel method to solve it. We define a task called "Modelship Attribution", which aims to trace the evolution of manipulated images by identifying the generative models involved and reconstructing the sequence of edits they performed. To realistically simulate this scenario, we utilize three generative models, StyleMapGAN, DiffSwap, and FacePartsSwap, that sequentially modify distinct regions of the same image. This process leads to the creation of the first modelship dataset, comprising 83,700 images (16,740 images*5). Given that later edits often overwrite the fingerprints of earlier models, the focus shifts from extracting blended fingerprints to characterizing each model's distinctive editing patterns. To tackle this challenge, we introduce the modelship attribution transformer (MAT), a purpose-built framework designed to effectively recognize and attribute the contributions of various models within complex, multi-stage manipulation workflows. Through extensive experiments and comparative analysis with other related methods, our results, including comprehensive ablation studies, demonstrate that the proposed approach is a highly effective solution for modelship attribution.

Title: Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

Authors: Wenhao Tang, Rong Qin, Heng Fang, Fengtao Zhou, Hao Chen, Xiang Li, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02408
Pdf URL: https://arxiv.org/pdf/2506.02408
Copy Paste: [[2506.02408]] Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology(https://arxiv.org/abs/2506.02408)
Keywords: foundation model
Abstract: Pre-trained encoders for offline feature extraction followed by multiple instance learning (MIL) aggregators have become the dominant paradigm in computational pathology (CPath), benefiting cancer diagnosis and prognosis. However, performance limitations arise from the absence of encoder fine-tuning for downstream tasks and disjoint optimization with MIL. While slide-level supervised end-to-end (E2E) learning is an intuitive solution to this issue, it faces challenges such as high computational demands and suboptimal results. These limitations motivate us to revisit E2E learning. We argue that prior work neglects inherent E2E optimization challenges, leading to performance disparities compared to traditional two-stage methods. In this paper, we pioneer the elucidation of the optimization challenge caused by sparse-attention MIL and propose a novel MIL called ABMILX. It mitigates this problem through global correlation-based attention refinement and multi-head mechanisms. With the efficient multi-scale random patch sampling strategy, an E2E trained ResNet with ABMILX surpasses SOTA foundation models under the two-stage paradigm across multiple challenging benchmarks, while remaining computationally efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath and call for greater research focus in this area. The code is available at this https URL.

Title: SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning

Authors: Zhengyuan Liu, Geyu Lin, Hui Li Tan, Huayun Zhang, Yanfeng Lu, Xiaoxue Gao, Stella Xin Yin, He Sun, Hock Huan Goh, Lung Hsiang Wong, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02412
Pdf URL: https://arxiv.org/pdf/2506.02412
Copy Paste: [[2506.02412]] SingaKids: A Multilingual Multimodal Dialogic Tutor for Language Learning(https://arxiv.org/abs/2506.02412)
Keywords: generative
Abstract: The integration of generative artificial intelligence into educational applications has enhanced personalized and interactive learning experiences, and it shows strong potential to promote young learners' language acquisition. However, it is still challenging to ensure consistent and robust performance across different languages and cultural contexts, and kid-friendly design requires simplified instructions, engaging interactions, and age-appropriate scaffolding to maintain motivation and optimize learning outcomes. In this work, we introduce SingaKids, a dialogic tutor designed to facilitate language learning through picture description tasks. Our system integrates dense image captioning, multilingual dialogic interaction, speech understanding, and engaging speech generation to create an immersive learning environment in four languages: English, Mandarin, Malay, and Tamil. We further improve the system through multilingual pre-training, task-specific tuning, and scaffolding optimization. Empirical studies with elementary school students demonstrate that SingaKids provides effective dialogic teaching, benefiting learners at different performance levels.

Title: Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models

Authors: Nurislam Tursynbek, Hastings Greer, Basar Demir, Marc Niethammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02419
Pdf URL: https://arxiv.org/pdf/2506.02419
Copy Paste: [[2506.02419]] Guiding Registration with Emergent Similarity from Pre-Trained Diffusion Models(https://arxiv.org/abs/2506.02419)
Keywords: diffusion
Abstract: Diffusion models, while trained for image generation, have emerged as powerful foundational feature extractors for downstream tasks. We find that off-the-shelf diffusion models, trained exclusively to generate natural RGB images, can identify semantically meaningful correspondences in medical images. Building on this observation, we propose to leverage diffusion model features as a similarity measure to guide deformable image registration networks. We show that common intensity-based similarity losses often fail in challenging scenarios, such as when certain anatomies are visible in one image but absent in another, leading to anatomically inaccurate alignments. In contrast, our method identifies true semantic correspondences, aligning meaningful structures while disregarding those not present across images. We demonstrate superior performance of our approach on two tasks: multimodal 2D registration (DXA to X-Ray) and monomodal 3D registration (brain-extracted to non-brain-extracted MRI). Code: this https URL

Title: Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals

Authors: Weiheng Yao, Xuhang Chen, Shuqiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02433
Pdf URL: https://arxiv.org/pdf/2506.02433
Copy Paste: [[2506.02433]] Empowering Functional Neuroimaging: A Pre-trained Generative Framework for Unified Representation of Neural Signals(https://arxiv.org/abs/2506.02433)
Keywords: generative
Abstract: Multimodal functional neuroimaging enables systematic analysis of brain mechanisms and provides discriminative representations for brain-computer interface (BCI) decoding. However, its acquisition is constrained by high costs and feasibility limitations. Moreover, underrepresentation of specific groups undermines the fairness of BCI decoding models. To address these challenges, we propose a unified representation framework for multimodal functional neuroimaging via generative artificial intelligence (AI). By mapping multimodal functional neuroimaging into a unified representation space, the proposed framework is capable of generating data for acquisition-constrained modalities and underrepresented groups. Experiments show that the framework can generate data consistent with real brain activity patterns, provide insights into brain mechanisms, and improve performance on downstream tasks. More importantly, it can enhance model fairness by augmenting data for underrepresented groups. Overall, the framework offers a new paradigm for decreasing the cost of acquiring multimodal functional neuroimages and enhancing the fairness of BCI decoding models.

Title: SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios

Authors: Lingwei Dang, Ruizhi Shao, Hongwen Zhang, Wei Min, Yebin Liu, Qingyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02444
Pdf URL: https://arxiv.org/pdf/2506.02444
Copy Paste: [[2506.02444]] SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios(https://arxiv.org/abs/2506.02444)
Keywords: diffusion
Abstract: Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature aligning, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at this https URL.

Title: ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model

Authors: Wenshuo Chen, Kuimou Yu, Haozhe Jia, Kaishen Yuan, Bowen Tian, Songning Lai, Hongru Xiao, Erhang Zhang, Lei Wang, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02452
Pdf URL: https://arxiv.org/pdf/2506.02452
Copy Paste: [[2506.02452]] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model(https://arxiv.org/abs/2506.02452)
Keywords: diffusion
Abstract: While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose ANT, an Adaptive Neural Temporal-Aware architecture. ANT orchestrates semantic granularity through: (i) Semantic Temporally Adaptive (STA) Module: automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. (ii) Dynamic Classifier-Free Guidance scheduling (DCFG): adaptively adjusts the conditional-to-unconditional ratio, enhancing efficiency while maintaining fidelity. (iii) Temporal-semantic reweighting: quantitatively aligns text influence with phase requirements. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
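
As a rough illustration of what a step-dependent guidance schedule looks like, the sketch below combines the standard classifier-free guidance rule, `eps = eps_uncond + w * (eps_cond - eps_uncond)`, with a guidance scale `w` that varies across denoising steps. The linear schedule, its direction, and the `w_start`/`w_end` values are assumptions made for illustration only; the abstract does not specify how ANT's scheduling is actually defined.

```python
def guidance_scale(step, num_steps, w_start=2.0, w_end=7.0):
    """Hypothetical linear schedule: interpolate the guidance scale over the steps."""
    if num_steps < 2:
        return w_end
    frac = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return w_start + frac * (w_end - w_start)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance on two noise predictions (plain lists)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions; a real model would produce tensors, not short lists.
eps_u = [0.0, 1.0]
eps_c = [1.0, 1.0]
num_steps = 5
guided = [
    cfg_combine(eps_u, eps_c, guidance_scale(t, num_steps))
    for t in range(num_steps)
]
```

With a fixed `w` this reduces to ordinary classifier-free guidance; the schedule only changes how strongly the text condition is weighted at each step.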

Title: ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment

Authors: Martin JJ. Bucher, Iro Armeni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02459
Pdf URL: https://arxiv.org/pdf/2506.02459
Copy Paste: [[2506.02459]] ReSpace: Text-Driven 3D Scene Synthesis and Editing with Preference Alignment(https://arxiv.org/abs/2506.02459)
Keywords: diffusion, generative
Abstract: Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. In contrast, LLM-based methods enable richer semantics via natural language (e.g., 'modern studio with light wood furniture') but do not support editing, remain limited to rectangular layouts or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a novel voxelization-based evaluation that captures fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on object addition while maintaining competitive results on full scene synthesis.

Title: Generative Perception of Shape and Material from Differential Motion

Authors: Xinran Nicole Han, Ko Nishino, Todd Zickler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02473
Pdf URL: https://arxiv.org/pdf/2506.02473
Copy Paste: [[2506.02473]] Generative Perception of Shape and Material from Differential Motion(https://arxiv.org/abs/2506.02473)
Keywords: diffusion, generative
Abstract: Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically-embodied systems.

Title: Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay

Authors: Kunyu Wang, Xueyang Fu, Chengzhi Cao, Chengjie Ge, Wei Zhai, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02477
Pdf URL: https://arxiv.org/pdf/2506.02477
Copy Paste: [[2506.02477]] Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay(https://arxiv.org/abs/2506.02477)
Keywords: generative
Abstract: Current image de-raining methods primarily learn from a limited dataset, leading to inadequate performance in varied real-world rainy conditions. To tackle this, we introduce a new framework that enables networks to progressively expand their de-raining knowledge base by tapping into a growing pool of datasets, significantly boosting their adaptability. Drawing inspiration from the human brain's ability to continuously absorb and generalize from ongoing experiences, our approach borrows the mechanism of the complementary learning system. Specifically, we first deploy Generative Adversarial Networks (GANs) to capture and retain the unique features of new data, mirroring the hippocampus's role in learning and memory. Then, the de-raining network is trained with both existing and GAN-synthesized data, mimicking the process of hippocampal replay and interleaved learning. Furthermore, we employ knowledge distillation with the replayed data to replicate the synergy between the neocortex's activity patterns triggered by hippocampal replays and the pre-existing neocortical knowledge. This comprehensive framework empowers the de-raining network to amass knowledge from various datasets, continually enhancing its performance on previously unseen rainy scenes. Our testing on three benchmark de-raining networks confirms the framework's effectiveness. It not only facilitates continuous knowledge accumulation across six datasets but also surpasses state-of-the-art methods in generalizing to new real-world scenarios.

Title: Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models

Authors: Hongtao Huang, Xiaojun Chang, Lina Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02488
Pdf URL: https://arxiv.org/pdf/2506.02488
Copy Paste: [[2506.02488]] Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models(https://arxiv.org/abs/2506.02488)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but are constrained by high computational costs due to iterative multi-step inference. While Neural Architecture Search (NAS) can optimize DMs, existing methods are hindered by retraining requirements, exponential search complexity from step-wise optimization, and slow evaluation relying on massive image generation. To address these challenges, we propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters. Our key insight is to decompose the generation process into flexible segments of equal length, where each segment dynamically combines three step types: full (complete computation), partial (cache-reused computation), and null (skipped computation). This segment-wise search space reduces the candidate pool exponentially compared to step-wise NAS while preserving architectural diversity. Further, we introduce relative FID (rFID), a lightweight evaluation metric for NAS that measures divergence from a teacher model's outputs instead of ground truth, slashing evaluation time by over $90\%$. In practice, Flexiffusion achieves at least $2\times$ acceleration across LDMs, Stable Diffusion, and DDPMs on ImageNet and MS-COCO, with FID degradation under $5\%$, outperforming prior NAS and caching methods. Notably, it attains $5.1\times$ speedup on Stable Diffusion with near-identical CLIP scores. Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
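
The idea of scoring a candidate against a teacher model's outputs rather than ground truth can be sketched with a simplified Fréchet distance. The abstract does not give the exact definition of rFID, so the code below is only an assumption-laden stand-in: it treats each feature dimension independently (diagonal covariance), in which case the Fréchet distance between two Gaussians reduces to the sum of squared differences of means plus squared differences of standard deviations. The feature vectors are toy values; standard FID would take them from an image embedding network.

```python
import statistics


def diag_gaussian_stats(features):
    """Per-dimension mean and population standard deviation of feature vectors."""
    dims = list(zip(*features))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds


def frechet_distance_diag(feats_a, feats_b):
    """Fréchet distance between two Gaussians under a diagonal-covariance assumption."""
    mu_a, sd_a = diag_gaussian_stats(feats_a)
    mu_b, sd_b = diag_gaussian_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return mean_term + cov_term


# Hypothetical features of teacher outputs and of an accelerated candidate's outputs.
teacher_feats = [[0.0, 0.0], [2.0, 2.0]]
candidate_feats = [[1.0, 1.0], [3.0, 3.0]]
relative_score = frechet_distance_diag(candidate_feats, teacher_feats)
```

Because the reference set is the teacher's own outputs, no real-image dataset is needed at evaluation time, which is consistent with the abstract's claim of faster evaluation.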

Title: LumosFlow: Motion-Guided Long Video Generation

Authors: Jiahao Chen, Hangjie Yuan, Yichen Qian, Jingyun Liang, Jiazheng Xing, Pengwei Liu, Weihua Chen, Fan Wang, Bing Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02497
Pdf URL: https://arxiv.org/pdf/2506.02497
Copy Paste: [[2506.02497]] LumosFlow: Motion-Guided Long Video Generation(https://arxiv.org/abs/2506.02497)
Keywords: diffusion
Abstract: Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both approaches still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that introduces motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15x interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance. Our project page: this https URL

Title: KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG

Authors: Yongjian Li, HaoCheng Chu, Yukun Yan, Zhenghao Liu, Shi Yu, Zheni Zeng, Ruobing Wang, Sen Song, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02503
Pdf URL: https://arxiv.org/pdf/2506.02503
Copy Paste: [[2506.02503]] KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG(https://arxiv.org/abs/2506.02503)
Keywords: generative
Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access broader knowledge sources, yet factual inconsistencies persist due to noise in retrieved documents, even with advanced retrieval methods. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. In this paper, we present KARE-RAG (Knowledge-Aware Refinement and Enhancement for RAG), which improves knowledge utilization through three key innovations: (1) structured knowledge representations that facilitate error detection during training; (2) Dense Direct Preference Optimization (DDPO), a refined training objective that prioritizes correction of critical errors; and (3) a contrastive data generation pipeline that maintains semantic consistency while rectifying factual inaccuracies. Experiments show our method significantly enhances standard RAG pipelines across model scales, improving both in-domain and out-of-domain task performance without compromising general capabilities. Notably, these gains are achieved with modest training data, suggesting data-efficient optimization is possible through targeted learning strategies. Our findings establish a new direction for RAG improvement: by improving how models learn to process retrieved content, we can enhance performance across diverse inference paradigms. All data and code will be publicly available on GitHub.

Title: RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers

Authors: Yan Gong, Yiren Song, Yicheng Li, Chenglin Li, Yin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02528
Pdf URL: https://arxiv.org/pdf/2506.02528
Copy Paste: [[2506.02528]] RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers(https://arxiv.org/abs/2506.02528)
Keywords: diffusion, in-context
Abstract: Inspired by the in-context learning mechanism of large language models (LLMs), a new paradigm of generalizable visual prompt-based image editing is emerging. Existing single-reference methods typically focus on style or appearance adjustments and struggle with non-rigid transformations. To address these limitations, we propose leveraging source-target image pairs to extract and transfer content-aware editing intent to novel query images. To this end, we introduce RelationAdapter, a lightweight module that enables Diffusion Transformer (DiT) based models to effectively capture and apply visual transformations from minimal examples. We also introduce Relation252K, a comprehensive dataset comprising 218 diverse editing tasks, to evaluate model generalization and adaptability in visual prompt-driven scenarios. Experiments on Relation252K show that RelationAdapter significantly improves the model's ability to understand and transfer editing intent, leading to notable gains in generation quality and overall editing performance.

Title: MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection

Authors: Juntong Li, Lingwei Dang, Yukun Su, Yun Hao, Qingxin Xiao, Yongwei Nie, Qingyao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02535
Pdf URL: https://arxiv.org/pdf/2506.02535
Copy Paste: [[2506.02535]] MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection(https://arxiv.org/abs/2506.02535)
Keywords: anomaly
Abstract: Video Anomaly Detection (VAD) methods based on reconstruction or prediction face two critical challenges: (1) strong generalization capability often results in accurate reconstruction or prediction of abnormal events, making it difficult to distinguish normal from abnormal patterns; (2) reliance only on low-level appearance and motion cues limits their ability to identify high-level semantics in abnormal events from complex scenes. To address these limitations, we propose a novel VAD framework with two key innovations. First, to suppress excessive generalization, we introduce the Sparse Feature Filtering Module (SFFM) that employs bottleneck filters to dynamically and adaptively remove abnormal information from features. Unlike traditional memory modules, it does not need to memorize the normal prototypes across the training dataset. Further, we design the Mixture of Experts (MoE) architecture for SFFM. Each expert is responsible for extracting specialized principal features at runtime, and different experts are selectively activated to ensure the diversity of the learned principal features. Second, to overcome the neglect of semantics in existing methods, we integrate a Vision-Language Model (VLM) to generate textual descriptions for video clips, enabling comprehensive joint modeling of semantic, appearance, and motion cues. Additionally, we enforce modality consistency through semantic similarity constraints and motion frame-difference contrastive loss. Extensive experiments on multiple public datasets validate the effectiveness of our multimodal joint modeling framework and sparse feature filtering paradigm. Project page at this https URL.

Title: Rethinking Post-Unlearning Behavior of Large Vision-Language Models

Authors: Minsung Kim, Nakyeong Yang, Kyomin Jung
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.02541
Pdf URL: https://arxiv.org/pdf/2506.02541
Copy Paste: [[2506.02541]] Rethinking Post-Unlearning Behavior of Large Vision-Language Models(https://arxiv.org/abs/2506.02541)
Keywords: generative
Abstract: Machine unlearning is used to mitigate the privacy risks of Large Vision-Language Models (LVLMs) arising from training on large-scale web data. However, existing unlearning methods often fail to carefully select substitute outputs for forget targets, resulting in Unlearning Aftermaths: undesirable behaviors such as degenerate, hallucinated, or excessively refused responses. We highlight that, especially for generative LVLMs, it is crucial to consider the quality and informativeness of post-unlearning responses rather than relying solely on naive suppression. To address this, we introduce a new unlearning task for LVLMs that requires models to provide privacy-preserving yet informative and visually grounded responses. We also propose PUBG, a novel unlearning method that explicitly guides post-unlearning behavior toward a desirable output distribution. Experiments show that, while existing methods suffer from Unlearning Aftermaths despite successfully preventing privacy violations, PUBG effectively mitigates these issues, generating visually grounded and informative responses without privacy leakage for forgotten targets.

Title: Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

Authors: Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02550
Pdf URL: https://arxiv.org/pdf/2506.02550
Copy Paste: [[2506.02550]] Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025(https://arxiv.org/abs/2506.02550)
Keywords: foundation model
Abstract: In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at this https URL.
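
Illustration: the abstract says a verb-noun co-occurrence matrix is incorporated to enhance recognition accuracy, but does not say how. The sketch below shows one plausible reading, in which independent verb and noun probabilities are rescored by a co-occurrence weight so that implausible pairs are suppressed. The function name `best_pair`, the multiplicative scoring rule, and all numbers are assumptions made for illustration, not the authors' method.

```python
def best_pair(verb_probs, noun_probs, cooc):
    """Return the (verb, noun) pair with the highest rescored probability.

    score = P(verb) * P(noun) * cooc[(verb, noun)], where cooc is a
    hypothetical weight in [0, 1] estimated from training annotations.
    Pairs missing from cooc are treated as never co-occurring.
    """
    best, best_score = None, -1.0
    for verb, p_verb in verb_probs.items():
        for noun, p_noun in noun_probs.items():
            score = p_verb * p_noun * cooc.get((verb, noun), 0.0)
            if score > best_score:
                best, best_score = (verb, noun), score
    return best


# Toy classifier outputs (made-up numbers).
verb_probs = {"cut": 0.55, "pour": 0.45}
noun_probs = {"water": 0.60, "onion": 0.40}
cooc = {
    ("cut", "onion"): 1.0,
    ("pour", "water"): 1.0,
    ("cut", "water"): 0.05,
    ("pour", "onion"): 0.05,
}

# Independent argmax would pick the implausible pair ("cut", "water").
independent = (max(verb_probs, key=verb_probs.get), max(noun_probs, key=noun_probs.get))
rescored = best_pair(verb_probs, noun_probs, cooc)
```

In this toy case the independent prediction is ("cut", "water"), while rescoring prefers a pair that the co-occurrence weights consider plausible.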

Title: SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Authors: Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, Xiaochun Cao, Yutong Ban, Qi Dou, Yang Liu, Yueming Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02555
Pdf URL: https://arxiv.org/pdf/2506.02555
Copy Paste: [[2506.02555]] SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence(https://arxiv.org/abs/2506.02555)
Keywords: foundation model
Abstract: Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, where this single universal model can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, followed by standardizing labels and doing hierarchical vision-language alignment to facilitate comprehensive coverage of gradually finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL and undergoes instruction tuning on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 popular and widely used datasets in the surgical domain, covering several crucial downstream tasks. Based on SurgVLM-Bench, we evaluate the performance of our SurgVLM (3 SurgVLM variants: SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B), and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).

Title: Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models

Authors: Shizhan Gong, Yankai Jiang, Qi Dou, Farzan Farnia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02557
Pdf URL: https://arxiv.org/pdf/2506.02557
Copy Paste: [[2506.02557]] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models(https://arxiv.org/abs/2506.02557)
Keywords: foundation model
Abstract: Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.

Title: DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing

Authors: Zixiang Li, Haoyu Wang, Wei Wang, Chuangchuang Tan, Yunchao Wei, Yao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02560
Pdf URL: https://arxiv.org/pdf/2506.02560
Copy Paste: [[2506.02560]] DCI: Dual-Conditional Inversion for Boosting Diffusion-Based Image Editing(https://arxiv.org/abs/2506.02560)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in image generation and editing tasks. Inversion within these models aims to recover the latent noise representation for a real or generated image, enabling reconstruction, editing, and other downstream tasks. However, to date, most inversion approaches suffer from an intrinsic trade-off between reconstruction accuracy and editing flexibility. This limitation arises from the difficulty of maintaining both semantic alignment and structural consistency during the inversion process. In this work, we introduce Dual-Conditional Inversion (DCI), a novel framework that jointly conditions on the source prompt and reference image to guide the inversion process. Specifically, DCI formulates the inversion process as a dual-condition fixed-point optimization problem, minimizing both the latent noise gap and the reconstruction error under the joint guidance. This design anchors the inversion trajectory in both semantic and visual space, leading to more accurate and editable latent representations. Our novel setup brings new understanding to the inversion process. Extensive experiments demonstrate that DCI achieves state-of-the-art performance across multiple editing tasks, significantly improving both reconstruction quality and editing precision. Furthermore, we also demonstrate that our method achieves strong results in reconstruction tasks, implying a degree of robustness and generalizability approaching the ultimate goal of the inversion process.

Title: Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning

Authors: Sarenne Wallbridge, Christoph Minixhofer, Catherine Lai, Peter Bell
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2506.02584
Pdf URL: https://arxiv.org/pdf/2506.02584
Copy Paste: [[2506.02584]] Prosodic Structure Beyond Lexical Content: A Study of Self-Supervised Learning(https://arxiv.org/abs/2506.02584)
Keywords: self-supervised
Abstract: People exploit the predictability of lexical structures during text comprehension. Though predictable structure is also present in speech, the degree to which prosody, e.g. intonation, tempo, and loudness, contributes to such structure independently of the lexical content is unclear. This study leverages self-supervised learning (SSL) to examine the temporal granularity of structures in the acoustic correlates of prosody. Representations from our proposed Masked Prosody Model can predict perceptual labels dependent on local information, such as word boundaries, but provide the most value for labels involving longer-term structures, like emotion recognition. Probing experiments across various perceptual labels show strong relative gains over untransformed pitch, energy, and voice activity features. Our results reveal the importance of SSL training objective timescale and highlight the value of complex SSL-encoded structures compared to more constrained classical structures.

Title: Hyperspectral Image Generation with Unmixing Guided Diffusion Model

Authors: Shiyu Shen, Bin Pan, Ziye Zhang, Zhenwei Shi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02601
Pdf URL: https://arxiv.org/pdf/2506.02601
Copy Paste: [[2506.02601]] Hyperspectral Image Generation with Unmixing Guided Diffusion Model(https://arxiv.org/abs/2506.02601)
Keywords: diffusion, generative
Abstract: Recently, hyperspectral image generation has received increasing attention, but existing generative models rely on conditional generation schemes, which limits the diversity of generated images. Diffusion models are popular for their ability to generate high-quality samples, but adapting these models from RGB to hyperspectral data presents the challenge of high dimensionality and physical constraints. To address these challenges, we propose a novel diffusion model guided by hyperspectral unmixing. Our model comprises two key modules: an unmixing autoencoder module and an abundance diffusion module. The unmixing autoencoder module leverages unmixing guidance to shift the generative task from the image space to the low-dimensional abundance space, significantly reducing computational complexity while preserving high fidelity. The abundance diffusion module generates samples that satisfy the constraints of non-negativity and unity, ensuring the physical consistency of the reconstructed HSIs. Additionally, we introduce two evaluation metrics tailored to hyperspectral data. Empirical results, evaluated using both traditional metrics and our proposed metrics, indicate that our model is capable of generating high-quality and diverse hyperspectral images, offering an advancement in hyperspectral data generation.
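
Illustration: the non-negativity and unity (sum-to-one) constraints on abundances can be made concrete with a small sketch. It assumes the standard linear mixing model, in which a pixel spectrum is a weighted sum of endmember spectra, and uses a softmax as one simple way to obtain abundances that satisfy both constraints by construction. The endmember values and the use of softmax are assumptions for illustration; the abstract does not state how the abundance diffusion module enforces the constraints.

```python
import math


def softmax(logits):
    """Map unconstrained scores to non-negative weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(abundances, endmembers):
    """Linear mixing model: pixel[b] = sum_k abundances[k] * endmembers[k][b]."""
    n_bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(n_bands)
    ]


# Two made-up endmember spectra over three spectral bands.
endmembers = [
    [0.1, 0.4, 0.9],
    [0.8, 0.5, 0.2],
]
abundances = softmax([2.0, -1.0])  # unconstrained scores -> valid abundances
pixel = mix(abundances, endmembers)
```

Because the abundances form a convex combination, every band of the reconstructed pixel lies between the smallest and largest endmember value for that band, which is the kind of physical consistency the constraints are meant to guarantee.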

Title: One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation

Authors: Xue Wu, Jingwei Xin, Zhijun Tu, Jie Hu, Jie Li, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02605
Pdf URL: https://arxiv.org/pdf/2506.02605
Copy Paste: [[2506.02605]] One-Step Diffusion-based Real-World Image Super-Resolution with Visual Perception Distillation(https://arxiv.org/abs/2506.02605)
Keywords: diffusion
Abstract: Diffusion-based models have been widely used in various visual generation tasks, showing promising results in image super-resolution (SR), while typically being limited by dozens or even hundreds of sampling steps. Although existing methods aim to accelerate the inference speed of multi-step diffusion-based SR methods through knowledge distillation, their generated images exhibit insufficient semantic alignment with real images, resulting in suboptimal perceptual quality reconstruction, specifically reflected in the CLIPIQA score. These methods still face many challenges in perceptual quality and semantic fidelity. To address these challenges, we propose VPD-SR, a novel visual perception diffusion distillation framework specifically designed for SR, aiming to construct an effective and efficient one-step SR model. Specifically, VPD-SR consists of two components: Explicit Semantic-aware Supervision (ESS) and High-Frequency Perception (HFP) loss. First, the ESS leverages the powerful visual perceptual understanding capabilities of the CLIP model to extract explicit semantic supervision, thereby enhancing semantic consistency. Then, considering that high-frequency information contributes to the visual perception quality of images, in addition to the vanilla distillation loss, the HFP loss guides the student model to restore the missing high-frequency details in degraded images that are critical for enhancing perceptual quality. Lastly, we expand VPD-SR in an adversarial training manner to further enhance the authenticity of the generated content. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed VPD-SR achieves superior performance compared to both previous state-of-the-art methods and the teacher model with just one-step sampling.

Title: Simple, Good, Fast: Self-Supervised World Models Free of Baggage

Authors: Jan Robine, Marc Höftmann, Stefan Harmeling
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.02612
Pdf URL: https://arxiv.org/pdf/2506.02612
Copy Paste: [[2506.02612]] Simple, Good, Fast: Self-Supervised World Models Free of Baggage(https://arxiv.org/abs/2506.02612)
Keywords: self-supervised
Abstract: What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF's connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark.
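
Illustration: frame and action stacking, which the abstract names as the way SGF captures short-time dependencies, can be sketched with two fixed-length buffers. The class name `StackedHistory`, the window size `k`, and the no-op padding action are hypothetical choices for this sketch; the abstract does not give SGF's actual configuration.

```python
from collections import deque


class StackedHistory:
    """Keeps the last k frames and the last k actions as model input."""

    def __init__(self, k, first_frame, noop_action=0):
        # Pad the start of an episode by repeating the first frame
        # and a no-op action (an assumed convention).
        self.frames = deque([first_frame] * k, maxlen=k)
        self.actions = deque([noop_action] * k, maxlen=k)

    def push(self, frame, action):
        # Appending to a full deque silently drops the oldest entry.
        self.frames.append(frame)
        self.actions.append(action)

    def observation(self):
        # Oldest entry first, newest entry last.
        return list(self.frames), list(self.actions)


hist = StackedHistory(k=4, first_frame="f0")
for t in range(1, 4):
    hist.push(f"f{t}", action=t)
frames, actions = hist.observation()
```

Strings stand in for image arrays here; with real frames the stacked list would typically be concatenated along the channel axis before being passed to the encoder.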

Title: HGOT: Self-supervised Heterogeneous Graph Neural Network with Optimal Transport

Authors: Yanbei Liu, Chongxu Wang, Zhitao Xiao, Lei Geng, Yanwei Pang, Xiao Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02619
Pdf URL: https://arxiv.org/pdf/2506.02619
Copy Paste: [[2506.02619]] HGOT: Self-supervised Heterogeneous Graph Neural Network with Optimal Transport(https://arxiv.org/abs/2506.02619)
Keywords: self-supervised
Abstract: Heterogeneous Graph Neural Networks (HGNNs) have demonstrated excellent capabilities in processing heterogeneous information networks. Self-supervised learning on heterogeneous graphs, especially contrastive self-supervised strategy, shows great potential when there are no labels. However, this approach requires the use of carefully designed graph augmentation strategies and the selection of positive and negative samples. Determining the exact level of similarity between sample pairs is difficult. To solve this problem, we propose a novel self-supervised Heterogeneous graph neural network with Optimal Transport (HGOT) method which is designed to facilitate self-supervised learning for heterogeneous graphs without graph augmentation strategies. Different from traditional contrastive self-supervised learning, HGOT employs the optimal transport mechanism to relieve the laborious sampling process of positive and negative samples. Specifically, we design an aggregating view (central view) to integrate the semantic information contained in the views represented by different meta-paths (branch views). Then, we introduce an optimal transport plan to identify the transport relationship between the semantics contained in the branch view and the central view. This allows the optimal transport plan between graphs to align with the representations, forcing the encoder to learn node representations that are more similar to the graph space and of higher quality. Extensive experiments on four real-world datasets demonstrate that our proposed HGOT model can achieve state-of-the-art performance on various downstream tasks. In particular, in the node classification task, HGOT achieves an average of more than 6% improvement in accuracy compared with state-of-the-art methods.

Title: Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies

Authors: Ada Sawilska, Mateusz Trokielewicz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02626
Pdf URL: https://arxiv.org/pdf/2506.02626
Copy Paste: [[2506.02626]] Synthetic Iris Image Databases and Identity Leakage: Risks and Mitigation Strategies(https://arxiv.org/abs/2506.02626)
Keywords: diffusion, generative
Abstract: This paper presents a comprehensive overview of iris image synthesis methods, which can alleviate the issues associated with gathering large, diverse datasets of biometric data from living individuals, which are considered pivotal for biometric methods development. These methods for synthesizing iris data range from traditional, handcrafted image processing-based techniques, through various iterations of GAN-based image generators and variational autoencoders (VAEs), to diffusion models. The potential and fidelity in iris image generation of each method are discussed and examples of inferred predictions are provided. Furthermore, the risks of leakage of individual biometric features from the training sets are considered, together with possible strategies for preventing them, which have to be implemented should these generative methods be considered a valid replacement of real-world biometric datasets.

Title: ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration

Authors: Cheng Yang, Lijing Liang, Zhixun Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02633
Pdf URL: https://arxiv.org/pdf/2506.02633
Copy Paste: [[2506.02633]] ControlMambaIR: Conditional Controls with State-Space Model for Image Restoration(https://arxiv.org/abs/2506.02633)
Keywords: diffusion
Abstract: This paper proposes ControlMambaIR, a novel image restoration method designed to address perceptual challenges in image deraining, deblurring, and denoising tasks. By integrating the Mamba network architecture with the diffusion model, the condition network achieves refined conditional control, thereby enhancing the control and optimization of the image generation process. To evaluate the robustness and generalization capability of our method across various image degradation conditions, extensive experiments were conducted on several benchmark datasets, including Rain100H, Rain100L, GoPro, and SSID. The results demonstrate that our proposed approach consistently surpasses existing methods in perceptual quality metrics, such as LPIPS and FID, while maintaining comparable performance in image distortion metrics, including PSNR and SSIM, highlighting its effectiveness and adaptability. Notably, ablation experiments reveal that direct noise prediction in the diffusion process achieves better performance, effectively balancing noise suppression and detail preservation. Furthermore, the findings indicate that the Mamba architecture is particularly well-suited as a conditional control network for diffusion models, outperforming both CNN- and Attention-based approaches in this context. Overall, these results highlight the flexibility and effectiveness of ControlMambaIR in addressing a range of image restoration perceptual challenges.

Title: Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

Authors: Xiao Chen, Jiazhen Huang, Qinting Jiang, Fanding Huang, Xianghua Fu, Jingyan Jiang, Zhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02671
Pdf URL: https://arxiv.org/pdf/2506.02671
Copy Paste: [[2506.02671]] Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet(https://arxiv.org/abs/2506.02671)
Keywords: self-supervised
Abstract: Test-time adaptation (TTA) has emerged as a critical technique for enhancing the generalization capability of vision-language models (VLMs) during inference. However, existing approaches often incur substantial computational costs and exhibit poor scalability, primarily due to sample-wise adaptation granularity and reliance on costly auxiliary designs such as data augmentation. To address these limitations, we introduce SAIL (Small Aid, Big Leap), a novel adapter-based TTA framework that leverages a lightweight, learnable AdaptNet to enable efficient and scalable model adaptation. At SAIL's core, a frozen pre-trained VLM collaborates with AdaptNet through a confidence-based interpolation weight, generating robust predictions during inference. These predictions serve as self-supervised targets to align AdaptNet's outputs through efficient batch-wise processing, dramatically reducing computational costs without modifying the VLM or requiring memory caches. To mitigate catastrophic forgetting during continual adaptation, we propose a gradient-aware reset strategy driven by a gradient drift indicator (GDI), which dynamically detects domain transitions and strategically resets AdaptNet for stable adaptation. Extensive experiments across diverse benchmarks on two scenarios demonstrate that SAIL achieves state-of-the-art performance while maintaining low computational costs. These results highlight SAIL's effectiveness, efficiency and scalability for real-world deployment. The code will be released upon acceptance.
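
Illustration: a confidence-based interpolation between a frozen model and a small adapter can be sketched as below. The abstract does not define the weight, so this sketch assumes each model's confidence is its maximum softmax probability and that the weight is the frozen model's share of the total confidence. The function name `blend` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend(vlm_logits, adapt_logits):
    """Interpolate two predictions, trusting the more confident model more.

    Assumed rule: w = c_vlm / (c_vlm + c_adapt), where each c is the
    maximum class probability of the corresponding model.
    """
    p_vlm = softmax(vlm_logits)
    p_adapt = softmax(adapt_logits)
    c_vlm, c_adapt = max(p_vlm), max(p_adapt)
    w = c_vlm / (c_vlm + c_adapt)
    return [w * a + (1.0 - w) * b for a, b in zip(p_vlm, p_adapt)]


# The frozen model is confident about class 0; the adapter weakly prefers class 1.
blended = blend([4.0, 0.0, 0.0], [0.0, 1.0, 0.0])
predicted_class = blended.index(max(blended))
```

Since the result is a convex combination of two probability vectors, it is itself a valid distribution, and in a SAIL-style setup it could then serve as the self-supervised target for the adapter.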

Title: Solving Inverse Problems with FLAIR

Authors: Julius Erbach, Dominik Narnhofer, Andreas Dombos, Bernt Schiele, Jan Eric Lenssen, Konrad Schindler
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02680
Pdf URL: https://arxiv.org/pdf/2506.02680
Copy Paste: [[2506.02680]] Solving Inverse Problems with FLAIR(https://arxiv.org/abs/2506.02680)
Keywords: diffusion, generative
Abstract: Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the encoding into a lower-dimensional latent space makes the underlying (forward) mapping non-linear; (ii) the data likelihood term is usually intractable; and (iii) learned generative models struggle to recover rare, atypical data modes during inference. We present FLAIR, a novel training-free variational framework that leverages flow-based generative models as a prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to recover atypical modes. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.

Title: Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

Authors: Shu Yang, Fengtao Zhou, Leon Mayer, Fuxiang Huang, Yiliang Chen, Yihui Wang, Sunan He, Yuxiang Nie, Xi Wang, Ömer Sümer, Yueming Jin, Huihui Sun, Shuchang Xu, Alex Qinyang Liu, Zheng Li, Jing Qin, Jeremy YuenChun Teoh, Lena Maier-Hein, Hao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02692
Pdf URL: https://arxiv.org/pdf/2506.02692
Copy Paste: [[2506.02692]] Large-scale Self-supervised Video Foundation Model for Intelligent Surgery(https://arxiv.org/abs/2506.02692)
Keywords: self-supervised, foundation model
Abstract: Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, demonstrating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.

Title: LayoutRAG: Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation

Authors: Yuxuan Wu, Le Wang, Sanping Zhou, Mengnan Liu, Gang Hua, Haoxiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02697
Pdf URL: https://arxiv.org/pdf/2506.02697
Copy Paste: [[2506.02697]] LayoutRAG: Retrieval-Augmented Model for Content-agnostic Conditional Layout Generation(https://arxiv.org/abs/2506.02697)
Keywords: diffusion
Abstract: Controllable layout generation aims to create plausible visual arrangements of element bounding boxes within a graphic design according to certain optional constraints, such as the type or position of a specific component. While recent diffusion or flow-matching models have achieved considerable advances in multifarious conditional generation tasks, there remains considerable room for generating optimal arrangements under given conditions. In this work, we propose to carry out layout generation through condition-based retrieval and reference-guided generation. Specifically, we retrieve appropriate layout templates according to given conditions as references. The references are then utilized to guide the denoising or flow-based transport process. By retrieving layouts compatible with the given conditions, we can uncover the potential information not explicitly provided in the given condition. Such an approach offers more effective guidance to the model during the generation process, in contrast to previous models that feed the condition to the model and let the model infer the unprovided layout attributes directly. Meanwhile, we design a condition-modulated attention that selectively absorbs retrieval knowledge, adapting to the difference between retrieved templates and given conditions. Extensive experimental results show that our method successfully produces high-quality layouts that meet the given conditions and outperforms existing state-of-the-art models. Code will be released upon acceptance.

Title: Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences

Authors: Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02698
Pdf URL: https://arxiv.org/pdf/2506.02698
Copy Paste: [[2506.02698]] Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences(https://arxiv.org/abs/2506.02698)
Keywords: diffusion
Abstract: Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: preferences vary across individuals and should be represented with more granularity. To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Our SmPO-Diffusion achieves state-of-the-art performance in preference evaluation, outperforming baselines across metrics with lower training costs. The project page is this https URL.

Title: Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection

Authors: Ruiying Lu, Jinhan Liu, Chuan Du, Dandan Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02757
Pdf URL: https://arxiv.org/pdf/2506.02757
Copy Paste: [[2506.02757]] Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection(https://arxiv.org/abs/2506.02757)
Keywords: anomaly
Abstract: Tabular anomaly detection, which aims at identifying deviant samples, has been crucial in a variety of real-world applications, such as medical disease identification, financial fraud detection, intrusion monitoring, etc. Although recent deep learning-based methods have achieved competitive performances, these methods suffer from representation entanglement and the lack of global correlation modeling, which hinders anomaly detection performance. To tackle the problem, we incorporate mask modeling and prototype learning into tabular anomaly detection. The core idea is to design learnable masks by disentangled representation learning within a projection space and extracting normal dependencies as explicit global prototypes. Specifically, the overall model involves two parts: (i) During encoding, we perform mask modeling in both the data space and projection space with orthogonal basis vectors for learning shared disentangled normal patterns; (ii) During decoding, we decode multiple masked representations in parallel for reconstruction and learn association prototypes to extract normal characteristic correlations. Our proposal derives from a distribution-matching perspective, where both projection space learning and association prototype learning are formulated as optimal transport problems, and the calibration distances are utilized to refine the anomaly scores. Quantitative and qualitative experiments on 20 tabular benchmarks demonstrate the effectiveness and interpretability of our model.

Title: Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs

Authors: Stefano Bannò, Kate Knill, Mark Gales
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.02758
Pdf URL: https://arxiv.org/pdf/2506.02758
Copy Paste: [[2506.02758]] Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs(https://arxiv.org/abs/2506.02758)
Keywords: in-context
Abstract: Vocabulary use is a fundamental aspect of second language (L2) proficiency. To date, its assessment by automated systems has typically examined the context-independent, or part-of-speech (PoS) related use of words. This paper introduces a novel approach to enable fine-grained vocabulary evaluation exploiting the precise use of words within a sentence. The scheme combines large language models (LLMs) with the English Vocabulary Profile (EVP). The EVP is a standard lexical resource that enables in-context vocabulary use to be linked with proficiency level. We evaluate the ability of LLMs to assign proficiency levels to individual words as they appear in L2 learner writing, addressing key challenges such as polysemy, contextual variation, and multi-word expressions. We compare LLMs to a PoS-based baseline. LLMs appear to exploit additional semantic information that yields improved performance. We also explore correlations between word-level proficiency and essay-level proficiency. Finally, the approach is applied to examine the consistency of the EVP proficiency levels. Results show that LLMs are well-suited for the task of vocabulary assessment.

Title: FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

Authors: Tongyuan Bai, Wangyuanfan Bai, Dong Chen, Tieru Wu, Manyi Li, Rui Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02781
Pdf URL: https://arxiv.org/pdf/2506.02781
Copy Paste: [[2506.02781]] FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts(https://arxiv.org/abs/2506.02781)
Keywords: diffusion
Abstract: Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow rough language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs including text description and/or reference images, allowing users to express versatile design intentions. The user inputs are adequately analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. Our MG-DiT not only excels at preserving graph structure but also offers broad applicability to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in terms of both generation quality and controllability in a range of applications.

Title: CART-based Synthetic Tabular Data Generation for Imbalanced Regression

Authors: António Pedro Pinheiro, Rita P. Ribeiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02811
Pdf URL: https://arxiv.org/pdf/2506.02811
Copy Paste: [[2506.02811]] CART-based Synthetic Tabular Data Generation for Imbalanced Regression(https://arxiv.org/abs/2506.02811)
Keywords: generative
Abstract: Handling imbalanced target distributions in regression tasks remains a significant challenge in tabular data settings where underrepresented regions can hinder model performance. Among data-level solutions, some proposals, such as random sampling and SMOTE-based approaches, propose adapting classification techniques to regression tasks. However, these methods typically rely on crisp, artificial thresholds over the target variable, a limitation inherited from classification settings that can introduce arbitrariness, often leading to non-intuitive and potentially misleading problem formulations. While recent generative models, such as GANs and VAEs, provide flexible sample synthesis, they come with high computational costs and limited interpretability. In this study, we propose adapting an existing CART-based synthetic data generation method, tailoring it for imbalanced regression. The new method integrates relevance and density-based mechanisms to guide sampling in sparse regions of the target space and employs a threshold-free, feature-driven generation process. Our experimental study focuses on the prediction of extreme target values across benchmark datasets. The results indicate that the proposed method is competitive with other resampling and generative strategies in terms of performance, while offering faster execution and greater transparency. These results highlight the method's potential as a transparent, scalable data-level strategy for improving regression models in imbalanced domains.

Title: Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection

Authors: Luca Maiano, Fabrizio Casadei, Irene Amerini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02857
Pdf URL: https://arxiv.org/pdf/2506.02857
Copy Paste: [[2506.02857]] Enhancing Abnormality Identification: Robust Out-of-Distribution Strategies for Deepfake Detection(https://arxiv.org/abs/2506.02857)
Keywords: generative
Abstract: Detecting deepfakes has become a critical challenge in Computer Vision and Artificial Intelligence. Despite significant progress in detection techniques, generalizing them to open-set scenarios continues to be a persistent difficulty. Neural networks are often trained on the closed-world assumption, but with new generative models constantly evolving, encountering data generated by models that are not part of the training distribution is inevitable. To address these challenges, in this paper, we propose two novel Out-Of-Distribution (OOD) detection approaches. The first approach is trained to reconstruct the input image, while the second incorporates an attention mechanism for detecting OODs. Our experiments validate the effectiveness of the proposed approaches compared to existing state-of-the-art techniques. Our methods achieve promising results in deepfake detection and rank among the top-performing configurations on the benchmark, demonstrating their potential for robust, adaptable solutions in dynamic, real-world applications.

Title: Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings

Authors: Amal S. Perera, David Fernandez, Chandi Witharana, Elias Manos, Michael Pimenta, Anna K. Liljedahl, Ingmar Nitze, Yili Yang, Todd Nicholson, Chia-Yu Hsu, Wenwen Li, Guido Grosse
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02868
Pdf URL: https://arxiv.org/pdf/2506.02868
Copy Paste: [[2506.02868]] Pan-Arctic Permafrost Landform and Human-built Infrastructure Feature Detection with Vision Transformers and Location Embeddings(https://arxiv.org/abs/2506.02868)
Keywords: self-supervised
Abstract: Accurate mapping of permafrost landforms, thaw disturbances, and human-built infrastructure at pan-Arctic scale using sub-meter satellite imagery is increasingly critical. Handling petabyte-scale image data requires high-performance computing and robust feature detection models. While convolutional neural network (CNN)-based deep learning approaches are widely used for remote sensing (RS), Vision Transformers (ViTs), similar to the success of transformer-based large language models, offer advantages in capturing long-range dependencies and global context via attention mechanisms. ViTs support pretraining via self-supervised learning, addressing the common limitation of labeled data in Arctic feature detection, and outperform CNNs on benchmark datasets. The Arctic also poses challenges for model generalization, especially when features with the same semantic class exhibit diverse spectral characteristics. To address these issues for Arctic feature detection, we integrate geospatial location embeddings into ViTs to improve adaptation across regions. This work investigates: (1) the suitability of pre-trained ViTs as feature extractors for high-resolution Arctic remote sensing tasks, and (2) the benefit of combining image and location embeddings. Using previously published datasets for Arctic feature detection, we evaluate our models on three tasks: detecting ice-wedge polygons (IWP), retrogressive thaw slumps (RTS), and human-built infrastructure. We empirically explore multiple configurations to fuse image embeddings and location embeddings. Results show that ViTs with location embeddings outperform prior CNN-based models on two of the three tasks, including an F1 score increase from 0.84 to 0.92 for RTS detection, demonstrating the potential of transformer-based models with spatial awareness for Arctic RS applications.

Title: Token and Span Classification for Entity Recognition in French Historical Encyclopedias

Authors: Ludovic Moncla, Hédi Zeghidi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.02872
Pdf URL: https://arxiv.org/pdf/2506.02872
Copy Paste: [[2506.02872]] Token and Span Classification for Entity Recognition in French Historical Encyclopedias(https://arxiv.org/abs/2506.02872)
Keywords: generative
Abstract: Named Entity Recognition (NER) in historical texts presents unique challenges due to non-standardized language, archaic orthography, and nested or overlapping entities. This study benchmarks a diverse set of NER approaches, ranging from classical Conditional Random Fields (CRFs) and spaCy-based models to transformer-based architectures such as CamemBERT and sequence-labeling models like Flair. Experiments are conducted on the GeoEDdA dataset, a richly annotated corpus derived from 18th-century French encyclopedias. We propose framing NER as both token-level and span-level classification to accommodate complex nested entity structures typical of historical documents. Additionally, we evaluate the emerging potential of few-shot prompting with generative language models for low-resource scenarios. Our results demonstrate that while transformer-based models achieve state-of-the-art performance, especially on nested entities, generative models offer promising alternatives when labeled data are scarce. The study highlights ongoing challenges in historical NER and suggests avenues for hybrid approaches combining symbolic and neural methods to better capture the intricacies of early modern French text.

Title: Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning

Authors: Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Subjects: cs.CL, cs.AI, cs.CE, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02911
Pdf URL: https://arxiv.org/pdf/2506.02911
Copy Paste: [[2506.02911]] Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning(https://arxiv.org/abs/2506.02911)
Keywords: foundation model
Abstract: Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are linked from the arXiv abstract page.

Title: Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection

Authors: Yechi Ma, Wei Hua, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.02914
Pdf URL: https://arxiv.org/pdf/2506.02914
Copy Paste: [[2506.02914]] Towards Auto-Annotation from Annotation Guidelines: A Benchmark through 3D LiDAR Detection(https://arxiv.org/abs/2506.02914)
Keywords: foundation model
Abstract: A crucial yet under-appreciated prerequisite in machine learning solutions for real-world applications is data annotation: human annotators are hired to manually label data according to detailed, expert-crafted guidelines. This is often a laborious, tedious, and costly process. To study methods for facilitating data annotation, we introduce a new benchmark AnnoGuide: Auto-Annotation from Annotation Guidelines. It aims to evaluate automated methods for data annotation directly from expert-defined annotation guidelines, eliminating the need for manual labeling. As a case study, we repurpose the well-established nuScenes dataset, commonly used in autonomous driving research, which provides comprehensive annotation guidelines for labeling LiDAR point clouds with 3D cuboids across 18 object classes. These guidelines include a few visual examples and textual descriptions, but no labeled 3D cuboids in LiDAR data, making this a novel task of multi-modal few-shot 3D detection without 3D annotations. The advances of powerful foundation models (FMs) make AnnoGuide especially timely, as FMs offer promising tools to tackle its challenges. We employ a conceptually straightforward pipeline that (1) utilizes open-source FMs for object detection and segmentation in RGB images, (2) projects 2D detections into 3D using known camera poses, and (3) clusters LiDAR points within the frustum of each 2D detection to generate a 3D cuboid. Starting with a non-learned solution that leverages off-the-shelf FMs, we progressively refine key components and achieve significant performance improvements, boosting 3D detection mAP from 12.1 to 21.9! Nevertheless, our results highlight that AnnoGuide remains an open and challenging problem, underscoring the urgent need for developing LiDAR-based FMs. We release our code and models on GitHub (linked from the arXiv abstract page).
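
Steps (2) and (3) of the pipeline in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, LiDAR points already expressed in the camera frame, and a plain centroid standing in for real clustering and cuboid fitting.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]         # (x, y, z) in the camera frame, z pointing forward
Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def points_in_frustum(points: List[Point], box: Box2D,
                      fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Keep the points whose pinhole projection lands inside the 2D detection box."""
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points:
        if z <= 0.0:  # behind the camera, so it cannot be projected
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


def centroid(points: List[Point]) -> Point:
    """Mean position of the selected points (a crude stand-in for a cuboid center)."""
    if not points:
        raise ValueError("no points inside the frustum")
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


# Hypothetical toy data: one point inside the box, one outside, one behind the camera.
cloud = [(0.0, 0.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, -1.0)]
selected = points_in_frustum(cloud, (40.0, 40.0, 60.0, 60.0),
                             fx=100.0, fy=100.0, cx=50.0, cy=50.0)
center = centroid(selected)
```

A real system would additionally need the LiDAR-to-camera transform and a way to separate the object from background points that fall in the same frustum.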

Title: INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification

Authors: Diogo A.P. Nunes, Eugénio Ribeiro
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02924
Pdf URL: https://arxiv.org/pdf/2506.02924
Copy Paste: [[2506.02924]] INESC-ID @ eRisk 2025: Exploring Fine-Tuned, Similarity-Based, and Prompt-Based Approaches to Depression Symptom Identification(https://arxiv.org/abs/2506.02924)
Keywords: foundation model
Abstract: In this work, we describe our team's approach to eRisk's 2025 Task 1: Search for Symptoms of Depression. Given a set of sentences and the Beck's Depression Inventory - II (BDI) questionnaire, participants were tasked with submitting up to 1,000 sentences per depression symptom in the BDI, sorted by relevance. Participant submissions were evaluated according to standard Information Retrieval (IR) metrics, including Average Precision (AP) and R-Precision (R-PREC). The provided training data, however, consisted of sentences labeled as to whether a given sentence was relevant or not w.r.t. one of BDI's symptoms. Due to this labeling limitation, we framed our development as a binary classification task for each BDI symptom, and evaluated accordingly. To that end, we split the available labeled data into training and validation sets, and explored foundation model fine-tuning, sentence similarity, Large Language Model (LLM) prompting, and ensemble techniques. The validation results revealed that fine-tuning foundation models yielded the best performance, particularly when enhanced with synthetic data to mitigate class imbalance. We also observed that the optimal approach varied by symptom. Based on these insights, we devised five independent test runs, two of which used ensemble methods. These runs achieved the highest scores in the official IR evaluation, outperforming submissions from 16 other teams.
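
The two ranking metrics named in this abstract have standard textbook definitions, sketched below for a single query (here, a single symptom). The function names and toy data are illustrative assumptions, and the official eRisk evaluation script may differ in details such as tie handling.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item appears,
    normalized by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision within the top R results, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r


# Hypothetical ranking of four sentence ids, two of which are relevant.
ranking = ["a", "b", "c", "d"]
gold = {"a", "c"}
ap = average_precision(ranking, gold)   # (1/1 + 2/3) / 2
rp = r_precision(ranking, gold)         # 1 relevant item in the top 2
```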

Title: FORLA: Federated Object-centric Representation Learning with Slot Attention

Authors: Guiqiu Liao, Matjaz Jogan, Eric Eaton, Daniel A. Hashimoto
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.02964
Pdf URL: https://arxiv.org/pdf/2506.02964
Copy Paste: [[2506.02964]] FORLA: Federated Object-centric Representation Learning with Slot Attention(https://arxiv.org/abs/2506.02964)
Keywords: foundation model
Abstract: Learning efficient visual representations across heterogeneous unlabeled datasets remains a central challenge in federated learning. Effective federated representations require features that are jointly informative across clients while disentangling domain-specific factors without supervision. We introduce FORLA, a novel framework for federated object-centric representation learning and feature adaptation across clients using unsupervised slot attention. At the core of our method is a shared feature adapter, trained collaboratively across clients to adapt features from foundation models, and a shared slot attention module that learns to reconstruct the adapted features. To optimize this adapter, we design a two-branch student-teacher architecture. In each client, a student decoder learns to reconstruct full features from foundation models, while a teacher decoder reconstructs their adapted, low-dimensional counterpart. The shared slot attention module bridges cross-domain learning by aligning object-level representations across clients. Experiments in multiple real-world datasets show that our framework not only outperforms centralized baselines on object discovery but also learns a compact, universal representation that generalizes well across domains. This work highlights federated slot attention as an effective tool for scalable, unsupervised visual representation learning from cross-domain data with distributed concepts.

Title: Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

Authors: Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.02973
Pdf URL: https://arxiv.org/pdf/2506.02973
Copy Paste: [[2506.02973]] Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation(https://arxiv.org/abs/2506.02973)
Keywords: diffusion
Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as "hallucinations", remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs' internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
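
The abstract does not say which interpolation PLI uses or where layers are inserted, so the following is only a sketch under stated assumptions: each layer is a dictionary of flat parameter lists, the new layer is a linear blend of two adjacent layers with a hypothetical coefficient `alpha`, and it is placed between them.

```python
from typing import Dict, List

Layer = Dict[str, List[float]]  # parameter name -> flattened weights (toy representation)


def interpolate_layer(a: Layer, b: Layer, alpha: float = 0.5) -> Layer:
    """Linear blend of two layers with identical parameter names and shapes."""
    return {
        name: [(1.0 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def insert_interpolated(layers: List[Layer], index: int, alpha: float = 0.5) -> List[Layer]:
    """Return a new stack with a blended layer between layers[index] and layers[index + 1]."""
    new_layer = interpolate_layer(layers[index], layers[index + 1], alpha)
    return layers[: index + 1] + [new_layer] + layers[index + 1:]


# Hypothetical two-layer stack with a single parameter each.
stack = [{"w": [0.0, 2.0]}, {"w": [4.0, 6.0]}]
deeper = insert_interpolated(stack, index=0)
```

The original stack is left unmodified, which matches the training-free, plug-and-play framing: the extra layer is derived entirely from existing weights.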

Title: On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses

Authors: Mohamed Djilani, Thibault Simonetto, Karim Tit, Florian Tambon, Paul Récamier, Salah Ghamizi, Maxime Cordy, Mike Papadakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02978
Pdf URL: https://arxiv.org/pdf/2506.02978
Copy Paste: [[2506.02978]] On the Robustness of Tabular Foundation Models: Test-Time Attacks and In-Context Defenses(https://arxiv.org/abs/2506.02978)
Keywords: foundation model, in-context
Abstract: Recent tabular Foundational Models (FM), such as TabPFN and TabICL, leverage in-context learning to achieve strong performance without gradient updates or fine-tuning. However, their robustness to adversarial manipulation remains largely unexplored. In this work, we present a comprehensive study of the adversarial vulnerabilities of tabular FM, focusing on both their fragility to targeted test-time attacks and their potential misuse as adversarial tools. We show on three benchmarks in finance, cybersecurity, and healthcare that small, structured perturbations to test inputs can significantly degrade prediction accuracy, even when the training context remains fixed. Additionally, we demonstrate that tabular FM can be repurposed to generate evasion attacks that transfer to conventional models such as random forests and XGBoost, and to a lesser extent to deep tabular models. To improve tabular FM, we formulate the robustification problem as an optimization of the weights (adversarial fine-tuning), or the context (adversarial in-context learning). We introduce an in-context adversarial training strategy that incrementally replaces the context with adversarially perturbed instances, without updating model weights. Our approach improves robustness across multiple tabular benchmarks. Together, these findings position tabular FM as both a target and a source of adversarial threats, highlighting the urgent need for robust training and evaluation practices in this emerging paradigm.

Title: Astrophotography turbulence mitigation via generative models

Authors: Joonyeoup Kim, Yu Yuan, Xingguang Zhang, Xijun Wang, Stanley Chan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.02981
Pdf URL: https://arxiv.org/pdf/2506.02981
Copy Paste: [[2506.02981]] Astrophotography turbulence mitigation via generative models(https://arxiv.org/abs/2506.02981)
Keywords: diffusion, generative
Abstract: Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are linked from the arXiv abstract page.

Title: Implicit Regularization of the Deep Inverse Prior Trained with Inertia

Authors: Nathan Buskulic, Jalal Fadil, Yvain Quéau
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.02986
Pdf URL: https://arxiv.org/pdf/2506.02986
Copy Paste: [[2506.02986]] Implicit Regularization of the Deep Inverse Prior Trained with Inertia(https://arxiv.org/abs/2506.02986)
Keywords: self-supervised
Abstract: Solving inverse problems with neural networks currently comes with very few theoretical recovery guarantees. We provide in this work convergence and recovery guarantees for self-supervised neural networks applied to inverse problems, such as Deep Image/Inverse Prior, and trained with inertia featuring both viscous and geometric Hessian-driven dampings. We study both the continuous-time case, i.e., the trajectory of a dynamical system, and the discrete case leading to an inertial algorithm with an adaptive step-size. We show in the continuous-time case that the network can be trained with an optimal accelerated exponential convergence rate compared to the rate obtained with gradient flow. We also show that training a network with our inertial algorithm enjoys similar recovery guarantees though with a less sharp linear convergence rate.

Title: DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

Authors: Jiarui Wang, Huiyu Duan, Juntong Wang, Ziheng Jia, Woo Yi Yang, Xiaorong Zhu, Yu Zhao, Jiaying Qian, Yuke Xing, Guangtao Zhai, Xiongkuo Min
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03007
Pdf URL: https://arxiv.org/pdf/2506.03007
Copy Paste: [[2506.03007]] DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models(https://arxiv.org/abs/2506.03007)
Keywords: generative
Abstract: With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) the latest models, with fake images generated by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking, evaluating both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. The database and code are publicly available (linked from the arXiv abstract page).

Title: Sample complexity of Schrödinger potential estimation

Authors: Nikita Puchkin, Iurii Pustovalov, Yuri Sapronov, Denis Suchkov, Alexey Naumov, Denis Belomestny
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03043
Pdf URL: https://arxiv.org/pdf/2506.03043
Copy Paste: [[2506.03043]] Sample complexity of Schrödinger potential estimation(https://arxiv.org/abs/2506.03043)
Keywords: diffusion, generative
Abstract: We address the problem of Schrödinger potential estimation, which plays a crucial role in modern generative modelling approaches based on Schrödinger bridges and stochastic optimal control for SDEs. Given a simple prior diffusion process, these methods search for a path between two given distributions $\rho_0$ and $\rho_T^*$ requiring minimal effort. The optimal drift in this case can be expressed through a Schrödinger potential. In the present paper, we study the generalization ability of an empirical Kullback-Leibler (KL) risk minimizer over a class of admissible log-potentials aimed at fitting the marginal distribution at time $T$. Under reasonable assumptions on the target distribution $\rho_T^*$ and the prior process, we derive a non-asymptotic high-probability upper bound on the KL-divergence between $\rho_T^*$ and the terminal density corresponding to the estimated log-potential. In particular, we show that the excess KL-risk may decrease as fast as $O(\log^2 n / n)$ when the sample size $n$ tends to infinity even if both $\rho_0$ and $\rho_T^*$ have unbounded supports.

Title: Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

Authors: Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, Tao Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03065
Pdf URL: https://arxiv.org/pdf/2506.03065
Copy Paste: [[2506.03065]] Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers(https://arxiv.org/abs/2506.03065)
Keywords: diffusion
Abstract: While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09$\times$, 2.38$\times$, and 1.67$\times$ theoretical FLOP reduction, and actual inference speedups of 1.76$\times$, 1.85$\times$, and 1.58$\times$, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
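
To make the three sparsity patterns named in this abstract concrete, here is a small sketch that builds boolean attention masks for each. The abstract gives no exact definitions, so the parameters (`width`, `stride`, `columns`) and the mask constructions are generic interpretations, not the paper's kernels.

```python
from typing import List, Set

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def diagonal_mask(n: int, width: int) -> Mask:
    """Band around the main diagonal: |i - j| <= width."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]


def multi_diagonal_mask(n: int, stride: int, width: int) -> Mask:
    """Bands repeated every `stride` positions away from the main diagonal."""
    def near_multiple(d: int) -> bool:
        r = d % stride
        return min(r, stride - r) <= width
    return [[near_multiple(abs(i - j)) for j in range(n)] for i in range(n)]


def vertical_stripe_mask(n: int, columns: Set[int]) -> Mask:
    """Every query attends only to a fixed set of key positions."""
    return [[j in columns for j in range(n)] for _ in range(n)]


def density(mask: Mask) -> float:
    """Fraction of query-key pairs that are actually computed."""
    total = sum(len(row) for row in mask)
    return sum(1 for row in mask for keep in row if keep) / total


# Toy 4x4 examples.
d_diag = density(diagonal_mask(4, 0))
d_multi = density(multi_diagonal_mask(4, 2, 0))
d_stripe = density(vertical_stripe_mask(4, {0}))
```

The density of a mask gives a rough upper bound on the attention FLOP savings for that head; realized speedups are smaller in practice, as the gap between the theoretical and measured numbers in the abstract indicates.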

Title: EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

Authors: Mingzhe Li, Gehao Zhang, Zhenting Wang, Shiqing Ma, Siqi Pan, Richard Cartwright, Juan Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03067
Pdf URL: https://arxiv.org/pdf/2506.03067
Copy Paste: [[2506.03067]] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.03067)
Keywords: diffusion
Abstract: Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often fall short on image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on widely used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.

Title: ORV: 4D Occupancy-centric Robot Video Generation

Authors: Xiuyu Yang, Bohan Li, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Xin Jin, Hang Zhao, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03079
Pdf URL: https://arxiv.org/pdf/2506.03079
Copy Paste: [[2506.03079]] ORV: 4D Occupancy-centric Robot Video Generation(https://arxiv.org/abs/2506.03079)
Keywords: generative
Abstract: Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. Recently, action-driven generative models have gained widespread adoption in robot learning and simulation, as they eliminate safety concerns and reduce maintenance efforts. However, the action sequences used in these methods often result in limited control precision and poor generalization due to their globally coarse alignment. To address these limitations, we propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation to provide more accurate semantic and geometric guidance for video generation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability. Furthermore, our framework supports the simultaneous generation of multi-view videos of robot gripping operations - an important capability for downstream robotic learning tasks. Extensive experimental results demonstrate that ORV consistently outperforms existing baseline methods across various datasets and sub-tasks. Demo, Code and Model: this https URL

Title: SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis

Authors: Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03082
Pdf URL: https://arxiv.org/pdf/2506.03082
Copy Paste: [[2506.03082]] SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis(https://arxiv.org/abs/2506.03082)
Keywords: diffusion, generative
Abstract: Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short in providing the necessary photorealism and the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting the fine-grained human control aspect. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID's capabilities across three public datasets featuring cataract and cholecystectomy surgery. SG2VID outperforms previous methods both qualitatively and quantitatively, and it also enables precise synthesis, providing accurate control over the size and movement of tools and anatomy, the entrance of new tools, and the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID's ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.

Title: Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

Authors: Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang
Subjects: cs.LG, cs.AI, cs.CL, cs.IR, math.ST
Abstract URL: https://arxiv.org/abs/2506.03100
Pdf URL: https://arxiv.org/pdf/2506.03100
Copy Paste: [[2506.03100]] Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds(https://arxiv.org/abs/2506.03100)
Keywords: in-context
Abstract: Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers classical in-context learning (ICL) and standard RAG as limiting cases. Our analysis suggests that RAG, unlike ICL, faces an intrinsic ceiling on generalization error. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.

Title: ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Authors: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03107
Pdf URL: https://arxiv.org/pdf/2506.03107
Copy Paste: [[2506.03107]] ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions(https://arxiv.org/abs/2506.03107)
Keywords: diffusion
Abstract: Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions, poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.

Title: Rectified Flows for Fast Multiscale Fluid Flow Modeling

Authors: Victor Armegioiu, Yannick Ramic, Siddhartha Mishra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03111
Pdf URL: https://arxiv.org/pdf/2506.03111
Copy Paste: [[2506.03111]] Rectified Flows for Fast Multiscale Fluid Flow Modeling(https://arxiv.org/abs/2506.03111)
Keywords: diffusion
Abstract: The statistical modeling of fluid flows is very challenging due to their multiscale dynamics and extreme sensitivity to initial conditions. While recently proposed conditional diffusion models achieve high fidelity, they typically require hundreds of stochastic sampling steps at inference. We introduce a rectified flow framework that learns a time-dependent velocity field, transporting input to output distributions along nearly straight trajectories. By casting sampling as solving an ordinary differential equation (ODE) along this straighter flow field, our method makes each integration step much more effective, using as few as eight steps versus (more than) 128 steps in standard score-based diffusion, without sacrificing predictive fidelity. Experiments on challenging multiscale flow benchmarks show that rectified flows recover the same posterior distributions as diffusion models, preserve fine-scale features that MSE-trained baselines miss, and deliver high-resolution samples in a fraction of inference time.
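
Note: the few-step sampling described in this abstract amounts to integrating an ODE with a small, fixed number of steps. The sketch below uses plain forward Euler and a toy constant velocity field standing in for the learned, time-dependent one. The names `euler_sample` and `straight_velocity` are hypothetical, and the paper's actual integrator and network are not specified in this digest.

```python
def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_velocity(x, t):
    """Toy stand-in for a learned field: a constant displacement (straight path)."""
    return [2.0, -1.0]


sample = euler_sample(straight_velocity, [0.0, 0.0], num_steps=8)
```

When the trajectory is exactly straight, as in this toy field, Euler integration has no discretization error at any step count. That is the intuition behind using nearly straight flows to cut the number of sampling steps.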

Title: Targeted Forgetting of Image Subgroups in CLIP Models

Authors: Zeliang Zhang, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Chenliang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03117
Pdf URL: https://arxiv.org/pdf/2506.03117
Copy Paste: [[2506.03117]] Targeted Forgetting of Image Subgroups in CLIP Models(https://arxiv.org/abs/2506.03117)
Keywords: foundation model
Abstract: Foundation models (FMs) such as CLIP have demonstrated impressive zero-shot performance across various tasks by leveraging large-scale, unsupervised pre-training. However, they often inherit harmful or unwanted knowledge from noisy internet-sourced datasets, compromising their reliability in real-world applications. Existing model unlearning methods either rely on access to pre-trained datasets or focus on coarse-grained unlearning (e.g., entire classes), leaving a critical gap for fine-grained unlearning. In this paper, we address the challenging scenario of selectively forgetting specific portions of knowledge within a class, without access to pre-trained data, while preserving the model's overall performance. We propose a novel three-stage approach that progressively unlearns targeted knowledge while mitigating over-forgetting. It consists of (1) a forgetting stage to fine-tune the CLIP on samples to be forgotten, (2) a reminding stage to restore performance on retained samples, and (3) a restoring stage to recover zero-shot capabilities using model souping. Additionally, we introduce knowledge distillation to handle the distribution disparity between forgetting, retaining samples, and unseen pre-trained data. Extensive experiments on CIFAR-10, ImageNet-1K, and style datasets demonstrate that our approach effectively unlearns specific subgroups while maintaining strong zero-shot performance on semantically similar subgroups and other categories, significantly outperforming baseline unlearning methods, which lose effectiveness under the CLIP unlearning setting.
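
Note: "model souping" in stage (3) commonly refers to averaging the weights of several checkpoints that share one architecture. The sketch below shows a uniform average over plain Python dictionaries. Which checkpoints the paper combines, and whether it weights them equally, is not stated in the abstract, so both are assumptions here, and `uniform_soup` is a hypothetical helper name.

```python
def uniform_soup(state_dicts):
    """Average parameters with matching names across several checkpoints.

    Each state dict maps a parameter name to a flat list of floats.
    """
    count = len(state_dicts)
    soup = {}
    for name in state_dicts[0]:
        # Group the i-th entry of this parameter from every checkpoint.
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(values) / count for values in columns]
    return soup


# Toy example: an "original" and a "fine-tuned" checkpoint with one parameter.
original = {"w": [1.0, 3.0]}
finetuned = {"w": [3.0, 5.0]}
merged = uniform_soup([original, finetuned])
```

Averaging the pre-trained weights with fine-tuned ones is one plausible way to pull a model back toward its zero-shot behaviour, which fits the stated goal of the restoring stage.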

Title: Controllable Human-centric Keyframe Interpolation with Generative Prior

Authors: Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03119
Pdf URL: https://arxiv.org/pdf/2506.03119
Copy Paste: [[2506.03119]] Controllable Human-centric Keyframe Interpolation with Generative Prior(https://arxiv.org/abs/2506.03119)
Keywords: diffusion, generative
Abstract: Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, our PoseFuse3D, a 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.

Title: DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Authors: Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03123
Pdf URL: https://arxiv.org/pdf/2506.03123
Copy Paste: [[2506.03123]] DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation(https://arxiv.org/abs/2506.03123)
Keywords: diffusion
Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at this https URL.

Title: AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

Authors: Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03126
Pdf URL: https://arxiv.org/pdf/2506.03126
Copy Paste: [[2506.03126]] AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation(https://arxiv.org/abs/2506.03126)
Keywords: diffusion
Abstract: Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions. Additionally, a dedicated subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources. To demonstrate the effectiveness of AnimeShooter and establish a baseline for the reference-guided multi-shot video generation task, we introduce AnimeShooterGen, which leverages Multimodal Large Language Models (MLLMs) and video diffusion models. The reference image and previously generated shots are first processed by the MLLM to produce representations aware of both reference and context, which are then used as the condition for the diffusion model to decode the subsequent shot. Experimental results show that the model trained on AnimeShooter achieves superior cross-shot visual consistency and adherence to reference visual guidance, highlighting the value of our dataset for coherent animated video generation.

Title: Zero-Shot Time Series Forecasting with Covariates via In-Context Learning

Authors: Andreas Auer, Raghul Parthipan, Pedro Mercado, Abdul Fatir Ansari, Lorenzo Stella, Bernie Wang, Michael Bohlke-Schneider, Syama Sundar Rangapuram
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03128
Pdf URL: https://arxiv.org/pdf/2506.03128
Copy Paste: [[2506.03128]] Zero-Shot Time Series Forecasting with Covariates via In-Context Learning(https://arxiv.org/abs/2506.03128)
Keywords: in-context
Abstract: Pretrained time series models, capable of zero-shot forecasting, have demonstrated significant potential in enhancing both the performance and accessibility of time series forecasting. However, existing pretrained models either do not support covariates or fail to incorporate them effectively. We introduce COSMIC, a zero-shot forecasting model that utilizes covariates via in-context learning. To address the challenge of data scarcity, we propose Informative Covariate Augmentation, which enables the training of COSMIC without requiring any datasets that include covariates. COSMIC achieves state-of-the-art performance in zero-shot forecasting, both with and without covariates. Our quantitative and qualitative analysis demonstrates that COSMIC effectively leverages covariates in zero-shot forecasting.

Title: Native-Resolution Image Synthesis

Authors: Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03131
Pdf URL: https://arxiv.org/pdf/2506.03131
Copy Paste: [[2506.03131]] Native-Resolution Image Synthesis(https://arxiv.org/abs/2506.03131)
Keywords: diffusion, generative
Abstract: We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.

Title: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Authors: Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03147
Pdf URL: https://arxiv.org/pdf/2506.03147
Copy Paste: [[2506.03147]] UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation(https://arxiv.org/abs/2506.03147)
Keywords: generative
Abstract: Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, they remain limited on image perception and manipulation tasks, which users need for a wide range of applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interest. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of a VAE, even though VAEs are considered essential components in many image manipulation models. Motivated by these observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% as much data as BAGEL, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.

Title: Self-Supervised Spatial Correspondence Across Modalities

Authors: Ayush Shrivastava, Andrew Owens
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03148
Pdf URL: https://arxiv.org/pdf/2506.03148
Copy Paste: [[2506.03148]] Self-Supervised Spatial Correspondence Across Modalities(https://arxiv.org/abs/2506.03148)
Keywords: self-supervised
Abstract: We present a method for finding cross-modal space-time correspondences. Given two images from different visual modalities, such as an RGB image and a depth map, our model identifies which pairs of pixels correspond to the same physical points in the scene. To solve this problem, we extend the contrastive random walk framework to simultaneously learn cycle-consistent feature representations for both cross-modal and intra-modal matching. The resulting model is simple and has no explicit photo-consistency assumptions. It can be trained entirely using unlabeled data, without the need for any spatially aligned multimodal image pairs. We evaluate our method on both geometric and semantic correspondence tasks. For geometric matching, we consider challenging tasks such as RGB-to-depth and RGB-to-thermal matching (and vice versa); for semantic matching, we evaluate on photo-sketch and cross-style image alignment. Our method achieves strong performance across all benchmarks.
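
Note: the cycle-consistency idea behind a contrastive random walk can be sketched as a round trip between two sets of features. A walker steps from modality A to modality B and back, and the loss rewards returning to the starting node. The code below is a generic illustration under stated assumptions, not the paper's model: the dot-product similarity, the `temperature` value, and all function names are hypothetical, and the features would normally come from learned encoders.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def transition(src, dst, temperature=0.1):
    """Row-stochastic matrix: softmax over dot-product similarities src -> dst."""
    sims = [[sum(a * b for a, b in zip(s, d)) for d in dst] for s in src]
    return [softmax(row, temperature) for row in sims]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cycle_loss(feats_a, feats_b, temperature=0.1):
    """Cross-entropy between the round-trip walk A -> B -> A and the identity."""
    round_trip = matmul(transition(feats_a, feats_b, temperature),
                        transition(feats_b, feats_a, temperature))
    n = len(feats_a)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n
```

With well-matched features the round trip returns to its starting node and the loss is near zero. With uninformative features the walk spreads uniformly and the loss approaches log(n), so no aligned pairs are needed as supervision.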

Title: IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation

Authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Ronald Clark, Ming-Hsuan Yang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.03150
Pdf URL: https://arxiv.org/pdf/2506.03150
Copy Paste: [[2506.03150]] IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation(https://arxiv.org/abs/2506.03150)
Keywords: diffusion
Abstract: Although diffusion-based models can generate high-quality and high-resolution video sequences from textual or image inputs, they lack explicit integration of geometric cues when controlling scene lighting and visual appearance across frames. To address this limitation, we propose IllumiCraft, an end-to-end diffusion framework accepting three complementary inputs: (1) high-dynamic-range (HDR) video maps for detailed lighting control; (2) synthetically relit frames with randomized illumination changes (optionally paired with a static background reference image) to provide appearance cues; and (3) 3D point tracks that capture precise 3D geometry information. By integrating the lighting, appearance, and geometry cues within a unified diffusion architecture, IllumiCraft generates temporally coherent videos aligned with user-defined prompts. It supports background-conditioned and text-conditioned video relighting and provides better fidelity than existing controllable video generation methods. Project Page: this https URL