2025-07-01

Title: Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring

Authors: Xinxin Sun, Peter Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22437
Pdf URL: https://arxiv.org/pdf/2506.22437
Copy Paste: [[2506.22437]] Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring(https://arxiv.org/abs/2506.22437)
Keywords: diffusion
Abstract: Accurate image alignment is essential for monitoring crack evolution in structural health monitoring (SHM), particularly under real-world conditions involving perspective distortion, occlusion, and low contrast. However, traditional feature detectors such as SIFT and SURF, which rely on Gaussian-based scale spaces, tend to suppress high-frequency edges, making them unsuitable for thin crack localization. Lightweight binary alternatives like ORB and BRISK, while computationally efficient, often suffer from poor keypoint repeatability on textured or shadowed surfaces. This study presents a physics-informed alignment framework that adapts the open KAZE architecture to SHM-specific challenges. By utilizing nonlinear anisotropic diffusion to construct a crack-preserving scale space, and integrating RANSAC-based homography estimation, the framework enables accurate geometric correction without the need for training, parameter tuning, or prior calibration. The method is validated on time-lapse images of masonry and concrete acquired via handheld smartphone under varied field conditions, including shadow interference, cropping, oblique viewing angles, and surface clutter. Compared to classical detectors, the proposed framework reduces crack area and spine length errors by up to 70 percent and 90 percent, respectively, while maintaining sub-5 percent alignment error in key metrics. Unsupervised, interpretable, and computationally lightweight, this approach supports scalable deployment via UAVs and mobile platforms. By tailoring nonlinear scale-space modeling to SHM image alignment, this work offers a robust and physically grounded alternative to conventional techniques for tracking real-world crack evolution.

Title: Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization

Authors: Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22463
Pdf URL: https://arxiv.org/pdf/2506.22463
Copy Paste: [[2506.22463]] Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization(https://arxiv.org/abs/2506.22463)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherents the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significant reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at this https URL.

Title: Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models

Authors: Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22500
Pdf URL: https://arxiv.org/pdf/2506.22500
Copy Paste: [[2506.22500]] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models(https://arxiv.org/abs/2506.22500)
Keywords: diffusion
Abstract: Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at this https URL.

Title: Weakly Supervised Object Segmentation by Background Conditional Divergence

Authors: Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22505
Pdf URL: https://arxiv.org/pdf/2506.22505
Copy Paste: [[2506.22505]] Weakly Supervised Object Segmentation by Background Conditional Divergence(https://arxiv.org/abs/2506.22505)
Keywords: generative
Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic, images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The basecode for this work can be found at \href{GitHub}{this https URL}.

Title: SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning

Authors: Momin Ahmad Khan, Yasra Chandio, Fatima Muhammad Anwar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22506
Pdf URL: https://arxiv.org/pdf/2506.22506
Copy Paste: [[2506.22506]] SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning(https://arxiv.org/abs/2506.22506)
Keywords: anomaly
Abstract: Federated Prompt Learning has emerged as a communication-efficient and privacy-preserving paradigm for adapting large vision-language models like CLIP across decentralized clients. However, the security implications of this setup remain underexplored. In this work, we present the first study of backdoor attacks in Federated Prompt Learning. We show that when malicious clients inject visually imperceptible, learnable noise triggers into input images, the global prompt learner becomes vulnerable to targeted misclassification while still maintaining high accuracy on clean inputs. Motivated by this vulnerability, we propose SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. SABRE-FL requires no access to raw client data or labels and generalizes across diverse datasets. We show, both theoretically and empirically, that malicious clients can be reliably identified and filtered using an embedding-based detector. Across five diverse datasets and four baseline defenses, SABRE-FL outperforms all baselines by significantly reducing backdoor accuracy while preserving clean accuracy, demonstrating strong empirical performance and underscoring the need for robust prompt learning in future federated systems.

Title: AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Authors: Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22508
Pdf URL: https://arxiv.org/pdf/2506.22508
Copy Paste: [[2506.22508]] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text(https://arxiv.org/abs/2506.22508)
Keywords: in-context
Abstract: In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization this http URL, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at this https URL.

Title: FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Authors: Hang Xu, Jie Huang, Linjiang Huang, Dong Li, Yidi Liu, Feng Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22509
Pdf URL: https://arxiv.org/pdf/2506.22509
Copy Paste: [[2506.22509]] FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment(https://arxiv.org/abs/2506.22509)
Keywords: diffusion
Abstract: Domain Adaptation(DA) for dense prediction tasks is an important topic, which enhances the dense prediction model's performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, the exploration of DA designs tailored to this framework is worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at \href{this https URL}{this https URL}.

Title: Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning

Authors: Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22510
Pdf URL: https://arxiv.org/pdf/2506.22510
Copy Paste: [[2506.22510]] Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning(https://arxiv.org/abs/2506.22510)
Keywords: foundation model
Abstract: Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Considering graph data, especially graphs without textual features, is ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed in the single-domain scenario, which regard contrastive samples from different domains as equivalent. From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely this http URL the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method outperforms state-of-the-art significantly, with the maximum improvement of 19.33\% on accuracy and 19.13\% on Macro-F1 score.

Title: Lightning the Night with Generative Artificial Intelligence

Authors: Tingting Zhou, Feng Zhang, Haoyang Fu, Baoxiang Pan, Renhe Zhang, Feng Lu, Zhixin Yang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22511
Pdf URL: https://arxiv.org/pdf/2506.22511
Copy Paste: [[2506.22511]] Lightning the Night with Generative Artificial Intelligence(https://arxiv.org/abs/2506.22511)
Keywords: diffusion, generative
Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables 0.47~\mu\mathrm{m}, 0.65~\mu\mathrm{m}, and 0.825~\mu\mathrm{m} bands visible light reflectance retrieval at night. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model's nighttime retrieval capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.

Title: In-context learning for the classification of manipulation techniques in phishing emails

Authors: Antony Dalmiere (LAAS-TRUST, LAAS), Guillaume Auriol (LAAS-TRUST, INSA Toulouse), Vincent Nicomette (LAAS-TSF, LAAS), Pascal Marchand (LERASS)
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22515
Pdf URL: https://arxiv.org/pdf/2506.22515
Copy Paste: [[2506.22515]] In-context learning for the classification of manipulation techniques in phishing emails(https://arxiv.org/abs/2506.22515)
Keywords: in-context
Abstract: Traditional phishing detection often overlooks psychological manipulation. This study investigates using Large Language Model (LLM) In-Context Learning (ICL) for fine-grained classification of phishing emails based on a taxonomy of 40 manipulation techniques. Using few-shot examples with GPT-4o-mini on real-world French phishing emails (SignalSpam), we evaluated performance against a human-annotated test set (100 emails). The approach effectively identifies prevalent techniques (e.g., Baiting, Curiosity Appeal, Request For Minor Favor) with a promising accuracy of 0.76. This work demonstrates ICL's potential for nuanced phishing analysis and provides insights into attacker strategies.

Title: A Survey on Model Extraction Attacks and Defenses for Large Language Models

Authors: Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, Yushun Dong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22521
Pdf URL: https://arxiv.org/pdf/2506.22521
Copy Paste: [[2506.22521]] A Survey on Model Extraction Attacks and Defenses for Large Language Models(https://arxiv.org/abs/2506.22521)
Keywords: generative
Abstract: Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.

Title: Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Authors: Shansong Wang, Zhecheng Jin, Mingzhe Hu, Mojtaba Safari, Feng Zhao, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22567
Pdf URL: https://arxiv.org/pdf/2506.22567
Copy Paste: [[2506.22567]] Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation(https://arxiv.org/abs/2506.22567)
Keywords: foundation model
Abstract: CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.

Title: CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Authors: Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22637
Pdf URL: https://arxiv.org/pdf/2506.22637
Copy Paste: [[2506.22637]] CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation(https://arxiv.org/abs/2506.22637)
Keywords: diffusion
Abstract: The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.

Title: 3D Shape Generation: A Survey

Authors: Nicolas Caytuiro, Ivan Sipiran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22678
Pdf URL: https://arxiv.org/pdf/2506.22678
Copy Paste: [[2506.22678]] 3D Shape Generation: A Survey(https://arxiv.org/abs/2506.22678)
Keywords: generative
Abstract: Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state of the art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.

Title: Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment

Authors: Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2506.22685
Pdf URL: https://arxiv.org/pdf/2506.22685
Copy Paste: [[2506.22685]] Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment(https://arxiv.org/abs/2506.22685)
Keywords: generative
Abstract: In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V^*$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V^*$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V^*$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V^*$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at this https URL.

Title: Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Authors: Jianing Zhang, Jiayi Zhu, Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22753
Pdf URL: https://arxiv.org/pdf/2506.22753
Copy Paste: [[2506.22753]] Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography(https://arxiv.org/abs/2506.22753)
Keywords: diffusion
Abstract: Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: this https URL.

Title: Multimodal Atmospheric Super-Resolution With Deep Generative Models

Authors: Dibyajyoti Chakraborty, Haiwen Guan, Jason Stock, Troy Arcomano, Guido Cervone, Romit Maulik
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2506.22780
Pdf URL: https://arxiv.org/pdf/2506.22780
Copy Paste: [[2506.22780]] Multimodal Atmospheric Super-Resolution With Deep Generative Models(https://arxiv.org/abs/2506.22780)
Keywords: diffusion, generative
Abstract: Score-based diffusion modeling is a generative machine learning algorithm that can be used to sample from complex distributions. They achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and reversing a noising process using the same. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.

Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22783
Pdf URL: https://arxiv.org/pdf/2506.22783
Copy Paste: [[2506.22783]] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection(https://arxiv.org/abs/2506.22783)
Keywords: generative
Abstract: Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.

Title: RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Authors: Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22800
Pdf URL: https://arxiv.org/pdf/2506.22800
Copy Paste: [[2506.22800]] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors(https://arxiv.org/abs/2506.22800)
Keywords: diffusion
Abstract: A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expanding a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjust Gaussian optimization progress according to scene converge metrics, which achieving better convergence than baseline methods. Extensive evaluations of publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source-code will be made publicly available at this https URL. (Camera-ready version incorporating reviewer suggestions will be updated soon.)

Title: Riemannian-Geometric Fingerprints of Generative Models

Authors: Hae Jin Song, Laurent Itti
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22802
Pdf URL: https://arxiv.org/pdf/2506.22802
Copy Paste: [[2506.22802]] Riemannian-Geometric Fingerprints of Generative Models(https://arxiv.org/abs/2506.22802)
Keywords: generative
Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.

Title: Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

Authors: Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22806
Pdf URL: https://arxiv.org/pdf/2506.22806
Copy Paste: [[2506.22806]] Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate(https://arxiv.org/abs/2506.22806)
Keywords: diffusion
Abstract: Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding \emph{linear} modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding \emph{nonlinear} Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at this https URL

Title: Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration

Authors: Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22819
Pdf URL: https://arxiv.org/pdf/2506.22819
Copy Paste: [[2506.22819]] Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration(https://arxiv.org/abs/2506.22819)
Keywords: self-supervised
Abstract: Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of test time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) To further maintain the quality of prompts during \tpt, we propose a novel regularization loss to reduce intraclass distance, and increase inter-class distance between the learnt Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: this https URL.

Title: Listener-Rewarded Thinking in VLMs for Image Preferences

Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22832
Pdf URL: https://arxiv.org/pdf/2506.22832
Copy Paste: [[2506.22832]] Listener-Rewarded Thinking in VLMs for Image Preferences(https://arxiv.org/abs/2506.22832)
Keywords: generative
Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: this https URL.

Title: SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds

Authors: Shashikant Verma, Shanmuganathan Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22833
Pdf URL: https://arxiv.org/pdf/2506.22833
Copy Paste: [[2506.22833]] SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds(https://arxiv.org/abs/2506.22833)
Keywords: generative
Abstract: Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit's superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.

Title: xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection

Authors: Kamil Faber, Marcin Pietroń, Dominik Żurek, Roberto Corizzo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22837
Pdf URL: https://arxiv.org/pdf/2506.22837
Copy Paste: [[2506.22837]] xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection(https://arxiv.org/abs/2506.22837)
Keywords: anomaly
Abstract: The recently proposed xLSTM is a powerful model that leverages expressive multiplicative gating and residual connections, providing the temporal capacity needed for long-horizon forecasting and representation learning. This architecture has demonstrated success in time series forecasting, lossless compression, and even large-scale language modeling tasks, where its linear memory footprint and fast inference make it a viable alternative to Transformers. Despite its growing popularity, no prior work has explored xLSTM for anomaly detection. In this work, we fill this gap by proposing xLSTMAD, the first anomaly detection method that integrates a full encoder-decoder xLSTM architecture, purpose-built for multivariate time series data. Our encoder processes input sequences to capture historical context, while the decoder is devised in two separate variants of the method. In the forecasting approach, the decoder iteratively generates forecasted future values xLSTMAD-F, while the reconstruction approach reconstructs the input time series from its encoded counterpart xLSTMAD-R. We investigate the performance of two loss functions: Mean Squared Error (MSE), and Soft Dynamic Time Warping (SoftDTW) to consider local reconstruction fidelity and global sequence alignment, respectively. We evaluate our method on the comprehensive TSB-AD-M benchmark, which spans 17 real-world datasets, using state-of-the-art challenging metrics such as VUS-PR. In our results, xLSTM showcases state-of-the-art accuracy, outperforming 23 popular anomaly detection baselines. Our paper is the first work revealing the powerful modeling capabilities of xLSTM for anomaly detection, paving the way for exciting new developments on this subject. Our code is available at: this https URL

Title: STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

Authors: Junsung Lee, Junoh Kang, Bohyung Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22868
Pdf URL: https://arxiv.org/pdf/2506.22868
Copy Paste: [[2506.22868]] STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing(https://arxiv.org/abs/2506.22868)
Keywords: diffusion
Abstract: Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and-most notably-limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.

Title: Towards Time Series Generation Conditioned on Unstructured Natural Language

Authors: Jaeyun Woo, Jiseok Lee, Brian Kenji Iwana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22927
Pdf URL: https://arxiv.org/pdf/2506.22927
Copy Paste: [[2506.22927]] Towards Time Series Generation Conditioned on Unstructured Natural Language(https://arxiv.org/abs/2506.22927)
Keywords: diffusion, generative
Abstract: Generative Artificial Intelligence (AI) has rapidly become a powerful tool, capable of generating various types of data, such as images and text. However, despite the significant advancement of generative AI, time series generative AI remains underdeveloped, even though the application of time series is essential in finance, climate, and numerous fields. In this research, we propose a novel method of generating time series conditioned on unstructured natural language descriptions. We use a diffusion model combined with a language model to generate time series from the text. Through the proposed method, we demonstrate that time series generation based on natural language is possible. The proposed method can provide various applications such as custom forecasting, time series manipulation, data augmentation, and transfer learning. Furthermore, we construct and propose a new public dataset for time series generation, consisting of 63,010 time series-description pairs.

Title: Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

Authors: Shreyas Dixit, Ashhar Aziz, Shashwat Bajpai, Vasu Sharma, Aman Chadha, Vinija Jain, Amitava Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22960
Pdf URL: https://arxiv.org/pdf/2506.22960
Copy Paste: [[2506.22960]] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images(https://arxiv.org/abs/2506.22960)
Keywords: generative
Abstract: A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.

Title: On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

Authors: Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22977
Pdf URL: https://arxiv.org/pdf/2506.22977
Copy Paste: [[2506.22977]] On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"(https://arxiv.org/abs/2506.22977)
Keywords: in-context
Abstract: We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.

Title: Cybersecurity-Focused Anomaly Detection in Connected Autonomous Vehicles Using Machine Learning

Authors: Prathyush Kumar Reddy Lebaku, Lu Gao, Yunpeng Zhang, Zhixia Li, Yongxin Liu, Tanvir Arafin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22984
Pdf URL: https://arxiv.org/pdf/2506.22984
Copy Paste: [[2506.22984]] Cybersecurity-Focused Anomaly Detection in Connected Autonomous Vehicles Using Machine Learning(https://arxiv.org/abs/2506.22984)
Keywords: anomaly
Abstract: Anomaly detection in connected autonomous vehicles (CAVs) is crucial for maintaining safe and reliable transportation networks, as CAVs can be susceptible to sensor malfunctions, cyber-attacks, and unexpected environmental disruptions. This study explores an anomaly detection approach by simulating vehicle behavior, generating a dataset that represents typical and atypical vehicular interactions. The dataset includes time-series data of position, speed, and acceleration for multiple connected autonomous vehicles. We utilized machine learning models to effectively identify abnormal driving patterns. First, we applied a stacked Long Short-Term Memory (LSTM) model to capture temporal dependencies and sequence-based anomalies. The stacked LSTM model processed the sequential data to learn standard driving behaviors. Additionally, we deployed a Random Forest model to support anomaly detection by offering ensemble-based predictions, which enhanced model interpretability and performance. The Random Forest model achieved an R2 of 0.9830, MAE of 5.746, and a 95th percentile anomaly threshold of 14.18, while the stacked LSTM model attained an R2 of 0.9998, MAE of 82.425, and a 95th percentile anomaly threshold of 265.63. These results demonstrate the models' effectiveness in accurately predicting vehicle trajectories and detecting anomalies in autonomous driving scenarios.

Title: Kernel Outlier Detection

Authors: Can Hakan Dağıdır, Mia Hubert, Peter J. Rousseeuw
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22994
Pdf URL: https://arxiv.org/pdf/2506.22994
Copy Paste: [[2506.22994]] Kernel Outlier Detection(https://arxiv.org/abs/2506.22994)
Keywords: anomaly
Abstract: A new anomaly detection method called kernel outlier detection (KOD) is proposed. It is designed to address challenges of outlier detection in high-dimensional settings. The aim is to overcome limitations of existing methods, such as dependence on distributional assumptions or on hyperparameters that are hard to tune. KOD starts with a kernel transformation, followed by a projection pursuit approach. Its novelties include a new ensemble of directions to search over, and a new way to combine results of different direction types. This provides a flexible and lightweight approach for outlier detection. Our empirical evaluations illustrate the effectiveness of KOD on three small datasets with challenging structures, and on four large benchmark datasets.

Title: Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

Authors: Xinrong Hu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23038
Pdf URL: https://arxiv.org/pdf/2506.23038
Copy Paste: [[2506.23038]] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23038)
Keywords: diffusion
Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without need for retraining. Specifically, given a pair of image and label mask, we crop the area labeled with the foreground and condition on it during reversed denoising process for every noise level. Masked background area would gradually be filled in, and all generated images are paired with the label mask. This approach ensures the accuracy of match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.

Title: Ovis-U1 Technical Report

Authors: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23044
Pdf URL: https://arxiv.org/pdf/2506.23044
Copy Paste: [[2506.23044]] Ovis-U1 Technical Report(https://arxiv.org/abs/2506.23044)
Keywords: diffusion
Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

Title: Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction

Authors: Hanlin Dong, Arian Prabowo, Hao Xue, Flora D. Salim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23053
Pdf URL: https://arxiv.org/pdf/2506.23053
Copy Paste: [[2506.23053]] Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction(https://arxiv.org/abs/2506.23053)
Keywords: diffusion, generative
Abstract: Air quality prediction is a challenging forecasting task due to its spatio-temporal complexity and the inherent dynamics as well as uncertainty. Most of the current models handle these two challenges by applying Graph Neural Networks or known physics principles, and quantifying stochasticity through probabilistic networks like Diffusion models. Nevertheless, finding the right balancing point between the certainties and uncertainties remains an open question. Therefore, we propose Double-Diffusion, a novel diffusion probabilistic model that harnesses the power of known physics to guide air quality forecasting with stochasticity. To the best of our knowledge, while precedents have been made of using conditional diffusion models to predict air pollution, this is the first attempt to use physics as a conditional generative approach for air quality prediction. Along with a sampling strategy adopted from image restoration and a new denoiser architecture, Double-Diffusion ranks first in most evaluation scenarios across two real-life datasets compared with other probabilistic models, it also cuts inference time by 50% to 30% while enjoying an increase between 3-12% in Continuous Ranked Probabilistic Score (CRPS).

Title: Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval

Authors: Sophie Zhou, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23132
Pdf URL: https://arxiv.org/pdf/2506.23132
Copy Paste: [[2506.23132]] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval(https://arxiv.org/abs/2506.23132)
Keywords: foundation model, generative
Abstract: Art plagiarism detection plays a crucial role in protecting artists' copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagarisms by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists' styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2\% but suffers from low retrieval precision 29.0\% average precision (AP). To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled in the database. The finetuned model greatly improves retrieval performance by 12\% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7\%). We conclude with insightful discussions and outline directions for future research.

Title: VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23138
Pdf URL: https://arxiv.org/pdf/2506.23138
Copy Paste: [[2506.23138]] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis(https://arxiv.org/abs/2506.23138)
Keywords: diffusion, generative
Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.

Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Authors: Dingzriui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23146
Pdf URL: https://arxiv.org/pdf/2506.23146
Copy Paste: [[2506.23146]] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions(https://arxiv.org/abs/2506.23146)
Keywords: in-context
Abstract: In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.

Title: V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23149
Pdf URL: https://arxiv.org/pdf/2506.23149
Copy Paste: [[2506.23149]] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy(https://arxiv.org/abs/2506.23149)
Keywords: in-context
Abstract: High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. So this paper focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods confirming the effectiveness of V-Synthesis.

Title: Self-Supervised Contrastive Learning for Multi-Label Images

Authors: Jiale Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23156
Pdf URL: https://arxiv.org/pdf/2506.23156
Copy Paste: [[2506.23156]] Self-Supervised Contrastive Learning for Multi-Label Images(https://arxiv.org/abs/2506.23156)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has demonstrated its effectiveness in learning representations through comparison methods that align with human intuition. However, mainstream SSL methods heavily rely on high body datasets with single label, such as ImageNet, resulting in intolerable pre-training overhead. Besides, more general multi-label images are frequently overlooked in SSL, despite their potential for richer semantic information and broader applicability in downstream scenarios. Therefore, we tailor the mainstream SSL approach to guarantee excellent representation learning capabilities using fewer multi-label images. Firstly, we propose a block-wise augmentation module aimed at extracting additional potential positive view pairs from multi-label images. Subsequently, an image-aware contrastive loss is devised to establish connections between these views, thereby facilitating the extraction of semantically consistent representations. Comprehensive linear fine-tuning and transfer learning validate the competitiveness of our approach despite challenging sample quality and quantity.

Title: Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data

Authors: Chen Gong, Bo Liang, Wei Gao, Chenren Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23174
Pdf URL: https://arxiv.org/pdf/2506.23174
Copy Paste: [[2506.23174]] Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data(https://arxiv.org/abs/2506.23174)
Keywords: generative
Abstract: Generative models have gained significant attention for their ability to produce realistic synthetic data that supplements the quantity of real-world datasets. While recent studies show performance improvements in wireless sensing tasks by incorporating all synthetic data into training sets, the quality of synthetic data remains unpredictable and the resulting performance gains are not guaranteed. To address this gap, we propose tractable and generalizable metrics to quantify quality attributes of synthetic data - affinity and diversity. Our assessment reveals prevalent affinity limitation in current wireless synthetic data, leading to mislabeled data and degraded task performance. We attribute the quality limitation to generative models' lack of awareness of untrained conditions and domain-specific processing. To mitigate these issues, we introduce SynCheck, a quality-guided synthetic data utilization scheme that refines synthetic data quality during task model training. Our evaluation demonstrates that SynCheck consistently outperforms quality-oblivious utilization of synthetic data, and achieves 4.3% performance improvement even when the previous utilization degrades performance by 13.4%.

Title: Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data

Authors: Robert Frank, Michael Widrich, Rahmad Akbar, Günter Klambauer, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23182
Pdf URL: https://arxiv.org/pdf/2506.23182
Copy Paste: [[2506.23182]] Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data(https://arxiv.org/abs/2506.23182)
Keywords: generative
Abstract: Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.

Title: BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Authors: Dequan Kong, Zhe Zhu, Honghua Chen, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23205
Pdf URL: https://arxiv.org/pdf/2506.23205
Copy Paste: [[2506.23205]] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion(https://arxiv.org/abs/2506.23205)
Keywords: diffusion
Abstract: Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.

Title: PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Authors: Aradhana Mishra, Bumshik Lee
Subjects: cs.CV, cs.AI, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23254
Pdf URL: https://arxiv.org/pdf/2506.23254
Copy Paste: [[2506.23254]] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution(https://arxiv.org/abs/2506.23254)
Keywords: diffusion
Abstract: Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference times by decreasing sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly focusing on texture and edge definitions. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio(PSNR), structural similarity index measure (SSIM), as well as visual quality. To determine the edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.

Title: Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Authors: Lei-lei Li, Jianwu Fang, Junbin Xiao, Shanmin Pang, Hongkai Yu, Chen Lv, Jianru Xue, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23263
Pdf URL: https://arxiv.org/pdf/2506.23263
Copy Paste: [[2506.23263]] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis(https://arxiv.org/abs/2506.23263)
Keywords: diffusion
Abstract: Egocentricly comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

Title: Why Settle for One? Text-to-ImageSet Generation and Evaluation

Authors: Chengyou Jia, Xin Shen, Zhuohang Dang, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W.Tsang, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23275
Pdf URL: https://arxiv.org/pdf/2506.23275
Copy Paste: [[2506.23275]] Why Settle for One? Text-to-ImageSet Generation and Evaluation(https://arxiv.org/abs/2506.23275)
Keywords: diffusion, in-context
Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce $\textbf{T2IS-Bench}$ with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose $\textbf{T2IS-Eval}$, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose $\textbf{AutoT2IS}$, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project in this https URL.

Title: Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Authors: Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23282
Pdf URL: https://arxiv.org/pdf/2506.23282
Copy Paste: [[2506.23282]] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector(https://arxiv.org/abs/2506.23282)
Keywords: generative, anomaly
Abstract: Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is catching growing interest, as it can model normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these ``unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. Through autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function for an enhanced appearance perception and accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.

Title: MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Authors: Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23283
Pdf URL: https://arxiv.org/pdf/2506.23283
Copy Paste: [[2506.23283]] MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition(https://arxiv.org/abs/2506.23283)
Keywords: foundation model
Abstract: Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.

Title: Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis

Authors: Zelin Zang, WenZhe Li, Fei Chen, Yongjie Xu, Chang Yu, Zhen Lei, Stan Z. Li
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23287
Pdf URL: https://arxiv.org/pdf/2506.23287
Copy Paste: [[2506.23287]] Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis(https://arxiv.org/abs/2506.23287)
Keywords: diffusion, generative
Abstract: In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding complex biological processes. Key to this is the modeling and generation of hierarchical data that represents the intrinsic structure within datasets. Traditional methods face limitations in terms of computational cost, performance, generative capacity, and stability. Recent VAEs based approaches have made strides in addressing these challenges but still require specialized network modules for each tree branch, limiting their stability and ability to capture deep hierarchical relationships. To overcome these challenges, we introduce diffusion-based approach called HDTree. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes to model tree node transitions. This method improves stability by eliminating branch-specific modules and enhancing generative capacity through gradual hierarchical changes simulated by the diffusion process. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in terms of accuracy and performance. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths and offering insights for downstream biological tasks. The code of HDTree is available at anonymous link this https URL.

Title: Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

Authors: P. Myles Eugenio
Subjects: cs.CL, cs.AI, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.23293
Pdf URL: https://arxiv.org/pdf/2506.23293
Copy Paste: [[2506.23293]] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines(https://arxiv.org/abs/2506.23293)
Keywords: generative
Abstract: We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology -- quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class -- even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems -- where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.

Title: DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On

Authors: Xiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23295
Pdf URL: https://arxiv.org/pdf/2506.23295
Copy Paste: [[2506.23295]] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On(https://arxiv.org/abs/2506.23295)
Keywords: diffusion
Abstract: Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.

Title: Securing AI Systems: A Guide to Known Attacks and Impacts

Authors: Naoto Kiribuchi, Kengo Zenitani, Takayuki Semitsu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23296
Pdf URL: https://arxiv.org/pdf/2506.23296
Copy Paste: [[2506.23296]] Securing AI Systems: A Guide to Known Attacks and Impacts(https://arxiv.org/abs/2506.23296)
Keywords: generative
Abstract: Embedded into information systems, artificial intelligence (AI) faces security threats that exploit AI-specific vulnerabilities. This paper provides an accessible overview of adversarial attacks unique to predictive and generative AI systems. We identify eleven major attack types and explicitly link attack techniques to their impacts -- including information leakage, system compromise, and resource exhaustion -- mapped to the confidentiality, integrity, and availability (CIA) security triad. We aim to equip researchers, developers, security practitioners, and policymakers, even those without specialized AI security expertise, with foundational knowledge to recognize AI-specific risks and implement effective defenses, thereby enhancing the overall security posture of AI systems.

Title: FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method

Authors: Quang-Huy Che, Vinh-Tiep Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23323
Pdf URL: https://arxiv.org/pdf/2506.23323
Copy Paste: [[2506.23323]] FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method(https://arxiv.org/abs/2506.23323)
Keywords: diffusion
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the number of iterations with the quality of the segmentation. In this work, we propose FastSeg, a novel and efficient training-free framework with only (1+1)-step of reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance the segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned selfattention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency.

Title: IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23329
Pdf URL: https://arxiv.org/pdf/2506.23329
Copy Paste: [[2506.23329]] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering(https://arxiv.org/abs/2506.23329)
Keywords: generative
Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

Title: Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment

Authors: Pawel Renc, Michal K. Grzeszczyk, Linglong Qian, Nassim Oufattole, Jeff Rasley, Arkadiusz Sitek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23358
Pdf URL: https://arxiv.org/pdf/2506.23358
Copy Paste: [[2506.23358]] Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment(https://arxiv.org/abs/2506.23358)
Keywords: foundation model, generative
Abstract: We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed timeseries data applied to electronic health records (EHR). At its core, FTS represents patient history as tokenized Patient Health Timelines (PHTs), language-agnostic sequences encoding temporal, categorical, and continuous clinical information. Each institution trains an autoregressive transformer on its local PHTs and transmits only model weights to a central server. The server uses the generators to synthesize a large corpus of trajectories and train a Global Generator (GG), enabling zero-shot inference via Monte Carlo simulation of future PHTs. We evaluate FTS on five clinically meaningful prediction tasks using MIMIC-IV data, showing that models trained on synthetic data generated by GG perform comparably to those trained on real data. FTS offers strong privacy guarantees, scalability across institutions, and extensibility to diverse prediction and simulation tasks especially in healthcare, including counterfactual inference, early warning detection, and synthetic trial design.

Title: OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23361
Pdf URL: https://arxiv.org/pdf/2506.23361
Copy Paste: [[2506.23361]] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions(https://arxiv.org/abs/2506.23361)
Keywords: diffusion
Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: this https URL. Our code will be released at this https URL

Title: When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery

Authors: Dominik Meier, Sujai Hiremath, Promit Ghosal, Kyra Gan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23374
Pdf URL: https://arxiv.org/pdf/2506.23374
Copy Paste: [[2506.23374]] When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery(https://arxiv.org/abs/2506.23374)
Keywords: diffusion
Abstract: Distinguishing cause and effect from bivariate observational data is a foundational problem in many disciplines, but challenging without additional assumptions. Additive noise models (ANMs) are widely used to enable sample-efficient bivariate causal discovery. However, conventional ANM-based methods fail when unobserved mediators corrupt the causal relationship between variables. This paper makes three key contributions: first, we rigorously characterize why standard ANM approaches break down in the presence of unmeasured mediators. Second, we demonstrate that prior solutions for hidden mediation are brittle in finite sample settings, limiting their practical utility. To address these gaps, we propose Bivariate Denoising Diffusion (BiDD) for causal discovery, a method designed to handle latent noise introduced by unmeasured mediators. Unlike prior methods that infer directionality through mean squared error loss comparisons, our approach introduces a novel independence test statistic: during the noising and denoising processes for each variable, we condition on the other variable as input and evaluate the independence of the predicted noise relative to this input. We prove asymptotic consistency of BiDD under the ANM, and conjecture that it performs well under hidden mediation. Experiments on synthetic and real-world data demonstrate consistent performance, outperforming existing methods in mediator-corrupted settings while maintaining strong performance in mediator-free settings.

Title: Pipelined Decoder for Efficient Context-Aware Text Generation

Authors: Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23431
Pdf URL: https://arxiv.org/pdf/2506.23431
Copy Paste: [[2506.23431]] Pipelined Decoder for Efficient Context-Aware Text Generation(https://arxiv.org/abs/2506.23431)
Keywords: generative
Abstract: As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.

Title: PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

Authors: Mahesh Bhosale, Abdul Wasi, Yuanhao Zhai, Yunjie Tian, Samuel Border, Nan Xi, Pinaki Sarder, Junsong Yuan, David Doermann, Xuan Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23440
Pdf URL: https://arxiv.org/pdf/2506.23440
Copy Paste: [[2506.23440]] PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions(https://arxiv.org/abs/2506.23440)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.

Title: Enhancing Insider Threat Detection Using User-Based Sequencing and Transformer Encoders

Authors: Mohamed Elbasheer, Adewale Akinfaderin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23446
Pdf URL: https://arxiv.org/pdf/2506.23446
Copy Paste: [[2506.23446]] Enhancing Insider Threat Detection Using User-Based Sequencing and Transformer Encoders(https://arxiv.org/abs/2506.23446)
Keywords: anomaly
Abstract: Insider threat detection presents unique challenges due to the authorized status of malicious actors and the subtlety of anomalous behaviors. Existing machine learning methods often treat user activity as isolated events, thereby failing to leverage sequential dependencies in user behavior. In this study, we propose a User-Based Sequencing (UBS) methodology, transforming the CERT insider threat dataset into structured temporal sequences suitable for deep sequential modeling. We deploy a Transformer Encoder architecture to model benign user activity and employ its reconstruction errors as anomaly scores. These scores are subsequently evaluated using three unsupervised outlier detection algorithms: One-Class SVM (OCSVM), Local Outlier Factor (LOF), and Isolation Forest (iForest). Across four rigorously designed test sets, including combinations of multiple CERT dataset releases, our UBS-Transformer pipeline consistently achieves state-of-the-art performance - notably 96.61% accuracy, 99.43% recall, 96.38% F1-score, 95.00% AUROC, and exceptionally low false negative (0.0057) and false positive (0.0571) rates. Comparative analyses demonstrate that our approach substantially outperforms tabular and conventional autoencoder baselines, underscoring the efficacy of sequential user modeling and advanced anomaly detection in the insider threat domain.

Title: Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23460
Pdf URL: https://arxiv.org/pdf/2506.23460
Copy Paste: [[2506.23460]] Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23460)
Keywords: diffusion
Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from CDM external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.

Title: Time-variant Image Inpainting via Interactive Distribution Transition Estimation

Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23461
Pdf URL: https://arxiv.org/pdf/2506.23461
Copy Paste: [[2506.23461]] Time-variant Image Inpainting via Interactive Distribution Transition Estimation(https://arxiv.org/abs/2506.23461)
Keywords: diffusion
Abstract: In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images captured the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under TAMP setup presents significant content distinction to the target image and potentially also suffers from damages. Such an application frequently happens in our daily lives to restore a damaged image by referring to another reference image, where there is no guarantee of the reference image's source and quality. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics thus facilitate the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for TAMP task, we newly assembled a dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street datasets under two different time-variant image inpainting settings, which show our method consistently outperform SOTA reference-guided image inpainting methods for solving TAMP.

Title: Reconciling Attribute and Structural Anomalies for Improved Graph Anomaly Detection

Authors: Chunjing Xiao, Jiahui Lu, Xovee Xu, Fan Zhou, Tianshu Xie, Wei Lu, Lifeng Xu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2506.23469
Pdf URL: https://arxiv.org/pdf/2506.23469
Copy Paste: [[2506.23469]] Reconciling Attribute and Structural Anomalies for Improved Graph Anomaly Detection(https://arxiv.org/abs/2506.23469)
Keywords: anomaly
Abstract: Graph anomaly detection is critical in domains such as healthcare and economics, where identifying deviations can prevent substantial losses. Existing unsupervised approaches strive to learn a single model capable of detecting both attribute and structural anomalies. However, they confront the tug-of-war problem between two distinct types of anomalies, resulting in suboptimal performance. This work presents TripleAD, a mutual distillation-based triple-channel graph anomaly detection framework. It includes three estimation modules to identify the attribute, structural, and mixed anomalies while mitigating the interference between different types of anomalies. In the first channel, we design a multiscale attribute estimation module to capture extensive node interactions and ameliorate the over-smoothing issue. To better identify structural anomalies, we introduce a link-enhanced structure estimation module in the second channel that facilitates information flow to topologically isolated nodes. The third channel is powered by an attribute-mixed curvature, a new indicator that encapsulates both attribute and structural information for discriminating mixed anomalies. Moreover, a mutual distillation strategy is introduced to encourage communication and collaboration between the three channels. Extensive experiments demonstrate the effectiveness of the proposed TripleAD model against strong baselines.

Title: MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23482
Pdf URL: https://arxiv.org/pdf/2506.23482
Copy Paste: [[2506.23482]] MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting(https://arxiv.org/abs/2506.23482)
Keywords: diffusion, generative
Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.

Title: ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models

Authors: Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23513
Pdf URL: https://arxiv.org/pdf/2506.23513
Copy Paste: [[2506.23513]] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models(https://arxiv.org/abs/2506.23513)
Keywords: diffusion
Abstract: Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.

Title: WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23518
Pdf URL: https://arxiv.org/pdf/2506.23518
Copy Paste: [[2506.23518]] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image(https://arxiv.org/abs/2506.23518)
Keywords: diffusion
Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

Title: When Test-Time Adaptation Meets Self-Supervised Models

Authors: Jisu Han, Jihee Park, Dongyoon Han, Wonjun Hwang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23529
Pdf URL: https://arxiv.org/pdf/2506.23529
Copy Paste: [[2506.23529]] When Test-Time Adaptation Meets Self-Supervised Models(https://arxiv.org/abs/2506.23529)
Keywords: self-supervised
Abstract: Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising but it remains highly reliant on the performance of source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-supervised learning (SSL) without relying on source pretraining. We introduce a self-supervised TTA protocol after observing that existing TTA approaches struggle when directly applied to self-supervised models with low accuracy on the source domain. Furthermore, we propose a collaborative learning framework that integrates SSL and TTA models, leveraging contrastive learning and knowledge distillation for stepwise representation refinement. We validate our method on diverse self-supervised models, including DINO, MoCo, and iBOT, across TTA benchmarks. Extensive experiments validate the effectiveness of our approach in SSL, showing that it achieves competitive performance even without source pretraining.

Title: Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound

Authors: Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23538
Pdf URL: https://arxiv.org/pdf/2506.23538
Copy Paste: [[2506.23538]] Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound(https://arxiv.org/abs/2506.23538)
Keywords: diffusion, anomaly
Abstract: Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at this https URL.

Title: Pyramidal Patchification Flow for Visual Generation

Authors: Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23543
Pdf URL: https://arxiv.org/pdf/2506.23543
Copy Paste: [[2506.23543]] Pyramidal Patchification Flow for Visual Generation(https://arxiv.org/abs/2506.23543)
Keywords: diffusion
Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations other than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speed over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at this https URL.

Title: JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.23552
Pdf URL: https://arxiv.org/pdf/2506.23552
Copy Paste: [[2506.23552]] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching(https://arxiv.org/abs/2506.23552)
Keywords: diffusion, generative
Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs-including text, reference audio, and reference motion-facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. project page: this https URL

Title: Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution

Authors: Luigi Sigillo, Renato Giamba, Danilo Comminiello
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23566
Pdf URL: https://arxiv.org/pdf/2506.23566
Copy Paste: [[2506.23566]] Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution(https://arxiv.org/abs/2506.23566)
Keywords: diffusion
Abstract: The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.

Title: StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection

Authors: Yanning Hou, Yanran Ruan, Junfa Li, Shanshan Wang, Jianfeng Qiu, Ke Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23577
Pdf URL: https://arxiv.org/pdf/2506.23577
Copy Paste: [[2506.23577]] StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection(https://arxiv.org/abs/2506.23577)
Keywords: anomaly
Abstract: Enhancing the alignment between text and image features in the CLIP model is a critical challenge in zero-shot industrial anomaly detection tasks. Recent studies predominantly utilize specific category prompts during pretraining, which can cause overfitting to the training categories and limit model generalization. To address this, we propose a method that transforms category names through multicategory name stacking to create stacked prompts, forming the basis of our StackCLIP model. Our approach introduces two key components. The Clustering-Driven Stacked Prompts (CSP) module constructs generic prompts by stacking semantically analogous categories, while utilizing multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers tailored for each stack cluster and adaptively integrates them based on the attributes of test categories. These modules work together to deliver superior training speed, stability, and convergence, significantly boosting anomaly segmentation performance. Additionally, our stacked prompt framework offers robust generalization across classification tasks. To further improve performance, we introduce the Regulating Prompt Learning (RPL) module, which leverages the generalization power of stacked prompts to refine prompt learning, elevating results in anomaly detection classification tasks. Extensive testing on seven industrial anomaly detection datasets demonstrates that our method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks.

Title: Transition Matching: Scalable and Flexible Generative Modeling

Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23589
Pdf URL: https://arxiv.org/pdf/2506.23589
Copy Paste: [[2506.23589]] Transition Matching: Scalable and Flexible Generative Modeling(https://arxiv.org/abs/2506.23589)
Keywords: diffusion, generative
Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Title: CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23590
Pdf URL: https://arxiv.org/pdf/2506.23590
Copy Paste: [[2506.23590]] CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models(https://arxiv.org/abs/2506.23590)
Keywords: generative
Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks, demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigating performance only with minimal additional inference cost.

Title: When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series

Authors: Min-Yeong Park, Won-Jeong Lee, Seong Tae Kim, Gyeong-Moon Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23596
Pdf URL: https://arxiv.org/pdf/2506.23596
Copy Paste: [[2506.23596]] When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series(https://arxiv.org/abs/2506.23596)
Keywords: anomaly
Abstract: Recently, forecasting future abnormal events has emerged as an important scenario to tackle real-world necessities. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains under-explored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address the AP task, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal adaptive prompt. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies. Our implementation code is available at this https URL.

Title: SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

Authors: Zhengkang Xiang, Zizhao Li, Amir Khodabandeh, Kourosh Khoshelham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23606
Pdf URL: https://arxiv.org/pdf/2506.23606
Copy Paste: [[2506.23606]] SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion(https://arxiv.org/abs/2506.23606)
Keywords: diffusion, generative
Abstract: Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.

Title: PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum

Authors: Shiqi Zhang, Sha Zhang, Jiajun Deng, Yedong Shen, Mingxiao MA, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23607
Pdf URL: https://arxiv.org/pdf/2506.23607
Copy Paste: [[2506.23607]] PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum(https://arxiv.org/abs/2506.23607)
Keywords: foundation model
Abstract: Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.

Title: TurboVSR: Fantastic Video Upscalers and Where to Find Them

Authors: Zhongdao Wang, Guodongfang Zhao, Jingjing Ren, Bailan Feng, Shifeng Zhang, Wenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23618
Pdf URL: https://arxiv.org/pdf/2506.23618
Copy Paste: [[2506.23618]] TurboVSR: Fantastic Video Upscalers and Where to Find Them(https://arxiv.org/abs/2506.23618)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image resolution by considering image as a one-frame video. Our efficient design makes SR beyond 1080p possible, results on 4K (3648$\times$2048) image SR show surprising fine details.

Title: Blending Concepts with Text-to-Image Diffusion Models

Authors: Lorenzo Olearo, Giorgio Longari, Alessandro Raganato, Rafael Peñaloza, Simone Melzi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23630
Pdf URL: https://arxiv.org/pdf/2506.23630
Copy Paste: [[2506.23630]] Blending Concepts with Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.23630)
Keywords: diffusion
Abstract: Diffusion models have dramatically advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease. In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework. Specifically, concept blending merges the key attributes of multiple concepts (expressed as textual prompts) into a single, novel image that captures the essence of each concept. We investigate four blending methods, each exploiting different aspects of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning). Through systematic experimentation across diverse concept categories, such as merging concrete concepts, synthesizing compound words, transferring artistic styles, and blending architectural landmarks, we show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning. Our extensive user study, involving 100 participants, reveals that no single approach dominates in all scenarios: each blending technique excels under certain conditions, with factors like prompt ordering, conceptual distance, and random seed affecting the outcome. These findings highlight the remarkable compositional potential of diffusion models while exposing their sensitivity to seemingly minor input variations.

Title: Unified Multimodal Understanding via Byte-Pair Visual Encoding

Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23639
Pdf URL: https://arxiv.org/pdf/2506.23639
Copy Paste: [[2506.23639]] Unified Multimodal Understanding via Byte-Pair Visual Encoding(https://arxiv.org/abs/2506.23639)
Keywords: foundation model
Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.

Title: VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

Authors: Peng Huang, Junhu Fu, Bowen Guo, Zeju Li, Yuanyuan Wang, Yi Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23641
Pdf URL: https://arxiv.org/pdf/2506.23641
Copy Paste: [[2506.23641]] VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation(https://arxiv.org/abs/2506.23641)
Keywords: diffusion, generative
Abstract: As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combination of descriptions at the test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.

Title: MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis

Authors: Zhe Liu, Yuhao Huang, Lian Liu, Chengrui Zhang, Haotian Lin, Tong Han, Zhiyuan Zhu, Yanlin Chen, Yuerui Chen, Dong Ni, Zhongshan Gou, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23648
Pdf URL: https://arxiv.org/pdf/2506.23648
Copy Paste: [[2506.23648]] MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis(https://arxiv.org/abs/2506.23648)
Keywords: anomaly
Abstract: Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer's diagnostic logic for accurate MR grading. Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: this https URL.

Title: On the Domain Robustness of Contrastive Vision-Language Models

Authors: Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23663
Pdf URL: https://arxiv.org/pdf/2506.23663
Copy Paste: [[2506.23663]] On the Domain Robustness of Contrastive Vision-Language Models(https://arxiv.org/abs/2506.23663)
Keywords: foundation model
Abstract: In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.

Title: A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement

Authors: Gaozheng Pei, Ke Ma, Dongpeng Zhang, Chengzhi Sun, Qianqian Xu, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23676
Pdf URL: https://arxiv.org/pdf/2506.23676
Copy Paste: [[2506.23676]] A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement(https://arxiv.org/abs/2506.23676)
Keywords: diffusion
Abstract: Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification tasks, such as in Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.

Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23690
Pdf URL: https://arxiv.org/pdf/2506.23690
Copy Paste: [[2506.23690]] SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation(https://arxiv.org/abs/2506.23690)
Keywords: diffusion, generative
Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., ''cats'' or ''dogs'') to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics cause the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which \textbf{alternately optimizes} subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that \method outperforms existing baselines. Project page: this https URL

Title: Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

Authors: Haoyang Chen, Dongfang Sun, Caoyuan Ma, Shiqin Wang, Kewei Zhang, Zheng Wang, Zhixiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23711
Pdf URL: https://arxiv.org/pdf/2506.23711
Copy Paste: [[2506.23711]] Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion(https://arxiv.org/abs/2506.23711)
Keywords: diffusion
Abstract: We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) huge modality gap between planar sketch and 3D priors in diffusion, and (3) sketch quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectation in a train-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.

Title: System-Embedded Diffusion Bridge Models

Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23726
Pdf URL: https://arxiv.org/pdf/2506.23726
Copy Paste: [[2506.23726]] System-Embedded Diffusion Bridge Models(https://arxiv.org/abs/2506.23726)
Keywords: diffusion, generative
Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.

Title: Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Authors: Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23729
Pdf URL: https://arxiv.org/pdf/2506.23729
Copy Paste: [[2506.23729]] Proteus-ID: ID-Consistent and Motion-Coherent Video Customization(https://arxiv.org/abs/2506.23729)
Keywords: diffusion, self-supervised
Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at this https URL.

Title: Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models

Authors: Michel Meintz, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23731
Pdf URL: https://arxiv.org/pdf/2506.23731
Copy Paste: [[2506.23731]] Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models(https://arxiv.org/abs/2506.23731)
Keywords: diffusion, generative
Abstract: Image generative models have become increasingly popular, but training them requires large datasets that are costly to collect and curate. To circumvent these costs, some parties may exploit existing models by using the generated images as training data for their own models. In general, watermarking is a valuable tool for detecting unauthorized use of generated images. However, when these images are used to train a new model, watermarking can only enable detection if the watermark persists through training and remains identifiable in the outputs of the newly trained model - a property known as radioactivity. We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs). We find that existing watermarking methods for DMs fail to retain radioactivity, as watermarks are either erased during encoding into the latent space or lost in the noising-denoising process (during the training in the latent space). Meanwhile, despite IARs having recently surpassed DMs in image generation quality and efficiency, no radioactive watermarking methods have been proposed for them. To overcome this limitation, we propose the first watermarking method tailored for IARs and with radioactivity in mind - drawing inspiration from techniques in large language models (LLMs), which share IARs' autoregressive paradigm. Our extensive experimental evaluation highlights our method's effectiveness in preserving radioactivity within IARs, enabling robust provenance tracking, and preventing unauthorized use of their generated images.

Title: Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Authors: Annika Mütze, Sadia Ilyas, Christian Dörpelkus, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23751
Pdf URL: https://arxiv.org/pdf/2506.23751
Copy Paste: [[2506.23751]] Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?(https://arxiv.org/abs/2506.23751)
Keywords: diffusion
Abstract: Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. Due to that, it is unclear where to find their limitations, which is of major concern when using in safety-critical applications. Real-world data does not provide sufficient control, required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows to systematically explore the boundaries of model competence/generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using stable diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets, namely LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors in terms of overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location, rather than on object semantics. This provides a systematic approach to challenge open-vocabulary models and gives valuable insights on how data could be acquired to effectively improve these models.

Title: Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors

Authors: Ce Wang, Wanjie Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23801
Pdf URL: https://arxiv.org/pdf/2506.23801
Copy Paste: [[2506.23801]] Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors(https://arxiv.org/abs/2506.23801)
Keywords: diffusion, generative
Abstract: Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gap and significant land cover changes, often leading to under-generation or over-reliance on reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this novel dual-branch design enables reference strength control during inference, enhancing interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.

Title: Adaptive Out-of-Control Point Pattern Detection in Sequential Random Finite Set Observations

Authors: Konstantinos Bourazas, Savvas Papaioannou, Panayiotis Kolios
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23802
Pdf URL: https://arxiv.org/pdf/2506.23802
Copy Paste: [[2506.23802]] Adaptive Out-of-Control Point Pattern Detection in Sequential Random Finite Set Observations(https://arxiv.org/abs/2506.23802)
Keywords: anomaly
Abstract: In this work we introduce a novel adaptive anomaly detection framework specifically designed for monitoring sequential random finite set (RFS) observations. Our approach effectively distinguishes between In-Control data (normal) and Out-Of-Control data (anomalies) by detecting deviations from the expected statistical behavior of the process. The primary contributions of this study include the development of an innovative RFS-based framework that not only learns the normal behavior of the data-generating process online but also dynamically adapts to behavioral shifts to accurately identify abnormal point patterns. To achieve this, we introduce a new class of RFS-based posterior distributions, named Power Discounting Posteriors (PD), which facilitate adaptation to systematic changes in data while enabling anomaly detection of point pattern data through a novel predictive posterior density function. The effectiveness of the proposed approach is demonstrated by extensive qualitative and quantitative simulation experiments.

Title: MadCLIP: Few-shot Medical Anomaly Detection with CLIP

Authors: Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23810
Pdf URL: https://arxiv.org/pdf/2506.23810
Copy Paste: [[2506.23810]] MadCLIP: Few-shot Medical Anomaly Detection with CLIP(https://arxiv.org/abs/2506.23810)
Keywords: anomaly
Abstract: An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at this https URL.

Title: Refine Any Object in Any Scene

Authors: Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23835
Pdf URL: https://arxiv.org/pdf/2506.23835
Copy Paste: [[2506.23835]] Refine Any Object in Any Scene(https://arxiv.org/abs/2506.23835)
Keywords: generative
Abstract: Viewpoint missing of objects is common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object understanding and appearance modeling. In this paper, we introduce Refine Any object In any ScenE (RAISE), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting from substituting degraded objects with proxies, via a 3D generative model with strong 3D understanding, RAISE progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DOF pose, followed by correcting spatial and appearance inconsistencies via registration-constrained enhancement. This two-stage refinement ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Extensive experiments on challenging benchmarks show that RAISE significantly outperforms state-of-the-art methods in both novel view synthesis and geometry completion tasks. RAISE is made publicly available at this https URL.

Title: VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Authors: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23858
Pdf URL: https://arxiv.org/pdf/2506.23858
Copy Paste: [[2506.23858]] VMoBA: Mixture-of-Block Attention for Video Diffusion Models(https://arxiv.org/abs/2506.23858)
Keywords: diffusion
Abstract: The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.

Title: Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages

Authors: Ikechukwu Ogbonna, Lesley Davidson, Soumya Banerjee, Abhishek Dasgupta, Laurence Kenney, Vikranth Harthikote Nagaraja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23958
Pdf URL: https://arxiv.org/pdf/2506.23958
Copy Paste: [[2506.23958]] Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages(https://arxiv.org/abs/2506.23958)
Keywords: generative
Abstract: Millions of people in African countries face barriers to accessing healthcare due to language and literacy gaps. This research tackles this challenge by transforming complex medical documents -- in this case, prosthetic device user manuals -- into accessible formats for underserved populations. This case study in cross-cultural translation is particularly pertinent/relevant for communities that receive donated prosthetic devices but may not receive the accompanying user documentation. Or, if available online, may only be available in formats (e.g., language and readability) that are inaccessible to local populations (e.g., English-language, high resource settings/cultural context). The approach is demonstrated using the widely spoken Pidgin dialect, but our open-source framework has been designed to enable rapid and easy extension to other languages/dialects. This work presents an AI-powered framework designed to process and translate complex medical documents, e.g., user manuals for prosthetic devices, into marginalised languages. The system enables users -- such as healthcare workers or patients -- to upload English-language medical equipment manuals, pose questions in their native language, and receive accurate, localised answers in real time. Technically, the system integrates a Retrieval-Augmented Generation (RAG) pipeline for processing and semantic understanding of the uploaded manuals. It then employs advanced Natural Language Processing (NLP) models for generative question-answering and multilingual translation. Beyond simple translation, it ensures accessibility to device instructions, treatment protocols, and safety information, empowering patients and clinicians to make informed healthcare decisions.

Title: Visual and Memory Dual Adapter for Multi-Modal Object Tracking

Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23972
Pdf URL: https://arxiv.org/pdf/2506.23972
Copy Paste: [[2506.23972]] Visual and Memory Dual Adapter for Multi-Modal Object Tracking(https://arxiv.org/abs/2506.23972)
Keywords: foundation model
Abstract: Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from auxiliary modality to dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at this https URL.

Title: Ella: Embodied Social Agents with Lifelong Memory

Authors: Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.24019
Pdf URL: https://arxiv.org/pdf/2506.24019
Copy Paste: [[2506.24019]] Ella: Embodied Social Agents with Lifelong Memory(https://arxiv.org/abs/2506.24019)
Keywords: foundation model
Abstract: We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at this https URL.

Title: Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data

Authors: Shubhabrata Mukherjee, Jack Lang, Obeen Kwon, Iryna Zenyuk, Valerie Brogden, Adam Weber, Daniela Ushizima
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2506.24039
Pdf URL: https://arxiv.org/pdf/2506.24039
Copy Paste: [[2506.24039]] Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data(https://arxiv.org/abs/2506.24039)
Keywords: foundation model
Abstract: Zero-shot and prompt-based technologies capitalized on using frequently occurring images to transform visual reasoning tasks, which explains why such technologies struggle with valuable yet scarce scientific image sets. In this work, we propose Zenesis, a comprehensive no-code interactive platform designed to minimize barriers posed by data readiness for scientific images. We develop lightweight multi-modal adaptation techniques that enable zero-shot operation on raw scientific data, along with human-in-the-loop refinement and heuristic-based temporal enhancement options. We demonstrate the performance of our approach through comprehensive comparison and validation on challenging Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) data of catalyst-loaded membranes. Zenesis significantly outperforms baseline methods, achieving an average accuracy of 0.947, an Intersection over Union (IOU) of 0.858, and a Dice score of 0.923 for amorphous catalyst samples and accuracy of 0.987, an IOU of 0.857, and a Dice score of 0.923 for crystalline samples. These results mark a substantial improvement over traditional methods like Otsu thresholding and even advanced models like Segment Anything Model (SAM) when used in isolation. Our results demonstrate that Zenesis is a powerful tool for scientific applications, particularly in fields where high-quality annotated datasets are unavailable, accelerating accurate analysis of experimental imaging.

Title: Faster Diffusion Models via Higher-Order Approximation

Authors: Gen Li, Yuchen Zhou, Yuting Wei, Yuxin Chen
Subjects: cs.LG, math.NA, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2506.24042
Pdf URL: https://arxiv.org/pdf/2506.24042
Copy Paste: [[2506.24042]] Faster Diffusion Models via Higher-Order Approximation(https://arxiv.org/abs/2506.24042)
Keywords: diffusion
Abstract: In this paper, we explore provable acceleration of diffusion models without any additional retraining. Focusing on the task of approximating a target data distribution in $\mathbb{R}^d$ to within $\varepsilon$ total-variation distance, we propose a principled, training-free sampling algorithm that requires only the order of $$ d^{1+2/K} \varepsilon^{-1/K} $$ score function evaluations (up to log factor) in the presence of accurate scores, where $K$ is an arbitrarily large fixed integer. This result applies to a broad class of target data distributions, without the need for assumptions such as smoothness or log-concavity. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases -- without demanding higher-order smoothness on the score estimates as assumed in previous work. The proposed algorithm draws insight from high-order ODE solvers, leveraging high-order Lagrange interpolation and successive refinement to approximate the integral derived from the probability flow ODE.

Title: Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

Authors: Deng Li, Aming Wu, Yang Li, Yaowei Wang, Yahong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24063
Pdf URL: https://arxiv.org/pdf/2506.24063
Copy Paste: [[2506.24063]] Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios(https://arxiv.org/abs/2506.24063)
Keywords: diffusion
Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.

Title: Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

Authors: Wonwoong Cho, Yanxia Zhang, Yan-Ying Chen, David I. Inouye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24085
Pdf URL: https://arxiv.org/pdf/2506.24085
Copy Paste: [[2506.24085]] Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention(https://arxiv.org/abs/2506.24085)
Keywords: diffusion, generative
Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.

Title: MotionGPT3: Human Motion as a Second Modality

Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.24086
Pdf URL: https://arxiv.org/pdf/2506.24086
Copy Paste: [[2506.24086]] MotionGPT3: Human Motion as a Second Modality(https://arxiv.org/abs/2506.24086)
Keywords: diffusion
Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework within an autoregressive manner.

Title: Epona: Autoregressive Diffusion World Model for Autonomous Driving

Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24113
Pdf URL: https://arxiv.org/pdf/2506.24113
Copy Paste: [[2506.24113]] Epona: Autoregressive Diffusion World Model for Autonomous Driving(https://arxiv.org/abs/2506.24113)
Keywords: diffusion
Abstract: Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4\% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at \href{this https URL}{this https URL}.

Title: TextMesh4D: High-Quality Text-to-4D Mesh Generation

Authors: Sisi Dai, Xinxin Su, Boyan Wan, Ruizhen Hu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24121
Pdf URL: https://arxiv.org/pdf/2506.24121
Copy Paste: [[2506.24121]] TextMesh4D: High-Quality Text-to-4D Mesh Generation(https://arxiv.org/abs/2506.24121)
Keywords: diffusion, generative
Abstract: Recent advancements in diffusion generative models significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with a low GPU memory overhead-requiring only a single 24GB GPU-offering a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.

Title: Calligrapher: Freestyle Text Image Customization

Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24123
Pdf URL: https://arxiv.org/pdf/2506.24123
Copy Paste: [[2506.24123]] Calligrapher: Freestyle Text Image Customization(https://arxiv.org/abs/2506.24123)
Keywords: diffusion, generative, in-context
Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.