2025-05-13

Title: Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation

Authors: Gabriele Rosi, Fabio Cermelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06280
Pdf URL: https://arxiv.org/pdf/2505.06280
Copy Paste: [[2505.06280]] Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation(https://arxiv.org/abs/2505.06280)
Keywords: foundation model
Abstract: Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.

Title: A Data-Driven Probabilistic Framework for Cascading Urban Risk Analysis Using Bayesian Networks

Authors: Chunduru Rohith Kumar, PHD Surya Shanmuk, Prabhala Naga Srinivas, Sri Venkatesh Lankalapalli, Debasis Dwibedy
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06281
Pdf URL: https://arxiv.org/pdf/2505.06281
Copy Paste: [[2505.06281]] A Data-Driven Probabilistic Framework for Cascading Urban Risk Analysis Using Bayesian Networks(https://arxiv.org/abs/2505.06281)
Keywords: generative
Abstract: The increasing complexity of cascading risks in urban systems necessitates robust, data-driven frameworks to model interdependencies across multiple domains. This study presents a foundational Bayesian network-based approach for analyzing cross-domain risk propagation across key urban domains, including air, water, electricity, agriculture, health, infrastructure, weather, and climate. Directed Acyclic Graphs (DAGs) are constructed using Bayesian Belief Networks (BBNs), with structure learning guided by Hill-Climbing search optimized through Bayesian Information Criterion (BIC) and K2 scoring. The framework is trained on a hybrid dataset that combines real-world urban indicators with synthetically generated data from Generative Adversarial Networks (GANs), and is further balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Conditional Probability Tables (CPTs) derived from the learned structures enable interpretable probabilistic reasoning and quantify the likelihood of cascading failures. The results identify key intra- and inter-domain risk factors and demonstrate the framework's utility for proactive urban resilience planning. This work establishes a scalable, interpretable foundation for cascading risk assessment and serves as a basis for future empirical research in this emerging interdisciplinary field.

Title: InfoNCE is a Free Lunch for Semantically guided Graph Contrastive Learning

Authors: Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06282
Pdf URL: https://arxiv.org/pdf/2505.06282
Copy Paste: [[2505.06282]] InfoNCE is a Free Lunch for Semantically guided Graph Contrastive Learning(https://arxiv.org/abs/2505.06282)
Keywords: self-supervised, foundation model
Abstract: As an important graph pre-training method, Graph Contrastive Learning (GCL) continues to play a crucial role in the ongoing surge of research on graph foundation models or LLM as enhancer for graphs. Traditional GCL optimizes InfoNCE by using augmentations to define self-supervised tasks, treating augmented pairs as positive samples and others as negative. However, this leads to semantically similar pairs being classified as negative, causing significant sampling bias and limiting performance. In this paper, we argue that GCL is essentially a Positive-Unlabeled (PU) learning problem, where the definition of self-supervised tasks should be semantically guided, i.e., augmented samples with similar semantics are considered positive, while others, with unknown semantics, are treated as unlabeled. From this perspective, the key lies in how to extract semantic information. To achieve this, we propose IFL-GCL, using InfoNCE as a "free lunch" to extract semantic information. Specifically, We first prove that under InfoNCE, the representation similarity of node pairs aligns with the probability that the corresponding contrastive sample is positive. Then we redefine the maximum likelihood objective based on the corrected samples, leading to a new InfoNCE loss function. Extensive experiments on both the graph pretraining framework and LLM as an enhancer show significantly improvements of IFL-GCL in both IID and OOD scenarios, achieving up to a 9.05% improvement, validating the effectiveness of semantically guided. Code for IFL-GCL is publicly available at: this https URL.

Title: UniCO: Towards a Unified Model for Combinatorial Optimization Problems

Authors: Zefang Zong, Xiaochen Wei, Guozhen Zhang, Chen Gao, Huandong Wang, Yong Li
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2505.06290
Pdf URL: https://arxiv.org/pdf/2505.06290
Copy Paste: [[2505.06290]] UniCO: Towards a Unified Model for Combinatorial Optimization Problems(https://arxiv.org/abs/2505.06290)
Keywords: self-supervised
Abstract: Combinatorial Optimization (CO) encompasses a wide range of problems that arise in many real-world scenarios. While significant progress has been made in developing learning-based methods for specialized CO problems, a unified model with a single architecture and parameter set for diverse CO problems remains elusive. Such a model would offer substantial advantages in terms of efficiency and convenience. In this paper, we introduce UniCO, a unified model for solving various CO problems. Inspired by the success of next-token prediction, we frame each problem-solving process as a Markov Decision Process (MDP), tokenize the corresponding sequential trajectory data, and train the model using a transformer backbone. To reduce token length in the trajectory data, we propose a CO-prefix design that aggregates static problem features. To address the heterogeneity of state and action tokens within the MDP, we employ a two-stage self-supervised learning approach. In this approach, a dynamic prediction model is first trained and then serves as a pre-trained model for subsequent policy generation. Experiments across 10 CO problems showcase the versatility of UniCO, emphasizing its ability to generalize to new, unseen problems with minimal fine-tuning, achieving even few-shot or zero-shot performance. Our framework offers a valuable complement to existing neural CO methods that focus on optimizing performance for individual problems.

Title: Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction

Authors: Yu Mao, Holger Pirk, Chun Jason Xue
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.06297
Pdf URL: https://arxiv.org/pdf/2505.06297
Copy Paste: [[2505.06297]] Lossless Compression of Large Language Model-Generated Text via Next-Token Prediction(https://arxiv.org/abs/2505.06297)
Keywords: generative
Abstract: As large language models (LLMs) continue to be deployed and utilized across domains, the volume of LLM-generated data is growing rapidly. This trend highlights the increasing importance of effective and lossless compression for such data in modern text management systems. However, compressing LLM-generated data presents unique challenges compared to traditional human- or machine-generated content. Traditional machine-generated data is typically derived from computational processes or device outputs, often highly structured and limited to low-level elements like labels or numerical values. This structure enables conventional lossless compressors to perform efficiently. In contrast, LLM-generated data is more complex and diverse, requiring new approaches for effective compression. In this work, we conduct the first systematic investigation of lossless compression techniques tailored specifically to LLM-generated data. Notably, because LLMs are trained via next-token prediction, we find that LLM-generated data is highly predictable for the models themselves. This predictability enables LLMs to serve as efficient compressors of their own outputs. Through extensive experiments with 14 representative LLMs and 8 LLM-generated datasets from diverse domains, we show that LLM-based prediction methods achieve remarkable compression rates, exceeding 20x, far surpassing the 3x rate achieved by Gzip, a widely used general-purpose compressor. Furthermore, this advantage holds across different LLM sizes and dataset types, demonstrating the robustness and practicality of LLM-based methods in lossless text compression under generative AI workloads.

Title: Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring

Authors: Junyu Xue, Xudong Wang, Xiaoling He, Shicheng Liu, Yi Wang, Guoming Tang
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2505.06330
Pdf URL: https://arxiv.org/pdf/2505.06330
Copy Paste: [[2505.06330]] Prompting Large Language Models for Training-Free Non-Intrusive Load Monitoring(https://arxiv.org/abs/2505.06330)
Keywords: in-context
Abstract: Non-intrusive Load Monitoring (NILM) aims to disaggregate aggregate household electricity consumption into individual appliance usage, enabling more effective energy management. While deep learning has advanced NILM, it remains limited by its dependence on labeled data, restricted generalization, and lack of interpretability. In this paper, we introduce the first prompt-based NILM framework that leverages Large Language Models (LLMs) with in-context learning. We design and evaluate prompt strategies that integrate appliance features, timestamps and contextual information, as well as representative time-series examples, using the REDD dataset. With optimized prompts, LLMs achieve competitive state detection accuracy, reaching an average F1-score of 0.676 on unseen households, and demonstrate robust generalization without the need for fine-tuning. LLMs also enhance interpretability by providing clear, human-readable explanations for their predictions. Our results show that LLMs can reduce data requirements, improve adaptability, and provide transparent energy disaggregation in NILM applications.

Title: NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines

Authors: Chathurangi Shyalika, Renjith Prasad, Fadi El Kalach, Revathy Venkataramanan, Ramtin Zand, Ramy Harik, Amit Sheth
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06333
Pdf URL: https://arxiv.org/pdf/2505.06333
Copy Paste: [[2505.06333]] NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines(https://arxiv.org/abs/2505.06333)
Keywords: anomaly
Abstract: In modern assembly pipelines, identifying anomalies is crucial in ensuring product quality and operational efficiency. Conventional single-modality methods fail to capture the intricate relationships required for precise anomaly prediction in complex predictive environments with abundant data and multiple modalities. This paper proposes a neurosymbolic AI and fusion-based approach for multimodal anomaly prediction in assembly pipelines. We introduce a time series and image-based fusion model that leverages decision-level fusion techniques. Our research builds upon three primary novel approaches in multimodal learning: time series and image-based decision-level fusion modeling, transfer learning for fusion, and knowledge-infused learning. We evaluate the novel method using our derived and publicly available multimodal dataset and conduct comprehensive ablation studies to assess the impact of our preprocessing techniques and fusion model compared to traditional baselines. The results demonstrate that a neurosymbolic AI-based fusion approach that uses transfer learning can effectively harness the complementary strengths of time series and image data, offering a robust and interpretable approach for anomaly prediction in assembly pipelines with enhanced performance. \noindent The datasets, codes to reproduce the results, supplementary materials, and demo are available at this https URL.

Title: The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Authors: Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan Wu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, Mosharaf Chowdhury
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06371
Pdf URL: https://arxiv.org/pdf/2505.06371
Copy Paste: [[2505.06371]] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization(https://arxiv.org/abs/2505.06371)
Keywords: generative
Abstract: As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the this http URL Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding this http URL Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the this http URL Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The this http URL Benchmark is open-source and can be easily extended to various customized models and application scenarios.

Title: Engineering Risk-Aware, Security-by-Design Frameworks for Assurance of Large-Scale Autonomous AI Models

Authors: Krti Tallam
Subjects: cs.CR, cs.AI, cs.ET, cs.LG, cs.MA, eess.SY
Abstract URL: https://arxiv.org/abs/2505.06409
Pdf URL: https://arxiv.org/pdf/2505.06409
Copy Paste: [[2505.06409]] Engineering Risk-Aware, Security-by-Design Frameworks for Assurance of Large-Scale Autonomous AI Models(https://arxiv.org/abs/2505.06409)
Keywords: anomaly
Abstract: As AI models scale to billions of parameters and operate with increasing autonomy, ensuring their safe, reliable operation demands engineering-grade security and assurance frameworks. This paper presents an enterprise-level, risk-aware, security-by-design approach for large-scale autonomous AI systems, integrating standardized threat metrics, adversarial hardening techniques, and real-time anomaly detection into every phase of the development lifecycle. We detail a unified pipeline - from design-time risk assessments and secure training protocols to continuous monitoring and automated audit logging - that delivers provable guarantees of model behavior under adversarial and operational stress. Case studies in national security, open-source model governance, and industrial automation demonstrate measurable reductions in vulnerability and compliance overhead. Finally, we advocate cross-sector collaboration - uniting engineering teams, standards bodies, and regulatory agencies - to institutionalize these technical safeguards within a resilient, end-to-end assurance ecosystem for the next generation of AI.

Title: My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing

Authors: Jingrui He, Andrew Stephen McGough
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06436
Pdf URL: https://arxiv.org/pdf/2505.06436
Copy Paste: [[2505.06436]] My Emotion on your face: The use of Facial Keypoint Detection to preserve Emotions in Latent Space Editing(https://arxiv.org/abs/2505.06436)
Keywords: generative
Abstract: Generative Adversarial Network approaches such as StyleGAN/2 provide two key benefits: the ability to generate photo-realistic face images and possessing a semantically structured latent space from which these images are created. Many approaches have emerged for editing images derived from vectors in the latent space of a pre-trained StyleGAN/2 models by identifying semantically meaningful directions (e.g., gender or age) in the latent space. By moving the vector in a specific direction, the ideal result would only change the target feature while preserving all the other features. Providing an ideal data augmentation approach for gesture research as it could be used to generate numerous image variations whilst keeping the facial expressions intact. However, entanglement issues, where changing one feature inevitably affects other features, impacts the ability to preserve facial expressions. To address this, we propose the use of an addition to the loss function of a Facial Keypoint Detection model to restrict changes to the facial expressions. Building on top of an existing model, adding the proposed Human Face Landmark Detection (HFLD) loss, provided by a pre-trained Facial Keypoint Detection model, to the original loss function. We quantitatively and qualitatively evaluate the existing and our extended model, showing the effectiveness of our approach in addressing the entanglement issue and maintaining the facial expression. Our approach achieves up to 49% reduction in the change of emotion in our experiments. Moreover, we show the benefit of our approach by comparing with state-of-the-art models. By increasing the ability to preserve the facial gesture and expression during facial transformation, we present a way to create human face images with fixed expression but different appearances, making it a reliable data augmentation approach for Facial Gesture and Expression research.

Title: Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency

Authors: Binwen Liu, Peiyu Xu, Quan Yuan, Yihong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.06475
Pdf URL: https://arxiv.org/pdf/2505.06475
Copy Paste: [[2505.06475]] Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency(https://arxiv.org/abs/2505.06475)
Keywords: in-context
Abstract: We investigate in-context learning (ICL) through a meticulous experimental framework that systematically varies task complexity and model architecture. Extending beyond the linear regression baseline, we introduce Gaussian kernel regression and nonlinear dynamical system tasks, which emphasize temporal and recursive reasoning. We evaluate four distinct models: a GPT2-style Transformer, a Transformer with FlashAttention mechanism, a convolutional Hyena-based model, and the Mamba state-space model. Each model is trained from scratch on synthetic datasets and assessed for generalization during testing. Our findings highlight that model architecture significantly shapes ICL performance. The standard Transformer demonstrates robust performance across diverse tasks, while Mamba excels in temporally structured dynamics. Hyena effectively captures long-range dependencies but shows higher variance early in training, and FlashAttention offers computational efficiency but is more sensitive in low-data regimes. Further analysis uncovers locality-induced shortcuts in Gaussian kernel tasks, enhanced nonlinear separability through input range scaling, and the critical role of curriculum learning in mastering high-dimensional tasks.

Title: Learning from the Good Ones: Risk Profiling-Based Defenses Against Evasion Attacks on DNNs

Authors: Mohammed Elnawawy, Gargi Mitra, Shahrear Iqbal, Karthik Pattabiraman
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.06477
Pdf URL: https://arxiv.org/pdf/2505.06477
Copy Paste: [[2505.06477]] Learning from the Good Ones: Risk Profiling-Based Defenses Against Evasion Attacks on DNNs(https://arxiv.org/abs/2505.06477)
Keywords: anomaly
Abstract: Safety-critical applications such as healthcare and autonomous vehicles use deep neural networks (DNN) to make predictions and infer decisions. DNNs are susceptible to evasion attacks, where an adversary crafts a malicious data instance to trick the DNN into making wrong decisions at inference time. Existing defenses that protect DNNs against evasion attacks are either static or dynamic. Static defenses are computationally efficient but do not adapt to the evolving threat landscape, while dynamic defenses are adaptable but suffer from an increased computational overhead. To combine the best of both worlds, in this paper, we propose a novel risk profiling framework that uses a risk-aware strategy to selectively train static defenses using victim instances that exhibit the most resilient features and are hence more resilient against an evasion attack. We hypothesize that training existing defenses on instances that are less vulnerable to the attack enhances the adversarial detection rate by reducing false negatives. We evaluate the efficacy of our risk-aware selective training strategy on a blood glucose management system that demonstrates how training static anomaly detectors indiscriminately may result in an increased false negative rate, which could be life-threatening in safety-critical applications. Our experiments show that selective training on the less vulnerable patients achieves a recall increase of up to 27.5\% with minimal impact on precision compared to indiscriminate training.

Title: System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection

Authors: Jiawei Guo, Haipeng Cai
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06493
Pdf URL: https://arxiv.org/pdf/2505.06493
Copy Paste: [[2505.06493]] System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection(https://arxiv.org/abs/2505.06493)
Keywords: generative
Abstract: Large language models (LLMs) have gained widespread adoption across diverse applications due to their impressive generative capabilities. Their plug-and-play nature enables both developers and end users to interact with these models through simple prompts. However, as LLMs become more integrated into various systems in diverse domains, concerns around their security are growing. Existing studies mainly focus on threats arising from user prompts (e.g. prompt injection attack) and model output (e.g. model inversion attack), while the security of system prompts remains largely overlooked. This work bridges the critical gap. We introduce system prompt poisoning, a new attack vector against LLMs that, unlike traditional user prompt injection, poisons system prompts hence persistently impacts all subsequent user interactions and model responses. We systematically investigate four practical attack strategies in various poisoning scenarios. Through demonstration on both generative and reasoning LLMs, we show that system prompt poisoning is highly feasible without requiring jailbreak techniques, and effective across a wide range of tasks, including those in mathematics, coding, logical reasoning, and natural language processing. Importantly, our findings reveal that the attack remains effective even when user prompts employ advanced prompting techniques like chain-of-thought (CoT). We also show that such techniques, including CoT and retrieval-augmentation-generation (RAG), which are proven to be effective for improving LLM performance in a wide range of tasks, are significantly weakened in their effectiveness by system prompt poisoning.

Title: HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

Authors: Hang Wang, Zhi-Qi Cheng, Chenhao Lin, Chao Shen, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06512
Pdf URL: https://arxiv.org/pdf/2505.06512
Copy Paste: [[2505.06512]] HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation(https://arxiv.org/abs/2505.06512)
Keywords: diffusion
Abstract: Text-to-image synthesis has progressed to the point where models can generate visually compelling images from natural language prompts. Yet, existing methods often fail to reconcile high-level semantic fidelity with explicit spatial control, particularly in scenes involving multiple objects, nuanced relations, or complex layouts. To bridge this gap, we propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step: a global module that continuously aligns latent representations with textual descriptions to ensure scene-level coherence, and a local module that employs bounding-box layouts to anchor objects at specified locations, enabling fine-grained spatial control. Extensive experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines, achieving a 0.69 improvement in Frechet Inception Distance (FID) and a 0.0295 gain in CLIP Score. These results demonstrate HCMA's effectiveness in faithfully capturing intricate textual semantics while adhering to user-defined spatial constraints, offering a robust solution for semantically grounded image this http URL code is available at this https URL

Title: Improving Generalization of Medical Image Registration Foundation Model

Authors: Jing Hu, Kaiwei Yu, Hongjiang Xian, Shu Hu, Xin Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06527
Pdf URL: https://arxiv.org/pdf/2505.06527
Copy Paste: [[2505.06527]] Improving Generalization of Medical Image Registration Foundation Model(https://arxiv.org/abs/2505.06527)
Keywords: foundation model
Abstract: Deformable registration is a fundamental task in medical image processing, aiming to achieve precise alignment by establishing nonlinear correspondences between images. Traditional methods offer good adaptability and interpretability but are limited by computational efficiency. Although deep learning approaches have significantly improved registration speed and accuracy, they often lack flexibility and generalizability across different datasets and tasks. In recent years, foundation models have emerged as a promising direction, leveraging large and diverse datasets to learn universal features and transformation patterns for image registration, thus demonstrating strong cross-task transferability. However, these models still face challenges in generalization and robustness when encountering novel anatomical structures, varying imaging conditions, or unseen modalities. To address these limitations, this paper incorporates Sharpness-Aware Minimization (SAM) into foundation models to enhance their generalization and robustness in medical image registration. By optimizing the flatness of the loss landscape, SAM improves model stability across diverse data distributions and strengthens its ability to handle complex clinical scenarios. Experimental results show that foundation models integrated with SAM achieve significant improvements in cross-dataset registration performance, offering new insights for the advancement of medical image registration technology. Our code is available at this https URL}{this https URL\_sam.

Title: ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

Authors: Xianghao Kong, Qiaosong Qi, Yuanbin Wang, Anyi Rao, Biaolong Chen, Aixi Zhang, Si Liu, Hao Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06537
Pdf URL: https://arxiv.org/pdf/2505.06537
Copy Paste: [[2505.06537]] ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images(https://arxiv.org/abs/2505.06537)
Keywords: diffusion
Abstract: Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives. Moreover, the widely adopted motion module does not sufficiently model human body movement, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework leveraging multiple reference images to achieve improved view consistency and temporal coherency. To effectively leverage features from multiple reference images while maintaining a reasonable computational cost, we devise a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes, which serve as guidance in the denoising process. To further enhance motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits the human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we extensively evaluate our method on the MRFashion-7K dataset we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.

Title: HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models

Authors: Shuhan Zhuang, Mengqi Huang, Fengyi Fu, Nan Chen, Bohan Lei, Zhendong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06543
Pdf URL: https://arxiv.org/pdf/2505.06543
Copy Paste: [[2505.06543]] HDGlyph: A Hierarchical Disentangled Glyph-Based Framework for Long-Tail Text Rendering in Diffusion Models(https://arxiv.org/abs/2505.06543)
Keywords: diffusion
Abstract: Visual text rendering, which aims to accurately integrate specified textual content within generated images, is critical for various applications such as commercial design. Despite recent advances, current methods struggle with long-tail text cases, particularly when handling unseen or small-sized text. In this work, we propose a novel Hierarchical Disentangled Glyph-Based framework (HDGlyph) that hierarchically decouples text generation from non-text visual synthesis, enabling joint optimization of both common and long-tail text rendering. At the training stage, HDGlyph disentangles pixel-level representations via the Multi-Linguistic GlyphNet and the Glyph-Aware Perceptual Loss, ensuring robust rendering even for unseen characters. At inference time, HDGlyph applies Noise-Disentangled Classifier-Free Guidance and Latent-Disentangled Two-Stage Rendering (LD-TSR) scheme, which refines both background and small-sized text. Extensive evaluations show our model consistently outperforms others, with 5.08% and 11.7% accuracy gains in English and Chinese text rendering while maintaining high image quality. It also excels in long-tail scenarios with strong accuracy and visual performance.

Title: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection

Authors: Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06603
Pdf URL: https://arxiv.org/pdf/2505.06603
Copy Paste: [[2505.06603]] ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection(https://arxiv.org/abs/2505.06603)
Keywords: diffusion, generative, anomaly
Abstract: Continual Anomaly Detection (CAD) enables anomaly detection models in learning new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replay high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at this https URL.

Title: AI-Powered Anomaly Detection with Blockchain for Real-Time Security and Reliability in Autonomous Vehicles

Authors: Rathin Chandra Shit, Sharmila Subudhi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06632
Pdf URL: https://arxiv.org/pdf/2505.06632
Copy Paste: [[2505.06632]] AI-Powered Anomaly Detection with Blockchain for Real-Time Security and Reliability in Autonomous Vehicles(https://arxiv.org/abs/2505.06632)
Keywords: anomaly
Abstract: Autonomous Vehicles (AV) proliferation brings important and pressing security and reliability issues that must be dealt with to guarantee public safety and help their widespread adoption. The contribution of the proposed research is towards achieving more secure, reliable, and trustworthy autonomous transportation system by providing more capabilities for anomaly detection, data provenance, and real-time response in safety critical AV deployments. In this research, we develop a new framework that combines the power of Artificial Intelligence (AI) for real-time anomaly detection with blockchain technology to detect and prevent any malicious activity including sensor failures in AVs. Through Long Short-Term Memory (LSTM) networks, our approach continually monitors associated multi-sensor data streams to detect anomalous patterns that may represent cyberattacks as well as hardware malfunctions. Further, this framework employs a decentralized platform for securely storing sensor data and anomaly alerts in a blockchain ledger for data incorruptibility and authenticity, while offering transparent forensic features. Moreover, immediate automated response mechanisms are deployed using smart contracts when anomalies are found. This makes the AV system more resilient to attacks from both cyberspace and hardware component failure. Besides, we identify potential challenges of scalability in handling high frequency sensor data, computational constraint in resource constrained environment, and of distributed data storage in terms of privacy.

Title: Dataset Distillation with Probabilistic Latent Features

Authors: Zhe Li, Sarah Cechnicka, Cheng Ouyang, Katharina Breininger, Peter Schüffler, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06647
Pdf URL: https://arxiv.org/pdf/2505.06647
Copy Paste: [[2505.06647]] Dataset Distillation with Probabilistic Latent Features(https://arxiv.org/abs/2505.06647)
Keywords: generative
Abstract: As deep learning models grow in complexity and the volume of training data increases, reducing storage and computational costs becomes increasingly important. Dataset distillation addresses this challenge by synthesizing a compact set of synthetic data that can effectively replace the original dataset in downstream classification tasks. While existing methods typically rely on mapping data from pixel space to the latent space of a generative model, we propose a novel stochastic approach that models the joint distribution of latent features. This allows our method to better capture spatial structures and produce diverse synthetic samples, which benefits model training. Specifically, we introduce a low-rank multivariate normal distribution parameterized by a lightweight network. This design maintains low computational complexity and is compatible with various matching networks used in dataset distillation. After distillation, synthetic images are generated by feeding the learned latent features into a pretrained generator. These synthetic images are then used to train classification models, and performance is evaluated on real test set. We validate our method on several benchmarks, including ImageNet subsets, CIFAR-10, and the MedMNIST histopathological dataset. Our approach achieves state-of-the-art cross architecture performance across a range of backbone architectures, demonstrating its generality and effectiveness.

Title: TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

Authors: Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.06660
Pdf URL: https://arxiv.org/pdf/2505.06660
Copy Paste: [[2505.06660]] TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models(https://arxiv.org/abs/2505.06660)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.

Title: StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation

Authors: Ziyi Wang, Haipeng Li, Lin Sui, Tianhao Zhou, Hai Jiang, Lang Nie, Shuaicheng Liu
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2505.06668
Pdf URL: https://arxiv.org/pdf/2505.06668
Copy Paste: [[2505.06668]] StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation(https://arxiv.org/abs/2505.06668)
Keywords: diffusion
Abstract: We present StableMotion, a novel framework leverages knowledge (geometry and content priors) from pretrained large-scale image diffusion models to perform motion estimation, solving single-image-based image rectification tasks such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC). Specifically, StableMotion framework takes text-to-image Stable Diffusion (SD) models as backbone and repurposes it into an image-to-motion estimator. To mitigate inconsistent output produced by diffusion models, we propose Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive, high-fidelity result. Additionally, we present the concept of Sampling Steps Disaster (SSD), the counterintuitive scenario where increasing the number of sampling steps can lead to poorer outcomes, which enables our framework to achieve one-step inference. StableMotion is verified on two image rectification tasks and delivers state-of-the-art performance in both, as well as showing strong generalizability. Supported by SSD, StableMotion offers a speedup of 200 times compared to previous diffusion model-based methods.

Title: Video Dataset Condensation with Diffusion Models

Authors: Zhe Li, Hadrien Reynaud, Mischa Dombrowski, Sarah Cechnicka, Franciskus Xaverius Erick, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06670
Pdf URL: https://arxiv.org/pdf/2505.06670
Copy Paste: [[2505.06670]] Video Dataset Condensation with Diffusion Models(https://arxiv.org/abs/2505.06670)
Keywords: diffusion
Abstract: In recent years, the rapid expansion of dataset sizes and the increasing complexity of deep learning models have significantly escalated the demand for computational resources, both for data storage and model training. Dataset distillation has emerged as a promising solution to address this challenge by generating a compact synthetic dataset that retains the essential information from a large real dataset. However, existing methods often suffer from limited performance and poor data quality, particularly in the video domain. In this paper, we focus on video dataset distillation by employing a video diffusion model to generate high-quality synthetic videos. To enhance representativeness, we introduce Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos that effectively captures the characteristics of the original dataset. To further optimize computational efficiency, we explore a training-free clustering algorithm, Temporal-Aware Cluster-based Distillation (TAC-DT), to select representative videos without requiring additional training overhead. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to $10.61\%$ over the state-of-the-art. Our method consistently outperforms existing approaches across all datasets, establishing a new benchmark for video dataset distillation.

Title: Jailbreaking the Text-to-Video Generative Models

Authors: Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rongcheng Tu, Wenbo Zhou, Xiaochun Cao, Dacheng Tao, Siew Kei Lam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06679
Pdf URL: https://arxiv.org/pdf/2505.06679
Copy Paste: [[2505.06679]] Jailbreaking the Text-to-Video Generative Models(https://arxiv.org/abs/2505.06679)
Keywords: diffusion, generative
Abstract: Text-to-video generative models have achieved significant progress, driven by the rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attack, i.e. to generate unsafe content, including pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the \textit{first} optimization-based jailbreak attack against text-to-video models, which is specifically designed. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration, selecting the most effective one based on the averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate compared to baseline methods but also generates videos with greater semantic similarity to the original input prompts.

Title: Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws

Authors: Xiyuan Wei, Ming Lin, Fanjiang Ye, Fengguang Song, Liangliang Cao, My T. That, Tianbao Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06699
Pdf URL: https://arxiv.org/pdf/2505.06699
Copy Paste: [[2505.06699]] Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws(https://arxiv.org/abs/2505.06699)
Keywords: foundation model
Abstract: This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named $\textbf{model steering}$. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called $\textbf{DRRho risk minimization}$, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches.

Title: SimMIL: A Universal Weakly Supervised Pre-Training Framework for Multi-Instance Learning in Whole Slide Pathology Images

Authors: Yicheng Song, Tiancheng Lin, Die Peng, Su Yang, Yi Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06710
Pdf URL: https://arxiv.org/pdf/2505.06710
Copy Paste: [[2505.06710]] SimMIL: A Universal Weakly Supervised Pre-Training Framework for Multi-Instance Learning in Whole Slide Pathology Images(https://arxiv.org/abs/2505.06710)
Keywords: self-supervised
Abstract: Various multi-instance learning (MIL) based approaches have been developed and successfully applied to whole-slide pathological images (WSI). Existing MIL methods emphasize the importance of feature aggregators, but largely neglect the instance-level representation learning. They assume that the availability of a pre-trained feature extractor can be directly utilized or fine-tuned, which is not always the case. This paper proposes to pre-train feature extractor for MIL via a weakly-supervised scheme, i.e., propagating the weak bag-level labels to the corresponding instances for supervised learning. To learn effective features for MIL, we further delve into several key components, including strong data augmentation, a non-linear prediction head and the robust loss function. We conduct experiments on common large-scale WSI datasets and find it achieves better performance than other pre-training schemes (e.g., ImageNet pre-training and self-supervised learning) in different downstream tasks. We further show the compatibility and scalability of the proposed scheme by deploying it in fine-tuning the pathological-specific models and pre-training on merged multiple datasets. To our knowledge, this is the first work focusing on the representation learning for MIL.

Title: Learning Graph Representation of Agent Diffuser

Authors: Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan Dubiński, Ahmed Nabil Belbachir, Anis Yazidi
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.06761
Pdf URL: https://arxiv.org/pdf/2505.06761
Copy Paste: [[2505.06761]] Learning Graph Representation of Agent Diffuser(https://arxiv.org/abs/2505.06761)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models have significantly advanced text-to-image synthesis, demonstrating impressive text comprehension and zero-shot generalization. These models refine images from random noise based on textual prompts, with initial reliance on text input shifting towards enhanced visual fidelity over time. This transition suggests that static model parameters might not optimally address the distinct phases of generation. We introduce LGR-AD (Learning Graph Representation of Agent Diffusers), a novel multi-agent system designed to improve adaptability in dynamic computer vision tasks. LGR-AD models the generation process as a distributed system of interacting agents, each representing an expert sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. Our approach employs a coordination mechanism based on top-$k$ maximum spanning trees, optimizing the generation process. Each agent's decision-making is guided by a meta-model that minimizes a novel loss function, balancing accuracy and diversity. Theoretical analysis and extensive empirical evaluations show that LGR-AD outperforms traditional diffusion models across various benchmarks, highlighting its potential for scalable and flexible solutions in complex image generation tasks. Code is available at: this https URL

Title: Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology

Authors: Xiaohan Wang, Matthew Berger
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06804
Pdf URL: https://arxiv.org/pdf/2505.06804
Copy Paste: [[2505.06804]] Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology(https://arxiv.org/abs/2505.06804)
Keywords: diffusion, generative
Abstract: For domains that involve numerical simulation, it can be computationally expensive to run an ensemble of simulations spanning a parameter space of interest to a user. To this end, an attractive surrogate for simulation is the generative modeling of fields produced by an ensemble, allowing one to synthesize fields in a computationally cheap, yet accurate, manner. However, for the purposes of visual analysis, a limitation of generative models is their lack of control, as it is unclear what one should expect when sampling a field from a model. In this paper we study how to make generative models of fields more controllable, so that users can specify features of interest, in particular topological features, that they wish to see in the output. We propose topology guidance, a method for guiding the sampling process of a generative model, specifically a diffusion model, such that a topological description specified as input is satisfied in the generated output. Central to our method, we couple a coordinate-based neural network used to represent fields, with a diffusion model used for generation. We show how to use topologically-relevant signals provided by the coordinate-based network to help guide the denoising process of a diffusion model. This enables us to faithfully represent a user's specified topology, while ensuring that the output field remains within the generative data distribution. Specifically, we study 2D vector field topology, evaluating our method over an ensemble of fluid flows, where we show that generated vector fields faithfully adhere to the location, and type, of critical points over the spatial domain. We further show the benefits of our method in aiding the comparison of ensembles, allowing one to explore commonalities and differences in distributions along prescribed topological features.

Title: Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Authors: Zhengmi Tang, Yuto Mitsui, Tomo Miyazaki, Shinichiro Omachi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06855
Pdf URL: https://arxiv.org/pdf/2505.06855
Copy Paste: [[2505.06855]] Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies(https://arxiv.org/abs/2505.06855)
Keywords: self-supervised
Abstract: Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can mask the continuous image patches and completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM frame, which jointly learns low and high-level textual representations. After fine-tuning with real data, MMS outperforms the state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.

Title: Image Classification Using a Diffusion Model as a Pre-Training Model

Authors: Kosuke Ukita, Ye Xiaolong, Tsuyoshi Okita
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.06890
Pdf URL: https://arxiv.org/pdf/2505.06890
Copy Paste: [[2505.06890]] Image Classification Using a Diffusion Model as a Pre-Training Model(https://arxiv.org/abs/2505.06890)
Keywords: diffusion, self-supervised
Abstract: In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.

Title: Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion

Authors: Timing Li, Bing Cao, Pengfei Zhu, Bin Xiao, Qinghua Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06920
Pdf URL: https://arxiv.org/pdf/2505.06920
Copy Paste: [[2505.06920]] Bi-directional Self-Registration for Misaligned Infrared-Visible Image Fusion(https://arxiv.org/abs/2505.06920)
Keywords: self-supervised
Abstract: Acquiring accurately aligned multi-modal image pairs is fundamental for achieving high-quality multi-modal image fusion. To address the lack of ground truth in current multi-modal image registration and fusion methods, we propose a novel self-supervised \textbf{B}i-directional \textbf{S}elf-\textbf{R}egistration framework (\textbf{B-SR}). Specifically, B-SR utilizes a proxy data generator (PDG) and an inverse proxy data generator (IPDG) to achieve self-supervised global-local registration. Visible-infrared image pairs with spatially misaligned differences are aligned to obtain global differences through the registration module. The same image pairs are processed by PDG, such as cropping, flipping, stitching, etc., and then aligned to obtain local differences. IPDG converts the obtained local differences into pseudo-global differences, which are used to perform global-local difference consistency with the global differences. Furthermore, aiming at eliminating the effect of modal gaps on the registration module, we design a neighborhood dynamic alignment loss to achieve cross-modal image edge alignment. Extensive experiments on misaligned multi-modal images demonstrate the effectiveness of the proposed method in multi-modal image alignment and fusion against the competing methods. Our code will be publicly available.

Title: A systematic review of challenges and proposed solutions in modeling multimodal data

Authors: Maryam Farhadizadeh (1 and 2), Maria Weymann (2 and 3), Michael Blaß (4), Johann Kraus (5), Christopher Gundler (4), Sebastian Walter (6), Noah Hempen (1), Harald Binde (2 and 3), Nadine Binder (1 and 2) ((1) Institute of General Practice/Family Medicine, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (2) Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Germany, (3) Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany, (4) Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf, Germany, (5) Institute of Medical Systems Biology, Ulm University, Germany, (6) Department of Computer Science, Faculty of Engineering - University of Freiburg, Germany)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.06945
Pdf URL: https://arxiv.org/pdf/2505.06945
Copy Paste: [[2505.06945]] A systematic review of challenges and proposed solutions in modeling multimodal data(https://arxiv.org/abs/2505.06945)
Keywords: generative
Abstract: Multimodal data modeling has emerged as a powerful approach in clinical research, enabling the integration of diverse data types such as imaging, genomics, wearable sensors, and electronic health records. Despite its potential to improve diagnostic accuracy and support personalized care, modeling such heterogeneous data presents significant technical challenges. This systematic review synthesizes findings from 69 studies to identify common obstacles, including missing modalities, limited sample sizes, dimensionality imbalance, interpretability issues, and finding the optimal fusion techniques. We highlight recent methodological advances, such as transfer learning, generative models, attention mechanisms, and neural architecture search that offer promising solutions. By mapping current trends and innovations, this review provides a comprehensive overview of the field and offers practical insights to guide future research and development in multimodal modeling for medical applications.

Title: Unsupervised Learning for Class Distribution Mismatch

Authors: Pan Du, Wangbo Zhao, Xinai Lu, Nian Liu, Zhikai Li, Chaoyu Gong, Suyun Zhao, Hong Chen, Cuiping Li, Kai Wang, Yang You
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.06948
Pdf URL: https://arxiv.org/pdf/2505.06948
Copy Paste: [[2505.06948]] Unsupervised Learning for Class Distribution Mismatch(https://arxiv.org/abs/2505.06948)
Keywords: diffusion
Abstract: Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes.

Title: Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation

Authors: Seokjun Kwon, Jeongmin Shin, Namil Kim, Soonmin Hwang, Yukyung Choi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.06951
Pdf URL: https://arxiv.org/pdf/2505.06951
Copy Paste: [[2505.06951]] Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation(https://arxiv.org/abs/2505.06951)
Keywords: self-supervised
Abstract: In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, they significantly decrease performance during domain adaptation. In this paper, we present a comprehensive study on cross-spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self-supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre-trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance over previous UDA methods and comparable performance to state-of-the-art supervised methods.

Title: Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation

Authors: Md. Naimur Asif Borno, Md Sakib Hossain Shovon, Asmaa Soliman Al-Moisheer, Mohammad Ali Moni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.06995
Pdf URL: https://arxiv.org/pdf/2505.06995
Copy Paste: [[2505.06995]] Replay-Based Continual Learning with Dual-Layered Distillation and a Streamlined U-Net for Efficient Text-to-Image Generation(https://arxiv.org/abs/2505.06995)
Keywords: diffusion
Abstract: Recent advancements in text-to-image diffusion models are hindered by high computational demands, limiting accessibility and scalability. This paper introduces KDC-Diff, a novel stable diffusion framework that enhances efficiency while maintaining image quality. KDC-Diff features a streamlined U-Net architecture with nearly half the parameters of the original U-Net (482M), significantly reducing model complexity. We propose a dual-layered distillation strategy to ensure high-fidelity generation, transferring semantic and structural insights from a teacher to a compact student model while minimizing quality degradation. Additionally, replay-based continual learning is integrated to mitigate catastrophic forgetting, allowing the model to retain prior knowledge while adapting to new data. Despite operating under extremely low computational resources, KDC-Diff achieves state-of-the-art performance on the Oxford Flowers and Butterflies & Moths 100 Species datasets, demonstrating competitive metrics such as FID, CLIP, and LPIPS. Moreover, it significantly reduces inference time compared to existing models. These results establish KDC-Diff as a highly efficient and adaptable solution for text-to-image generation, particularly in computationally constrained environments.

Title: CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation

Authors: Peng Li, Suizhi Ma, Jialiang Chen, Yuan Liu, Chongyi Zhang, Wei Xue, Wenhan Luo, Alla Sheffer, Wenping Wang, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07003
Pdf URL: https://arxiv.org/pdf/2505.07003
Copy Paste: [[2505.07003]] CMD: Controllable Multiview Diffusion for 3D Editing and Progressive Generation(https://arxiv.org/abs/2505.07003)
Keywords: diffusion
Abstract: Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods only rely on an input image or a text prompt to generate a 3D model, which lacks the control of each component of the generated 3D model. Any modifications of the input image lead to an entire regeneration of the 3D models. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments are conducted to demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.

Title: A Vision-Language Foundation Model for Leaf Disease Identification

Authors: Khang Nguyen Quoc, Lan Le Thi Thu, Luyl-Da Quach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07019
Pdf URL: https://arxiv.org/pdf/2505.07019
Copy Paste: [[2505.07019]] A Vision-Language Foundation Model for Leaf Disease Identification(https://arxiv.org/abs/2505.07019)
Keywords: foundation model
Abstract: Leaf disease identification plays a pivotal role in smart agriculture. However, many existing studies still struggle to integrate image and textual modalities to compensate for each other's limitations. Furthermore, many of these approaches rely on pretraining with constrained datasets such as ImageNet, which lack domain-specific information. We propose SCOLD (Soft-target COntrastive learning for Leaf Disease identification), a context-aware vision-language foundation model tailored to address these challenges for agricultural tasks. SCOLD is developed using a diverse corpus of plant leaf images and corresponding symptom descriptions, comprising over 186,000 image-caption pairs aligned with 97 unique concepts. Through task-agnostic pretraining, SCOLD leverages contextual soft targets to mitigate overconfidence in contrastive learning by smoothing labels, thereby improving model generalization and robustness on fine-grained classification tasks. Experimental results demonstrate that SCOLD outperforms existing vision-language models such as OpenAI-CLIP-L, BioCLIP, and SigLIP2 across several benchmarks, including zero-shot and few-shot classification, image-text retrieval, and image classification, while maintaining a competitive parameter footprint. Ablation studies further highlight SCOLD's effectiveness in contrast to its counterparts. The proposed approach significantly advances the agricultural vision-language foundation model, offering strong performance with minimal or no supervised fine-tuning. This work lays a solid groundwork for future research on models trained with long-form and simplified contexts, tasks involving class ambiguity, and multi-modal systems for intelligent plant disease diagnostics. The code for this study is available at this https URL

Title: DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models

Authors: Junhao Xia, Chaoyang Zhang, Yecheng Zhang, Chengyang Zhou, Zhichang Wang, Bochun Liu, Dongshuo Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07057
Pdf URL: https://arxiv.org/pdf/2505.07057
Copy Paste: [[2505.07057]] DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models(https://arxiv.org/abs/2505.07057)
Keywords: diffusion
Abstract: Video generation based on diffusion models presents a challenging multimodal task, with video editing emerging as a pivotal direction in this field. Recent video editing approaches primarily fall into two categories: training-required and training-free methods. While training-based methods incur high computational costs, training-free alternatives often yield suboptimal performance. To address these limitations, we propose DAPE, a high-quality yet cost-effective two-stage parameter-efficient fine-tuning (PEFT) framework for video editing. In the first stage, we design an efficient norm-tuning method to enhance temporal consistency in generated videos. The second stage introduces a vision-friendly adapter to improve visual quality. Additionally, we identify critical shortcomings in existing benchmarks, including limited category diversity, imbalanced object distribution, and inconsistent frame counts. To mitigate these issues, we curate a large dataset benchmark comprising 232 videos with rich annotations and 6 editing prompts, enabling objective and comprehensive evaluation of advanced methods. Extensive experiments on existing datasets (BalanceCC, LOVEU-TGVE, RAVE) and our proposed benchmark demonstrate that DAPE significantly improves temporal coherence and text-video alignment while outperforming previous state-of-the-art approaches.

Title: Seed1.5-VL Technical Report

Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07062
Pdf URL: https://arxiv.org/pdf/2505.07062
Copy Paste: [[2505.07062]] Seed1.5-VL Technical Report(https://arxiv.org/abs/2505.07062)
Keywords: foundation model
Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at this https URL (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)

Title: Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

Authors: Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Subjects: cs.LG, cond-mat.dis-nn, stat.ML
Abstract URL: https://arxiv.org/abs/2505.07070
Pdf URL: https://arxiv.org/pdf/2505.07070
Copy Paste: [[2505.07070]] Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures(https://arxiv.org/abs/2505.07070)
Keywords: generative
Abstract: How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.

Title: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

Authors: Zihang Liu, Zhenyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07071
Pdf URL: https://arxiv.org/pdf/2505.07071
Copy Paste: [[2505.07071]] Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution(https://arxiv.org/abs/2505.07071)
Keywords: diffusion
Abstract: Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference. To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth. Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at this https URL.

Title: Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

Authors: Payal Varshney, Adriano Lucieri, Christoph Balada, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07073
Pdf URL: https://arxiv.org/pdf/2505.07073
Copy Paste: [[2505.07073]] Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering(https://arxiv.org/abs/2505.07073)
Keywords: diffusion
Abstract: Concept-based explanations have emerged as an effective approach within Explainable Artificial Intelligence, enabling interpretable insights by aligning model decisions with human-understandable concepts. However, existing methods rely on computationally intensive procedures and struggle to efficiently capture complex, semantic concepts. Recently, the Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT) framework, introduced by Varshney et al. (2025), attempts to identify concepts via dimension-wise traversal of the latent space of a Variational Autoencoder trained on counterfactual trajectories. Extending the CDCT framework, this work introduces Concept Directions via Latent Clustering (CDLC), which extracts global, class-specific concept directions by clustering latent difference vectors derived from factual and diffusion-generated counterfactual image pairs. CDLC substantially reduces computational complexity by eliminating the exhaustive latent dimension traversal required in CDCT and enables the extraction of multidimensional semantic concepts encoded across the latent dimensions. This approach is validated on a real-world skin lesion dataset, demonstrating that the extracted concept directions align with clinically recognized dermoscopic features and, in some cases, reveal dataset-specific biases or unknown biomarkers. These results highlight that CDLC is interpretable, scalable, and applicable across high-stakes domains and diverse data modalities.

Title: Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression

Authors: Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Marco Fabris, Gian Antonio Susto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07119
Pdf URL: https://arxiv.org/pdf/2505.07119
Copy Paste: [[2505.07119]] Towards Scalable IoT Deployment for Visual Anomaly Detection via Efficient Compression(https://arxiv.org/abs/2505.07119)
Keywords: anomaly
Abstract: Visual Anomaly Detection (VAD) is a key task in industrial settings, where minimizing waste and operational costs is essential. Deploying deep learning models within Internet of Things (IoT) environments introduces specific challenges due to the limited computational power and bandwidth of edge devices. This study investigates how to perform VAD effectively under such constraints by leveraging compact and efficient processing strategies. We evaluate several data compression techniques, examining the trade-off between system latency and detection accuracy. Experiments on the MVTec AD benchmark demonstrate that significant compression can be achieved with minimal loss in anomaly detection performance compared to uncompressed data.

Title: Real-Time Bit-Level Encryption of Full High-Definition Video Without Diffusion

Authors: Dong Jiang, Hui-ran Luo, Zi-jian Cui, Xi-jue Zhao, Lin-sheng Huang, Liang-liang Lu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.07158
Pdf URL: https://arxiv.org/pdf/2505.07158
Copy Paste: [[2505.07158]] Real-Time Bit-Level Encryption of Full High-Definition Video Without Diffusion(https://arxiv.org/abs/2505.07158)
Keywords: diffusion
Abstract: Despite the widespread adoption of Shannon's confusion-diffusion architecture in image encryption, the implementation of diffusion to sequentially establish inter-pixel dependencies for attaining plaintext sensitivity constrains algorithmic parallelism, while the execution of multiple rounds of diffusion operations to meet the required sensitivity metrics incurs excessive computational overhead. Consequently, the pursuit of plaintext sensitivity through diffusion operations is the primary factor limiting the computational efficiency and throughput of video encryption algorithms, rendering them inadequate to meet the demands of real-time encryption for high-resolution video. To address the performance limitation, this paper proposes a real-time video encryption protocol based on heterogeneous parallel computing, which incorporates the SHA-256 hashes of original frames as input, employs multiple CPU threads to concurrently generate encryption-related data, and deploys numerous GPU threads to simultaneously encrypt pixels. By leveraging the extreme input sensitivity of the SHA hash, the proposed protocol achieves the required plaintext sensitivity metrics with only a single round of confusion and XOR operations, significantly reducing computational overhead. Furthermore, through eliminating the reliance on diffusion, it realizes the allocation of a dedicated GPU thread for encrypting each pixel within every channel, effectively enhancing algorithm's parallelism. The experimental results demonstrate that our approach not only exhibits superior statistical properties and robust security but also achieving delay-free bit-level encryption for 1920$\times$1080 resolution (full high definition) video at 30 FPS, with an average encryption time of 25.84 ms on a server equipped with an Intel Xeon Gold 6226R CPU and an NVIDIA GeForce RTX 3090 GPU.

Title: Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework

Authors: Jun Li, Hongzhang Zhu, Tao Chen, Xiaohua Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07165
Pdf URL: https://arxiv.org/pdf/2505.07165
Copy Paste: [[2505.07165]] Generalizable Pancreas Segmentation via a Dual Self-Supervised Learning Framework(https://arxiv.org/abs/2505.07165)
Keywords: self-supervised
Abstract: Recently, numerous pancreas segmentation methods have achieved promising performance on local single-source datasets. However, these methods don't adequately account for generalizability issues, and hence typically show limited performance and low stability on test data from other sources. Considering the limited availability of distinct data sources, we seek to improve the generalization performance of a pancreas segmentation model trained with a single-source dataset, i.e., the single source generalization task. In particular, we propose a dual self-supervised learning model that incorporates both global and local anatomical contexts. Our model aims to fully exploit the anatomical features of the intra-pancreatic and extra-pancreatic regions, and hence enhance the characterization of the high-uncertainty regions for more robust generalization. Specifically, we first construct a global-feature contrastive self-supervised learning module that is guided by the pancreatic spatial structure. This module obtains complete and consistent pancreatic features through promoting intra-class cohesion, and also extracts more discriminative features for differentiating between pancreatic and non-pancreatic tissues through maximizing inter-class separation. It mitigates the influence of surrounding tissue on the segmentation outcomes in high-uncertainty regions. Subsequently, a local-image restoration self-supervised learning module is introduced to further enhance the characterization of the high uncertainty regions. In this module, informative anatomical contexts are actually learned to recover randomly corrupted appearance patterns in those regions.

Title: Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning

Authors: Zexian Yang, Dian Li, Dayan Wu, Gang Liu, Weiping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07172
Pdf URL: https://arxiv.org/pdf/2505.07172
Copy Paste: [[2505.07172]] Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning(https://arxiv.org/abs/2505.07172)
Keywords: in-context
Abstract: Despite significant advancements in multimodal reasoning tasks, existing Large Vision-Language Models (LVLMs) are prone to producing visually ungrounded responses when interpreting associated images. In contrast, when humans embark on learning new knowledge, they often rely on a set of fundamental pre-study principles: reviewing outlines to grasp core concepts, summarizing key points to guide their focus and enhance understanding. However, such preparatory actions are notably absent in the current instruction tuning processes. This paper presents Re-Critic, an easily scalable rationale-augmented framework designed to incorporate fundamental rules and chain-of-thought (CoT) as a bridge to enhance reasoning abilities. Specifically, Re-Critic develops a visual rationale synthesizer that scalably augments raw instructions with rationale explanation. To probe more contextually grounded responses, Re-Critic employs an in-context self-critic mechanism to select response pairs for preference tuning. Experiments demonstrate that models fine-tuned with our rationale-augmented dataset yield gains that extend beyond hallucination-specific tasks to broader multimodal reasoning tasks.

Title: Incomplete In-context Learning

Authors: Wenqiang Wang, Yangshijie Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07251
Pdf URL: https://arxiv.org/pdf/2505.07251
Copy Paste: [[2505.07251]] Incomplete In-context Learning(https://arxiv.org/abs/2505.07251)
Keywords: in-context
Abstract: Large vision language models (LVLMs) achieve remarkable performance through Vision In-context Learning (VICL), a process that depends significantly on demonstrations retrieved from an extensive collection of annotated examples (retrieval database). Existing studies often assume that the retrieval database contains annotated examples for all labels. However, in real-world scenarios, delays in database updates or incomplete data annotation may result in the retrieval database containing labeled samples for only a subset of classes. We refer to this phenomenon as an \textbf{incomplete retrieval database} and define the in-context learning under this condition as \textbf{Incomplete In-context Learning (IICL)}. To address this challenge, we propose \textbf{Iterative Judgments and Integrated Prediction (IJIP)}, a two-stage framework designed to mitigate the limitations of IICL. The Iterative Judgments Stage reformulates an $\boldsymbol{m}$-class classification problem into a series of $\boldsymbol{m}$ binary classification tasks, effectively converting the IICL setting into a standard VICL scenario. The Integrated Prediction Stage further refines the classification process by leveraging both the input image and the predictions from the Iterative Judgments Stage to enhance overall classification accuracy. IJIP demonstrates considerable performance across two LVLMs and two datasets under three distinct conditions of label incompleteness, achieving the highest accuracy of 93.9\%. Notably, even in scenarios where labels are fully available, IJIP still achieves the best performance of all six baselines. Furthermore, IJIP can be directly applied to \textbf{Prompt Learning} and is adaptable to the \textbf{text domain}.

Title: Synthetic Similarity Search in Automotive Production

Authors: Christoph Huber, Ludwig Schleeh, Dino Knoll, Michael Guthe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07256
Pdf URL: https://arxiv.org/pdf/2505.07256
Copy Paste: [[2505.07256]] Synthetic Similarity Search in Automotive Production(https://arxiv.org/abs/2505.07256)
Keywords: foundation model
Abstract: Visual quality inspection in automotive production is essential for ensuring the safety and reliability of vehicles. Computer vision (CV) has become a popular solution for these inspections due to its cost-effectiveness and reliability. However, CV models require large, annotated datasets, which are costly and time-consuming to collect. To reduce the need for extensive training data, we propose a novel image classification pipeline that combines similarity search using a vision-based foundation model with synthetic data. Our approach leverages a DINOv2 model to transform input images into feature vectors, which are then compared to pre-classified reference images using cosine distance measurements. By utilizing synthetic data instead of real images as references, our pipeline achieves high classification accuracy without relying on real data. We evaluate this approach in eight real-world inspection scenarios and demonstrate that it meets the high performance requirements of production environments.

Title: AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Authors: Kai Hua, Steven Wu, Ge Zhang, Ke Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07293
Pdf URL: https://arxiv.org/pdf/2505.07293
Copy Paste: [[2505.07293]] AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection(https://arxiv.org/abs/2505.07293)
Keywords: in-context
Abstract: Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models-offering a promising and scalable path for reasoning-centric data selection.

Title: Generative Pre-trained Autoregressive Diffusion Transformer

Authors: Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07344
Pdf URL: https://arxiv.org/pdf/2505.07344
Copy Paste: [[2505.07344]] Generative Pre-trained Autoregressive Diffusion Transformer(https://arxiv.org/abs/2505.07344)
Keywords: diffusion, generative
Abstract: In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.

Title: QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines

Authors: Ohjoon Kwon, Changsu Lee, Jihye Back, Lim Sun Suk, Inho Kang, Donghyeon Jeon
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.07345
Pdf URL: https://arxiv.org/pdf/2505.07345
Copy Paste: [[2505.07345]] QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines(https://arxiv.org/abs/2505.07345)
Keywords: generative
Abstract: Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study demonstrates that combining two distinct small language models (SLMs) with different architectures can outperform LLMs in this task. Our approach -- QUPID -- integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy while reducing computational costs compared to state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems processing millions of queries daily. In experiments across diverse document types, our method demonstrated consistent performance improvements (Cohen's Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference times. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly enhance both search relevance and operational efficiency in information retrieval systems.

Title: From Search To Sampling: Generative Models For Robust Algorithmic Recourse

Authors: Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.07351
Pdf URL: https://arxiv.org/pdf/2505.07351
Copy Paste: [[2505.07351]] From Search To Sampling: Generative Models For Robust Algorithmic Recourse(https://arxiv.org/abs/2505.07351)
Keywords: generative
Abstract: Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe's training leads to a consistent estimator. Unlike most prior methods, that employ non-robust gradient descent based search during inference, GenRe simply performs a forward sampling over the generative model to produce minimum cost recourse, leading to superior performance across multiple metrics. We also demonstrate GenRe provides the best trade-off between cost, plausibility and validity, compared to state-of-art baselines. Our code is available at: this https URL.

Title: Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection

Authors: Yuqi Cheng, Yunkang Cao, Dongfang Wang, Weiming Shen, Wenlong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07375
Pdf URL: https://arxiv.org/pdf/2505.07375
Copy Paste: [[2505.07375]] Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection(https://arxiv.org/abs/2505.07375)
Keywords: anomaly
Abstract: Point cloud anomaly detection is essential for various industrial applications. The huge computation and storage costs caused by the increasing product classes limit the application of single-class unsupervised methods, necessitating the development of multi-class unsupervised methods. However, the feature similarity between normal and anomalous points from different class data leads to the feature confusion problem, which greatly hinders the performance of multi-class methods. Therefore, we introduce a multi-class point cloud anomaly detection method, named GLFM, leveraging global-local feature matching to progressively separate data that are prone to confusion across multiple classes. Specifically, GLFM is structured into three stages: Stage-I proposes an anomaly synthesis pipeline that stretches point clouds to create abundant anomaly data that are utilized to adapt the point cloud feature extractor for better feature representation. Stage-II establishes the global and local memory banks according to the global and local feature distributions of all the training data, weakening the impact of feature confusion on the establishment of the memory bank. Stage-III implements anomaly detection of test data leveraging its feature distance from global and local memory banks. Extensive experiments on the MVTec 3D-AD, Real3D-AD and actual industry parts dataset showcase our proposed GLFM's superior point cloud anomaly detection performance. The code is available at this https URL.

Title: Unified Continuous Generative Models

Authors: Peng Sun, Yi Jiang, Tao Lin
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07447
Pdf URL: https://arxiv.org/pdf/2505.07447
Copy Paste: [[2505.07447]] Unified Continuous Generative Models(https://arxiv.org/abs/2505.07447)
Keywords: diffusion, generative
Abstract: Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: this https URL.

Title: You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts

Authors: Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07477
Pdf URL: https://arxiv.org/pdf/2505.07477
Copy Paste: [[2505.07477]] You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts(https://arxiv.org/abs/2505.07477)
Keywords: diffusion
Abstract: Diffusion models (DMs) have recently demonstrated remarkable success in modeling large-scale data distributions. However, many downstream tasks require guiding the generated content based on specific differentiable metrics, typically necessitating backpropagation during the generation process. This approach is computationally expensive, as generating with DMs often demands tens to hundreds of recursive network calls, resulting in high memory usage and significant time consumption. In this paper, we propose a more efficient alternative that approaches the problem from the perspective of parallel denoising. We show that full backpropagation throughout the entire generation process is unnecessary. The downstream metrics can be optimized by retaining the computational graph of only one step during generation, thus providing a shortcut for gradient propagation. The resulting method, which we call Shortcut Diffusion Optimization (SDO), is generic, high-performance, and computationally lightweight, capable of optimizing all parameter types in diffusion sampling. We demonstrate the effectiveness of SDO on several real-world tasks, including controlling generation by optimizing latent and aligning the DMs by fine-tuning network parameters. Compared to full backpropagation, our approach reduces computational costs by $\sim 90\%$ while maintaining superior performance. Code is available at this https URL.

Title: Addressing degeneracies in latent interpolation for diffusion models

Authors: Erik Landolsi, Fredrik Kahl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07481
Pdf URL: https://arxiv.org/pdf/2505.07481
Copy Paste: [[2505.07481]] Addressing degeneracies in latent interpolation for diffusion models(https://arxiv.org/abs/2505.07481)
Keywords: diffusion
Abstract: There is an increasing interest in using image-generating diffusion models for deep data augmentation and image morphing. In this context, it is useful to interpolate between latents produced by inverting a set of input images, in order to generate new images representing some mixture of the inputs. We observe that such interpolation can easily lead to degenerate results when the number of inputs is large. We analyze the cause of this effect theoretically and experimentally, and suggest a suitable remedy. The suggested approach is a relatively simple normalization scheme that is easy to use whenever interpolation between latents is needed. We measure image quality using FID and CLIP embedding distance and show experimentally that baseline interpolation methods lead to a drop in quality metrics long before the degeneration issue is clearly visible. In contrast, our method significantly reduces the degeneration effect and leads to improved quality metrics also in non-degenerate situations.

Title: EAGLE: Contrastive Learning for Efficient Graph Anomaly Detection

Authors: Jing Ren, Mingliang Hou, Zhixuan Liu, Xiaomei Bai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07508
Pdf URL: https://arxiv.org/pdf/2505.07508
Copy Paste: [[2505.07508]] EAGLE: Contrastive Learning for Efficient Graph Anomaly Detection(https://arxiv.org/abs/2505.07508)
Keywords: anomaly
Abstract: Graph anomaly detection is a popular and vital task in various real-world scenarios, which has been studied for several decades. Recently, many studies extending deep learning-based methods have shown preferable performance on graph anomaly detection. However, existing methods are lack of efficiency that is definitely necessary for embedded devices. Towards this end, we propose an Efficient Anomaly detection model on heterogeneous Graphs via contrastive LEarning (EAGLE) by contrasting abnormal nodes with normal ones in terms of their distances to the local context. The proposed method first samples instance pairs on meta path-level for contrastive learning. Then, a graph autoencoder-based model is applied to learn informative node embeddings in an unsupervised way, which will be further combined with the discriminator to predict the anomaly scores of nodes. Experimental results show that EAGLE outperforms the state-of-the-art methods on three heterogeneous network datasets.

Title: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07538
Pdf URL: https://arxiv.org/pdf/2505.07538
Copy Paste: [[2505.07538]] Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning(https://arxiv.org/abs/2505.07538)
Keywords: diffusion
Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: this https URL.

Title: Noise Optimized Conditional Diffusion for Domain Adaptation

Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.07548
Pdf URL: https://arxiv.org/pdf/2505.07548
Copy Paste: [[2505.07548]] Noise Optimized Conditional Diffusion for Domain Adaptation(https://arxiv.org/abs/2505.07548)
Keywords: diffusion, generative
Abstract: Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples (\textbf{hcpl-tds}) often leads to inaccurate cross-domain statistical alignment, causing DA failures. To address this challenge, we propose \textbf{N}oise \textbf{O}ptimized \textbf{C}onditional \textbf{D}iffusion for \textbf{D}omain \textbf{A}daptation (\textbf{NOCDDA}), which seamlessly integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA to achieve task-coupled optimization for efficient adaptation. For robust cross-domain consistency, we modify the DA classifier to align with the conditional diffusion classifier within a unified optimization framework, enabling forward training on noise-varying cross-domain samples. Furthermore, we argue that the conventional $ \mathcal{N}(\mathbf{0}, \mathbf{I}) $ initialization in diffusion models often generates class-confused hcpl-tds, compromising discriminative DA. To resolve this, we introduce a class-aware noise optimization strategy that refines sampling regions for reverse class-specific hcpl-tds generation, effectively enhancing cross-domain alignment. Extensive experiments across 5 benchmark datasets and 29 DA tasks demonstrate significant performance gains of \textbf{NOCDDA} over 31 state-of-the-art methods, validating its robustness and effectiveness.

Title: Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs

Authors: Kamil Jeziorek, Tomasz Kryjak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07556
Pdf URL: https://arxiv.org/pdf/2505.07556
Copy Paste: [[2505.07556]] Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs(https://arxiv.org/abs/2505.07556)
Keywords: self-supervised
Abstract: Event cameras offer significant advantages over traditional frame-based sensors. These include microsecond temporal resolution, robustness under varying lighting conditions and low power consumption. Nevertheless, the effective processing of their sparse, asynchronous event streams remains challenging. Existing approaches to this problem can be categorised into two distinct groups. The first group involves the direct processing of event data with neural models, such as Spiking Neural Networks or Graph Convolutional Neural Networks. However, this approach is often accompanied by a compromise in terms of qualitative performance. The second group involves the conversion of events into dense representations with handcrafted aggregation functions, which can boost accuracy at the cost of temporal fidelity. This paper introduces a novel Self-Supervised Event Representation (SSER) method leveraging Gated Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event timestamps and polarities without temporal discretisation. The recurrent layers are trained in a self-supervised manner to maximise the fidelity of event-time encoding. The inference is performed with event representations generated asynchronously, thus ensuring compatibility with high-throughput sensors. The experimental validation demonstrates that SSER outperforms aggregation-based baselines, achieving improvements of 2.4% mAP and 0.6% on the Gen1 and 1 Mpx object detection datasets. Furthermore, the paper presents the first hardware implementation of recurrent representation for event data on a System-on-Chip FPGA, achieving sub-microsecond latency and power consumption between 1-2 W, suitable for real-time, power-efficient applications. Code is available at this https URL.

Title: Evaluating Modern Visual Anomaly Detection Approaches in Semiconductor Manufacturing: A Comparative Study

Authors: Manuel Barusco, Francesco Borsatti, Youssef Ben Khalifa, Davide Dalle Pezze, Gian Antonio Susto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07576
Pdf URL: https://arxiv.org/pdf/2505.07576
Copy Paste: [[2505.07576]] Evaluating Modern Visual Anomaly Detection Approaches in Semiconductor Manufacturing: A Comparative Study(https://arxiv.org/abs/2505.07576)
Keywords: anomaly
Abstract: Semiconductor manufacturing is a complex, multistage process. Automated visual inspection of Scanning Electron Microscope (SEM) images is indispensable for minimizing equipment downtime and containing costs. Most previous research considers supervised approaches, assuming a sufficient number of anomalously labeled samples. On the contrary, Visual Anomaly Detection (VAD), an emerging research domain, focuses on unsupervised learning, avoiding the costly defect collection phase while providing explanations of the predictions. We introduce a benchmark for VAD in the semiconductor domain by leveraging the MIIC dataset. Our results demonstrate the efficacy of modern VAD approaches in this field.

Title: Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods,Datasets,and Future Directions

Authors: Yi Zhang, Wenye Zhou, Ruonan Lin, Xin Yang, Hao Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07611
Pdf URL: https://arxiv.org/pdf/2505.07611
Copy Paste: [[2505.07611]] Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods,Datasets,and Future Directions(https://arxiv.org/abs/2505.07611)
Keywords: self-supervised
Abstract: Traffic accident prediction and detection are critical for enhancing road safety,and vision-based traffic accident anticipation (Vision-TAA) has emerged as a promising approach in the era of deep this http URL paper reviews 147 recent studies,focusing on the application of supervised,unsupervised,and hybrid deep learning models for accident prediction,alongside the use of real-world and synthetic this http URL methodologies are categorized into four key approaches: image and video feature-based prediction, spatiotemporal feature-based prediction, scene understanding,and multimodal data this http URL these methods demonstrate significant potential,challenges such as data scarcity,limited generalization to complex scenarios,and real-time performance constraints remain prevalent. This review highlights opportunities for future research,including the integration of multimodal data fusion, self-supervised learning,and Transformer-based architectures to enhance prediction accuracy and this http URL synthesizing existing advancements and identifying critical gaps, this paper provides a foundational reference for developing robust and adaptive Vision-TAA systems,contributing to road safety and traffic management.

Title: ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

Authors: Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, Tobias Hinz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07652
Pdf URL: https://arxiv.org/pdf/2505.07652
Copy Paste: [[2505.07652]] ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models(https://arxiv.org/abs/2505.07652)
Keywords: diffusion
Abstract: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token's effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines. You can find more details in this https URL

Title: Multimodal Survival Modeling in the Age of Foundation Models

Authors: Steven Song, Morgan Borjigin-Wang, Irene Madejski, Robert L. Grossman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07683
Pdf URL: https://arxiv.org/pdf/2505.07683
Copy Paste: [[2505.07683]] Multimodal Survival Modeling in the Age of Foundation Models(https://arxiv.org/abs/2505.07683)
Keywords: foundation model
Abstract: The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference through its harmonized genomics, clinical, and image data. Prior studies have trained bespoke cancer survival prediction models from unimodal or multimodal TCGA data. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive meaningful feature embeddings, agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the feasibility of training classical, multimodal survival models over zero-shot embeddings extracted by FMs. We show the ease and additive effect of multimodal fusion, outperforming unimodal models. We demonstrate the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we modernize survival modeling by leveraging FMs and information extraction from pathology reports.

Title: Spoken Language Understanding on Unseen Tasks With In-Context Learning

Authors: Neeraj Agrawal, Sriram Ganapathy
Subjects: cs.CL, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2505.07731
Pdf URL: https://arxiv.org/pdf/2505.07731
Copy Paste: [[2505.07731]] Spoken Language Understanding on Unseen Tasks With In-Context Learning(https://arxiv.org/abs/2505.07731)
Keywords: in-context
Abstract: Spoken language understanding (SLU) tasks involve diverse skills that probe the information extraction, classification and/or generation capabilities of models. In this setting, task-specific training data may not always be available. While traditional task-specific SLU models are unable to cater to such requirements, the speech-text large language models (LLMs) offer a promising alternative with emergent abilities. However, out of-the-box, our evaluations indicate that the zero/few-shot performance of prominent open-source speech-text LLMs on SLU tasks are not up to the mark. In this paper, we introduce a novel approach to robust task-agnostic fine-tuning using randomized class labels. With this proposed fine-tuning, we illustrate that the performance of the speech-text LLMs on an unseen task is significantly improved over standard approaches. Critically, the proposed approach avoids the requirement of task-specific data annotations for enabling new tasks in speech-text LLMs.

Title: LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

Authors: Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07734
Pdf URL: https://arxiv.org/pdf/2505.07734
Copy Paste: [[2505.07734]] LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention(https://arxiv.org/abs/2505.07734)
Keywords: diffusion, generative
Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vision Transformer (LAMM-ViT), a Vision Transformer designed for robust facial forgery detection. This model integrates distinct Region-Guided Multi-Head Attention (RG-MHA) and Layer-aware Mask Modulation (LAMM) components within each layer. RG-MHA utilizes facial landmarks to create regional attention masks, guiding the model to scrutinize architectural inconsistencies across different facial areas. Crucially, the separate LAMM module dynamically generates layer-specific parameters, including mask weights and gating values, based on network context. These parameters then modulate the behavior of RG-MHA, enabling adaptive adjustment of regional focus across network depths. This architecture facilitates the capture of subtle, hierarchical forgery cues ubiquitous among diverse generation techniques, such as GANs and Diffusion Models. In cross-model generalization tests, LAMM-ViT demonstrates superior performance, achieving 94.09% mean ACC (a +5.45% improvement over SoTA) and 98.62% mean AP (a +3.09% improvement). These results demonstrate LAMM-ViT's exceptional ability to generalize and its potential for reliable deployment against evolving synthetic media threats.

Title: Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

Authors: Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07747
Pdf URL: https://arxiv.org/pdf/2505.07747
Copy Paste: [[2505.07747]] Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets(https://arxiv.org/abs/2505.07747)
Keywords: diffusion, generative
Abstract: While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.

Title: Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation

Authors: Arya Grayeli, Vipin Swarup, Steven E. Noel
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2505.07777
Pdf URL: https://arxiv.org/pdf/2505.07777
Copy Paste: [[2505.07777]] Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation(https://arxiv.org/abs/2505.07777)
Keywords: generative
Abstract: Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this paper, we introduce a novel machine learning model for generating high-fidelity synthetic network flow datasets that are representative of real-world networks. Our approach involves the generation of dynamic multigraphs using a stochastic Kronecker graph generator for structure generation and a tabular generative adversarial network for feature generation. We further employ an XGBoost (eXtreme Gradient Boosting) model for graph alignment, ensuring accurate overlay of features onto the generated graph structure. We evaluate our model using new metrics that assess both the accuracy and diversity of the synthetic graphs. Our results demonstrate improvements in accuracy over previous large-scale graph generation methods while maintaining similar efficiency. We also explore the trade-off between accuracy and diversity in synthetic graph dataset creation, a topic not extensively covered in related works. Our contributions include the synthesis and evaluation of large real-world netflow datasets and the definition of new metrics for evaluating synthetic graph generative models.

Title: Learning Dynamics in Continual Pre-Training for Large Language Models

Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07796
Pdf URL: https://arxiv.org/pdf/2505.07796
Copy Paste: [[2505.07796]] Learning Dynamics in Continual Pre-Training for Large Language Models(https://arxiv.org/abs/2505.07796)
Keywords: foundation model
Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training steps and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.

Title: Continuous Visual Autoregressive Generation via Score Maximization

Authors: Chenze Shao, Fandong Meng, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07812
Pdf URL: https://arxiv.org/pdf/2505.07812
Copy Paste: [[2505.07812]] Continuous Visual Autoregressive Generation via Score Maximization(https://arxiv.org/abs/2505.07812)
Keywords: diffusion, generative
Abstract: Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: this https URL.

Title: DanceGRPO: Unleashing GRPO on Visual Generation

Authors: Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.07818
Pdf URL: https://arxiv.org/pdf/2505.07818
Copy Paste: [[2505.07818]] DanceGRPO: Unleashing GRPO on Visual Generation(https://arxiv.org/abs/2505.07818)
Keywords: diffusion, foundation model, generative
Abstract: Recent breakthroughs in generative models-particularly diffusion models and rectified flows-have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equations (ODEs)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundational models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, which outperform baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO not only can stabilize policy optimization for complex video generation, but also enables generative policy to better capture denoising trajectories for Best-of-N inference scaling and learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.