2025-05-06

Title: Multi-party Collaborative Attention Control for Image Customization

Authors: Han Yang, Chuanguang Yang, Qiuli Wang, Zhulin An, Weilun Feng, Libo Huang, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01428
Pdf URL: https://arxiv.org/pdf/2505.01428
Copy Paste: [[2505.01428]] Multi-party Collaborative Attention Control for Image Customization(https://arxiv.org/abs/2505.01428)
Keywords: diffusion
Abstract: The rapid advancement of diffusion models has increased the need for customized image generation. However, current customization methods face several limitations: 1) typically accept either image or text conditions alone; 2) customization in complex visual scenarios often leads to subject leakage or confusion; 3) image-conditioned outputs tend to suffer from inconsistent backgrounds; and 4) high computational costs. To address these issues, this paper introduces Multi-party Collaborative Attention Control (MCA-Ctrl), a tuning-free method that enables high-quality image customization using both text and complex visual conditions. Specifically, MCA-Ctrl leverages two key operations within the self-attention layer to coordinate multiple parallel diffusion processes and guide the target image generation. This approach allows MCA-Ctrl to capture the content and appearance of specific subjects while maintaining semantic consistency with the conditional input. Additionally, to mitigate subject leakage and confusion issues common in complex visual scenarios, we introduce a Subject Localization Module that extracts precise subject and editable image layers based on user instructions. Extensive quantitative and human evaluation experiments show that MCA-Ctrl outperforms existing methods in zero-shot image customization, effectively resolving the mentioned issues.

Title: Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models

Authors: Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01430
Pdf URL: https://arxiv.org/pdf/2505.01430
Copy Paste: [[2505.01430]] Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models(https://arxiv.org/abs/2505.01430)
Keywords: diffusion, generative
Abstract: The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.

Title: Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials

Authors: Tengfei Xing, Xiaodan Ren, Jie Li
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01438
Pdf URL: https://arxiv.org/pdf/2505.01438
Copy Paste: [[2505.01438]] Global Stress Generation and Spatiotemporal Super-Resolution Physics-Informed Operator under Dynamic Loading for Two-Phase Random Materials(https://arxiv.org/abs/2505.01438)
Keywords: diffusion
Abstract: Material stress analysis is a critical aspect of material design and performance optimization. Under dynamic loading, the global stress evolution in materials exhibits complex spatiotemporal characteristics, especially in two-phase random materials (TRMs). Such kind of material failure is often associated with stress concentration, and the phase boundaries are key locations where stress concentration occurs. In practical engineering applications, the spatiotemporal resolution of acquired microstructural data and its dynamic stress evolution is often limited. This poses challenges for deep learning methods in generating high-resolution spatiotemporal stress fields, particularly for accurately capturing stress concentration regions. In this study, we propose a framework for global stress generation and spatiotemporal super-resolution in TRMs under dynamic loading. First, we introduce a diffusion model-based approach, named as Spatiotemporal Stress Diffusion (STS-diffusion), for generating global spatiotemporal stress data. This framework incorporates Space-Time U-Net (STU-net), and we systematically investigate the impact of different attention positions on model accuracy. Next, we develop a physics-informed network for spatiotemporal super-resolution, termed as Spatiotemporal Super-Resolution Physics-Informed Operator (ST-SRPINN). The proposed ST-SRPINN is an unsupervised learning method. The influence of data-driven and physics-informed loss function weights on model accuracy is explored in detail. Benefiting from physics-based constraints, ST-SRPINN requires only low-resolution stress field data during training and can upscale the spatiotemporal resolution of stress fields to arbitrary magnifications.

Title: OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Authors: Shengkai Chen, Yifang Yin, Jinming Cao, Shili Xiang, Zhenguang Liu, Roger Zimmermann
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.01448
Pdf URL: https://arxiv.org/pdf/2505.01448
Copy Paste: [[2505.01448]] OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models(https://arxiv.org/abs/2505.01448)
Keywords: foundation model
Abstract: Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework OpenAVS-ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.

Title: COSMOS: Predictable and Cost-Effective Adaptation of LLMs

Authors: Jiayu Wang, Aws Albarghouthi, Frederic Sala
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01449
Pdf URL: https://arxiv.org/pdf/2505.01449
Copy Paste: [[2505.01449]] COSMOS: Predictable and Cost-Effective Adaptation of LLMs(https://arxiv.org/abs/2505.01449)
Keywords: in-context
Abstract: Large language models (LLMs) achieve remarkable performance across numerous tasks by using a diverse array of adaptation strategies. However, optimally selecting a model and adaptation strategy under resource constraints is challenging and often requires extensive experimentation. We investigate whether it is possible to accurately predict both performance and cost without expensive trials. We formalize the strategy selection problem for LLMs and introduce COSMOS, a unified prediction framework that efficiently estimates adaptation outcomes at minimal cost. We instantiate and study the capability of our framework via a pair of powerful predictors: embedding-augmented lightweight proxy models to predict fine-tuning performance, and low-sample scaling laws to forecast retrieval-augmented in-context learning. Extensive evaluation across eight representative benchmarks demonstrates that COSMOS achieves high prediction accuracy while reducing computational costs by 92.72% on average, and up to 98.71% in resource-intensive scenarios. Our results show that efficient prediction of adaptation outcomes is not only feasible but can substantially reduce the computational overhead of LLM deployment while maintaining performance standards.

Title: VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos

Authors: Zongxia Li, Xiyang Wu, Yubin Qin, Guangyao Shi, Hongyang Du, Dinesh Manocha, Tianyi Zhou, Jordan Lee Boyd-Graber
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01481
Pdf URL: https://arxiv.org/pdf/2505.01481
Copy Paste: [[2505.01481]] VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos(https://arxiv.org/abs/2505.01481)
Keywords: foundation model
Abstract: Synthetic video generation with foundation models has gained attention for its realism and wide applications. While these models produce high-quality frames, they often fail to respect common sense and physical laws, resulting in abnormal content. Existing metrics like VideoScore emphasize general quality but ignore such violations and lack interpretability. A more insightful approach is using multi-modal large language models (MLLMs) as interpretable evaluators, as seen in FactScore. Yet, MLLMs' ability to detect abnormalities in synthetic videos remains underexplored. To address this, we introduce VideoHallu, a benchmark featuring synthetic videos from models like Veo2, Sora, and Kling, paired with expert-designed QA tasks solvable via human-level reasoning across various categories. We assess several SoTA MLLMs, including GPT-4o, Gemini-2.5-Pro, Qwen-2.5-VL, and newer models like Video-R1 and VideoChat-R1. Despite strong real-world performance on MVBench and MovieChat, these models still hallucinate on basic commonsense and physics tasks in synthetic settings, underscoring the challenge of hallucination. We further fine-tune SoTA MLLMs using Group Relative Policy Optimization (GRPO) on real and synthetic commonsense/physics data. Results show notable accuracy gains, especially with counterexample integration, advancing MLLMs' reasoning capabilities. Our data is available at this https URL.

Title: Explainable Machine Learning for Cyberattack Identification from Traffic Flows

Authors: Yujing Zhou, Marc L. Jacquet, Robel Dawit, Skyler Fabre, Dev Sarawat, Faheem Khan, Madison Newell, Yongxin Liu, Dahai Liu, Hongyun Chen, Jian Wang, Huihui Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2505.01488
Pdf URL: https://arxiv.org/pdf/2505.01488
Copy Paste: [[2505.01488]] Explainable Machine Learning for Cyberattack Identification from Traffic Flows(https://arxiv.org/abs/2505.01488)
Keywords: anomaly
Abstract: The increasing automation of traffic management systems has made them prime targets for cyberattacks, disrupting urban mobility and public safety. Traditional network-layer defenses are often inaccessible to transportation agencies, necessitating a machine learning-based approach that relies solely on traffic flow data. In this study, we simulate cyberattacks in a semi-realistic environment, using a virtualized traffic network to analyze disruption patterns. We develop a deep learning-based anomaly detection system, demonstrating that Longest Stop Duration and Total Jam Distance are key indicators of compromised signals. To enhance interpretability, we apply Explainable AI (XAI) techniques, identifying critical decision factors and diagnosing misclassification errors. Our analysis reveals two primary challenges: transitional data inconsistencies, where mislabeled recovery-phase traffic misleads the model, and model limitations, where stealth attacks in low-traffic conditions evade detection. This work enhances AI-driven traffic security, improving both detection accuracy and trustworthiness in smart transportation systems.

Title: WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation

Authors: Daoan Zhang, Che Jiang, Ruoshi Xu, Biaoxiang Chen, Zijian Jin, Yutian Lu, Jianguo Zhang, Liang Yong, Jiebo Luo, Shengda Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01490
Pdf URL: https://arxiv.org/pdf/2505.01490
Copy Paste: [[2505.01490]] WorldGenBench: A World-Knowledge-Integrated Benchmark for Reasoning-Driven Text-to-Image Generation(https://arxiv.org/abs/2505.01490)
Keywords: diffusion
Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models still struggle with prompts that require rich world knowledge and implicit reasoning: both of which are critical for producing semantically accurate, coherent, and contextually appropriate images in real-world scenarios. To address this gap, we introduce \textbf{WorldGenBench}, a benchmark designed to systematically evaluate T2I models' world knowledge grounding and implicit inferential capabilities, covering both the humanities and nature domains. We propose the \textbf{Knowledge Checklist Score}, a structured metric that measures how well generated images satisfy key semantic expectations. Experiments across 21 state-of-the-art models reveal that while diffusion models lead among open-source methods, proprietary auto-regressive models like GPT-4o exhibit significantly stronger reasoning and knowledge integration. Our findings highlight the need for deeper understanding and inference capabilities in next-generation T2I systems. Project Page: \href{this https URL}{this https URL}

Title: The DCR Delusion: Measuring the Privacy Risk of Synthetic Data

Authors: Zexi Yao, Nataša Krčo, Georgi Ganev, Yves-Alexandre de Montjoye
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01524
Pdf URL: https://arxiv.org/pdf/2505.01524
Copy Paste: [[2505.01524]] The DCR Delusion: Measuring the Privacy Risk of Synthetic Data(https://arxiv.org/abs/2505.01524)
Keywords: diffusion
Abstract: Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synthetic dataset, practitioners and researchers often rely on simpler proxy metrics such as Distance to Closest Record (DCR). These metrics estimate privacy by measuring the similarity between the training data and generated synthetic data. This similarity is also compared against that between the training data and a disjoint holdout set of real records to construct a binary privacy test. If the synthetic data is not more similar to the training data than the holdout set is, it passes the test and is considered private. In this work we show that, while computationally inexpensive, DCR and other distance-based metrics fail to identify privacy leakage. Across multiple datasets and both classical models such as Baynet and CTGAN and more recent diffusion models, we show that datasets deemed private by proxy metrics are highly vulnerable to MIAs. We similarly find both the binary privacy test and the continuous measure based on these metrics to be uninformative of actual membership inference risk. We further show that these failures are consistent across different metric hyperparameter settings and record selection methods. Finally, we argue DCR and other distance-based metrics to be flawed by design and show a example of a simple leakage they miss in practice. With this work, we hope to motivate practitioners to move away from proxy metrics to MIAs as the rigorous, comprehensive standard of evaluating privacy of synthetic data, in particular to make claims of datasets being legally anonymous.

Title: Contextures: Representations from Contexts

Authors: Runtian Zhai, Kai Yang, Che-Ping Tsai, Burak Varici, Zico Kolter, Pradeep Ravikumar
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.01557
Pdf URL: https://arxiv.org/pdf/2505.01557
Copy Paste: [[2505.01557]] Contextures: Representations from Contexts(https://arxiv.org/abs/2505.01557)
Keywords: self-supervised, foundation model
Abstract: Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of the contexture theory by proving that representation learning within various learning paradigms -- supervised, self-supervised, and manifold learning -- can all be studied from such a perspective. We also prove that the representations that learn the contexture are optimal on those tasks that are compatible with the context. One important implication of the contexture theory is that once the model is large enough to approximate the top singular functions, further scaling up the model size yields diminishing returns. Therefore, scaling is not all we need, and further improvement requires better contexts. To this end, we study how to evaluate the usefulness of a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.

Title: A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation viaSynergistic Pseudo-Labeling and Generative Learning

Authors: Anan Yaghmour, Melba M. Crawford, Saurabh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01558
Pdf URL: https://arxiv.org/pdf/2505.01558
Copy Paste: [[2505.01558]] A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation viaSynergistic Pseudo-Labeling and Generative Learning(https://arxiv.org/abs/2505.01558)
Keywords: foundation model, generative
Abstract: Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high-performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. We further provide new mathematical insights into MAE-based generative learning for domain-invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.

Title: PainFormer: a Vision Foundation Model for Automatic Pain Assessment

Authors: Stefanos Gkikas, Raul Fernandez Rojas, Manolis Tsiknakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01571
Pdf URL: https://arxiv.org/pdf/2505.01571
Copy Paste: [[2505.01571]] PainFormer: a Vision Foundation Model for Automatic Pain Assessment(https://arxiv.org/abs/2505.01571)
Keywords: foundation model
Abstract: Pain is a manifold condition that impacts a significant percentage of the population. Accurate and reliable pain evaluation for the people suffering is crucial to developing effective and advanced pain management protocols. Automatic pain assessment systems provide continuous monitoring and support decision-making processes, ultimately aiming to alleviate distress and prevent functionality decline. This study introduces PainFormer, a vision foundation model based on multi-task learning principles trained simultaneously on 14 tasks/datasets with a total of 10.9 million samples. Functioning as an embedding extractor for various input modalities, the foundation model provides feature representations to the Embedding-Mixer, a transformer-based module that performs the final pain assessment. Extensive experiments employing behavioral modalities-including RGB, synthetic thermal, and estimated depth videos-and physiological modalities such as ECG, EMG, GSR, and fNIRS revealed that PainFormer effectively extracts high-quality embeddings from diverse input modalities. The proposed framework is evaluated on two pain datasets, BioVid and AI4Pain, and directly compared to 73 different methodologies documented in the literature. Experiments conducted in unimodal and multimodal settings demonstrate state-of-the-art performances across modalities and pave the way toward general-purpose models for automatic pain assessment.

Title: Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification

Authors: Sicong Li, Qianqian Xu, Zhiyong Yang, Zitai Wang, Linchao Zhang, Xiaochun Cao, Qingming Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.01660
Pdf URL: https://arxiv.org/pdf/2505.01660
Copy Paste: [[2505.01660]] Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification(https://arxiv.org/abs/2505.01660)
Keywords: foundation model
Abstract: Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. Seeing this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.

Title: Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings

Authors: Alexander Davis, Rafael Souza, Jia-Hao Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01711
Pdf URL: https://arxiv.org/pdf/2505.01711
Copy Paste: [[2505.01711]] Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings(https://arxiv.org/abs/2505.01711)
Keywords: foundation model
Abstract: Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.

Title: Vision and Intention Boost Large Language Model in Long-Term Action Anticipation

Authors: Congqi Cao, Lanshu Hu, Yating Yu, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01713
Pdf URL: https://arxiv.org/pdf/2505.01713
Copy Paste: [[2505.01713]] Vision and Intention Boost Large Language Model in Long-Term Action Anticipation(https://arxiv.org/abs/2505.01713)
Keywords: in-context
Abstract: Long-term action anticipation (LTA) aims to predict future actions over an extended period. Previous approaches primarily focus on learning exclusively from video data but lack prior knowledge. Recent researches leverage large language models (LLMs) by utilizing text-based inputs which suffer severe information loss. To tackle these limitations single-modality methods face, we propose a novel Intention-Conditioned Vision-Language (ICVL) model in this study that fully leverages the rich semantic information of visual data and the powerful reasoning capabilities of LLMs. Considering intention as a high-level concept guiding the evolution of actions, we first propose to employ a vision-language model (VLM) to infer behavioral intentions as comprehensive textual features directly from video inputs. The inferred intentions are then fused with visual features through a multi-modality fusion strategy, resulting in intention-enhanced visual representations. These enhanced visual representations, along with textual prompts, are fed into LLM for future action anticipation. Furthermore, we propose an effective example selection strategy jointly considers visual and textual similarities, providing more relevant and informative examples for in-context learning. Extensive experiments with state-of-the-art performance on Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+ datasets fully demonstrate the effectiveness and superiority of the proposed method.

Title: PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Authors: Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01729
Pdf URL: https://arxiv.org/pdf/2505.01729
Copy Paste: [[2505.01729]] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth(https://arxiv.org/abs/2505.01729)
Keywords: diffusion, self-supervised, generative
Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.

Title: Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01746
Pdf URL: https://arxiv.org/pdf/2505.01746
Copy Paste: [[2505.01746]] Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion(https://arxiv.org/abs/2505.01746)
Keywords: diffusion
Abstract: Generating gestures from human speech has gained tremendous progress in animating virtual avatars. While the existing methods enable synthesizing gestures cooperated by individual self-talking, they overlook the practicality of concurrent gesture modeling with two-person interactive conversations. Moreover, the lack of high-quality datasets with concurrent co-speech gestures also limits handling this issue. To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. Additionally, we propose Co$^3$Gesture, a novel framework that enables coherent concurrent co-speech gesture synthesis including two-person interactive movements. Considering the asymmetric body dynamics of two speakers, our framework is built upon two cooperative generation branches conditioned on separated speaker audio. Specifically, to enhance the coordination of human postures with respect to corresponding speaker audios while interacting with the conversational partner, we present a Temporal Interaction Module (TIM). TIM can effectively model the temporal association representation between two speakers' gesture sequences as interaction guidance and fuse it into the concurrent gesture generation. Then, we devise a mutual attention mechanism to further holistically boost learning dependencies of interacted concurrent motions, thereby enabling us to generate vivid and coherent gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at \href{this https URL}{\textit{this https URL}}.

Title: Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition

Authors: Amirmohammad Farzaneh, Osvaldo Simeone
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2505.01783
Pdf URL: https://arxiv.org/pdf/2505.01783
Copy Paste: [[2505.01783]] Context-Aware Online Conformal Anomaly Detection with Prediction-Powered Data Acquisition(https://arxiv.org/abs/2505.01783)
Keywords: anomaly
Abstract: Online anomaly detection is essential in fields such as cybersecurity, healthcare, and industrial monitoring, where promptly identifying deviations from expected behavior can avert critical failures or security breaches. While numerous anomaly scoring methods based on supervised or unsupervised learning have been proposed, current approaches typically rely on a continuous stream of real-world calibration data to provide assumption-free guarantees on the false discovery rate (FDR). To address the inherent challenges posed by limited real calibration data, we introduce context-aware prediction-powered conformal online anomaly detection (C-PP-COAD). Our framework strategically leverages synthetic calibration data to mitigate data scarcity, while adaptively integrating real data based on contextual cues. C-PP-COAD utilizes conformal p-values, active p-value statistics, and online FDR control mechanisms to maintain rigorous and reliable anomaly detection performance over time. Experiments conducted on both synthetic and real-world datasets demonstrate that C-PP-COAD significantly reduces dependency on real calibration data without compromising guaranteed FDR control.

Title: Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis

Authors: Chidimma Opara
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01800
Pdf URL: https://arxiv.org/pdf/2505.01800
Copy Paste: [[2505.01800]] Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis(https://arxiv.org/abs/2505.01800)
Keywords: generative
Abstract: The increasing sophistication of AI-generated texts highlights the urgent need for accurate and transparent detection tools, especially in educational settings, where verifying authorship is essential. Existing literature has demonstrated that the application of stylometric features with machine learning classifiers can yield excellent results. Building on this foundation, this study proposes a comprehensive framework that integrates stylometric analysis with psycholinguistic theories, offering a clear and interpretable approach to distinguishing between AI-generated and human-written texts. This research specifically maps 31 distinct stylometric features to cognitive processes such as lexical retrieval, discourse planning, cognitive load management, and metacognitive self-monitoring. In doing so, it highlights the unique psycholinguistic patterns found in human writing. Through the intersection of computational linguistics and cognitive science, this framework contributes to the development of reliable tools aimed at preserving academic integrity in the era of generative AI.

Title: $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge

Authors: Core Francisco Park, Zechen Zhang, Hidenori Tanaka
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.01812
Pdf URL: https://arxiv.org/pdf/2505.01812
Copy Paste: [[2505.01812]] $\textit{New News}$: System-2 Fine-tuning for Robust Integration of New Knowledge(https://arxiv.org/abs/2505.01812)
Keywords: in-context
Abstract: Humans and intelligent animals can effortlessly internalize new information ("news") and accurately extract the implications for performing downstream tasks. While large language models (LLMs) can achieve this through in-context learning (ICL) when the news is explicitly given as context, fine-tuning remains challenging for the models to consolidate learning in weights. In this paper, we introduce $\textit{New News}$, a dataset composed of hypothetical yet plausible news spanning multiple domains (mathematics, coding, discoveries, leaderboards, events), accompanied by downstream evaluation questions whose correct answers critically depend on understanding and internalizing the news. We first demonstrate a substantial gap between naive fine-tuning and in-context learning (FT-ICL gap) on our news dataset. To address this gap, we explore a suite of self-play data generation protocols -- paraphrases, implications and Self-QAs -- designed to distill the knowledge from the model with context into the weights of the model without the context, which we term $\textit{System-2 Fine-tuning}$ (Sys2-FT). We systematically evaluate ICL and Sys2-FT performance across data domains and model scales with the Qwen 2.5 family of models. Our results demonstrate that the self-QA protocol of Sys2-FT significantly improves models' in-weight learning of the news. Furthermore, we discover the $\textit{contexual shadowing effect}$, where training with the news $\textit{in context}$ followed by its rephrases or QAs degrade learning of the news. Finally, we show preliminary evidence of an emerging scaling law of Sys2-FT.

Title: Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning

Authors: Jifeng Hu, Sili Huang, Zhejian Yang, Shengchao Hu, Li Shen, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01822
Pdf URL: https://arxiv.org/pdf/2505.01822
Copy Paste: [[2505.01822]] Analytic Energy-Guided Policy Optimization for Offline Reinforcement Learning(https://arxiv.org/abs/2505.01822)
Keywords: diffusion
Abstract: Conditional decision generation with diffusion models has shown powerful competitiveness in reinforcement learning (RL). Recent studies reveal the relation between energy-function-guidance diffusion models and constrained RL problems. The main challenge lies in estimating the intermediate energy, which is intractable due to the log-expectation formulation during the generation process. To address this issue, we propose the Analytic Energy-guided Policy Optimization (AEPO). Specifically, we first provide a theoretical analysis and the closed-form solution of the intermediate guidance when the diffusion model obeys the conditional Gaussian transformation. Then, we analyze the posterior Gaussian distribution in the log-expectation formulation and obtain the target estimation of the log-expectation under mild assumptions. Finally, we train an intermediate energy neural network to approach the target estimation of log-expectation formulation. We apply our method in 30+ offline RL tasks to demonstrate the effectiveness of our method. Extensive experiments illustrate that our method surpasses numerous representative baselines in D4RL offline reinforcement learning benchmarks.

Title: PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach

Authors: Nitin Rai, Arnold W. Schumann, Nathan Boyd
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2505.01823
Pdf URL: https://arxiv.org/pdf/2505.01823
Copy Paste: [[2505.01823]] PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach(https://arxiv.org/abs/2505.01823)
Keywords: diffusion, generative
Abstract: Collecting large-scale crop disease images in the field is labor-intensive and time-consuming. Generative models (GMs) offer an alternative by creating synthetic samples that resemble real-world images. However, existing research primarily relies on Generative Adversarial Networks (GANs)-based image-to-image translation and lack a comprehensive analysis of computational requirements in agriculture. Therefore, this research explores a multi-modal text-to-image approach for generating synthetic crop disease images and is the first to provide computational benchmarking in this context. We trained three Stable Diffusion (SD) variants-SDXL, SD3.5M (medium), and SD3.5L (large)-and fine-tuned them using Dreambooth and Low-Rank Adaptation (LoRA) fine-tuning techniques to enhance generalization. SD3.5M outperformed the others, with an average memory usage of 18 GB, power consumption of 180 W, and total energy use of 1.02 kWh/500 images (0.002 kWh per image) during inference task. Our results demonstrate SD3.5M's ability to generate 500 synthetic images from just 36 in-field samples in 1.5 hours. We recommend SD3.5M for efficient crop disease data generation.

Title: DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion

Authors: Haoteng Li, Zhao Yang, Zezhong Qian, Gongpeng Zhao, Yuqi Huang, Jun Yu, Huazheng Zhou, Longjun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01857
Pdf URL: https://arxiv.org/pdf/2505.01857
Copy Paste: [[2505.01857]] DualDiff: Dual-branch Diffusion Model for Autonomous Driving with Semantic Fusion(https://arxiv.org/abs/2505.01857)
Keywords: diffusion
Abstract: Accurate and high-fidelity driving scene reconstruction relies on fully leveraging scene information as conditioning. However, existing approaches, which primarily use 3D bounding boxes and binary maps for foreground and background control, fall short in capturing the complexity of the scene and integrating multi-modal information. In this paper, we propose DualDiff, a dual-branch conditional diffusion model designed to enhance multi-view driving scene generation. We introduce Occupancy Ray Sampling (ORS), a semantic-rich 3D representation, alongside numerical driving scene representation, for comprehensive foreground and background control. To improve cross-modal information integration, we propose a Semantic Fusion Attention (SFA) mechanism that aligns and fuses features across modalities. Furthermore, we design a foreground-aware masked (FGM) loss to enhance the generation of tiny objects. DualDiff achieves state-of-the-art performance in FID score, as well as consistently better results in downstream BEV segmentation and 3D object detection tasks.

Title: Rethinking Score Distilling Sampling for 3D Editing and Generation

Authors: Xingyu Miao, Haoran Duan, Yang Long, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01888
Pdf URL: https://arxiv.org/pdf/2505.01888
Copy Paste: [[2505.01888]] Rethinking Score Distilling Sampling for 3D Editing and Generation(https://arxiv.org/abs/2505.01888)
Keywords: diffusion
Abstract: Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often can not generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing. The code is available on: this https URL.

Title: BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

Authors: Evan R. Antoniuk, Shehtab Zaman, Tal Ben-Nun, Peggy Li, James Diffenderfer, Busra Demirci, Obadiah Smolenski, Tim Hsu, Anna M. Hiszpanski, Kenneth Chiu, Bhavya Kailkhura, Brian Van Essen
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2505.01912
Pdf URL: https://arxiv.org/pdf/2505.01912
Copy Paste: [[2505.01912]] BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models(https://arxiv.org/abs/2505.01912)
Keywords: foundation model, generative, in-context
Abstract: Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM, $\boldsymbol{b}$enchmarks for $\boldsymbol{o}$ut-$\boldsymbol{o}$f-distribution $\boldsymbol{m}$olecular property predictions -- a benchmark study of property-based out-of-distribution models for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing models that achieve strong OOD generalization across all tasks: even the top performing model exhibited an average OOD error 3x larger than in-distribution. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited training data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on Github.

Title: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection

Authors: Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01969
Pdf URL: https://arxiv.org/pdf/2505.01969
Copy Paste: [[2505.01969]] MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection(https://arxiv.org/abs/2505.01969)
Keywords: anomaly
Abstract: 3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. Therefore, this paper presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the AD task. MC3D-AD is evaluated on two publicly available Real3D-AD and Anomaly-ShapeNet datasets, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1\% and 9.3\% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The source code will be released upon acceptance.

Title: Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation

Authors: Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Khang Nguyen, Yasuhiko Nakashima
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.01984
Pdf URL: https://arxiv.org/pdf/2505.01984
Copy Paste: [[2505.01984]] Lifelong Whole Slide Image Analysis: Online Vision-Language Adaptation and Past-to-Present Gradient Distillation(https://arxiv.org/abs/2505.01984)
Keywords: foundation model
Abstract: Whole Slide Images (WSIs) play a crucial role in accurate cancer diagnosis and prognosis, as they provide tissue details at the cellular level. However, the rapid growth of computational tasks involving WSIs poses significant challenges. Given that WSIs are gigapixels in size, they present difficulties in terms of storage, processing, and model training. Therefore, it is essential to develop lifelong learning approaches for WSI analysis. In scenarios where slides are distributed across multiple institutes, we aim to leverage them to develop a unified online model as a computational tool for cancer diagnosis in clinical and hospital settings. In this study, we introduce ADaFGrad, a method designed to enhance lifelong learning for whole-slide image (WSI) analysis. First, we leverage pathology vision-language foundation models to develop a framework that enables interaction between a slide's regional tissue features and a predefined text-based prototype buffer. Additionally, we propose a gradient-distillation mechanism that mimics the gradient of a logit with respect to the classification-head parameters across past and current iterations in a continual-learning setting. We construct a sequence of six TCGA datasets for training and evaluation. Experimental results show that ADaFGrad outperforms both state-of-the-art WSI-specific and conventional continual-learning methods after only a few training epochs, exceeding them by up to +5.068% in the class-incremental learning scenario while exhibiting the least forgetting (i.e., retaining the most knowledge from previous tasks). Moreover, ADaFGrad surpasses its baseline by as much as +40.084% in accuracy, further demonstrating the effectiveness of the proposed modules.

Title: Always Skip Attention

Authors: Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.01996
Pdf URL: https://arxiv.org/pdf/2505.01996
Copy Paste: [[2505.01996]] Always Skip Attention(https://arxiv.org/abs/2505.01996)
Keywords: self-supervised
Abstract: We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (\eg, CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.

Title: GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning

Authors: Rui Lv, Zaixi Zhang, Kai Zhang, Qi Liu, Weibo Gao, Jiawei Liu, Jiaxia Yan, Linan Yue, Fangzhou Yao
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.02027
Pdf URL: https://arxiv.org/pdf/2505.02027
Copy Paste: [[2505.02027]] GraphPrompter: Multi-stage Adaptive Prompt Optimization for Graph In-Context Learning(https://arxiv.org/abs/2505.02027)
Keywords: in-context
Abstract: Graph In-Context Learning, with the ability to adapt pre-trained graph models to novel and diverse downstream graphs without updating any parameters, has gained much attention in the community. The key to graph in-context learning is to perform downstream graphs conditioned on chosen prompt examples. Existing methods randomly select subgraphs or edges as prompts, leading to noisy graph prompts and inferior model performance. Additionally, due to the gap between pre-training and testing graphs, when the number of classes in the testing graphs is much greater than that in the training, the in-context learning ability will also significantly deteriorate. To tackle the aforementioned challenges, we develop a multi-stage adaptive prompt optimization method GraphPrompter, which optimizes the entire process of generating, selecting, and using graph prompts for better in-context learning capabilities. Firstly, Prompt Generator introduces a reconstruction layer to highlight the most informative edges and reduce irrelevant noise for graph prompt construction. Furthermore, in the selection stage, Prompt Selector employs the $k$-nearest neighbors algorithm and pre-trained selection layers to dynamically choose appropriate samples and minimize the influence of irrelevant prompts. Finally, we leverage a Prompt Augmenter with a cache replacement strategy to enhance the generalization capability of the pre-trained model on new datasets. Extensive experiments show that GraphPrompter effectively enhances the in-context learning ability of graph models. On average across all the settings, our approach surpasses the state-of-the-art baselines by over 8%. Our code is released at this https URL.

Title: Secrets of GFlowNets' Learning Behavior: A Theoretical Study

Authors: Tianshu Yu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02035
Pdf URL: https://arxiv.org/pdf/2505.02035
Copy Paste: [[2505.02035]] Secrets of GFlowNets' Learning Behavior: A Theoretical Study(https://arxiv.org/abs/2505.02035)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) have emerged as a powerful paradigm for generating composite structures, demonstrating considerable promise across diverse applications. While substantial progress has been made in exploring their modeling validity and connections to other generative frameworks, the theoretical understanding of their learning behavior remains largely uncharted. In this work, we present a rigorous theoretical investigation of GFlowNets' learning behavior, focusing on four fundamental dimensions: convergence, sample complexity, implicit regularization, and robustness. By analyzing these aspects, we seek to elucidate the intricate mechanisms underlying GFlowNet's learning dynamics, shedding light on its strengths and limitations. Our findings contribute to a deeper understanding of the factors influencing GFlowNet performance and provide insights into principled guidelines for their effective design and deployment. This study not only bridges a critical gap in the theoretical landscape of GFlowNets but also lays the foundation for their evolution as a reliable and interpretable framework for generative modeling. Through this, we aspire to advance the theoretical frontiers of GFlowNets and catalyze their broader adoption in the AI community.

Title: Regression s all you need for medical image translation

Authors: Sebastian Rassmann, David Kügler, Christian Ewert, Martin Reuter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02048
Pdf URL: https://arxiv.org/pdf/2505.02048
Copy Paste: [[2505.02048]] Regression s all you need for medical image translation(https://arxiv.org/abs/2505.02048)
Keywords: diffusion, generative
Abstract: The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits - creativity and image realism - do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinder clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and, thus, eliminates noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets - comprising multi-contrast brain MRI and pelvic MRI-CT - we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.

Title: What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction

Authors: Eitan Wagner, Omri Abend
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02072
Pdf URL: https://arxiv.org/pdf/2505.02072
Copy Paste: [[2505.02072]] What do Language Model Probabilities Represent? From Distribution Estimation to Response Prediction(https://arxiv.org/abs/2505.02072)
Keywords: in-context
Abstract: The notion of language modeling has gradually shifted in recent years from a distribution over finite-length strings to general-purpose prediction models for textual inputs and outputs, following appropriate alignment phases. This paper analyzes the distinction between distribution estimation and response prediction in the context of LLMs, and their often conflicting goals. We examine the training phases of LLMs, which include pretraining, in-context learning, and preference tuning, and also the common use cases for their output probabilities, which include completion probabilities and explicit probabilities as output. We argue that the different settings lead to three distinct intended output distributions. We demonstrate that NLP works often assume that these distributions should be similar, which leads to misinterpretations of their experimental findings. Our work sets firmer formal foundations for the interpretation of LLMs, which will inform ongoing work on the interpretation and use of LLMs' induced distributions.

Title: Lightweight Defense Against Adversarial Attacks in Time Series Classification

Authors: Yi Han (Independent Researcher, Australia)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02073
Pdf URL: https://arxiv.org/pdf/2505.02073
Copy Paste: [[2505.02073]] Lightweight Defense Against Adversarial Attacks in Time Series Classification(https://arxiv.org/abs/2505.02073)
Keywords: foundation model
Abstract: As time series classification (TSC) gains prominence, ensuring robust TSC models against adversarial attacks is crucial. While adversarial defense is well-studied in Computer Vision (CV), the TSC field has primarily relied on adversarial training (AT), which is computationally expensive. In this paper, five data augmentation-based defense methods tailored for time series are developed, with the most computationally intensive method among them increasing the computational resources by only 14.07% compared to the original TSC model. Moreover, the deployment process for these methods is straightforward. By leveraging these advantages of our methods, we create two combined methods. One of these methods is an ensemble of all the proposed techniques, which not only provides better defense performance than PGD-based AT but also enhances the generalization ability of TSC models. Moreover, the computational resources required for our ensemble are less than one-third of those required for PGD-based AT. These methods advance robust TSC in data mining. Furthermore, as foundation models are increasingly explored for time series feature learning, our work provides insights into integrating data augmentation-based adversarial defense with large-scale pre-trained models in future research.

Title: Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation

Authors: Volodymyr Havrylov, Haiwen Huang, Dan Zhang, Andreas Geiger
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02075
Pdf URL: https://arxiv.org/pdf/2505.02075
Copy Paste: [[2505.02075]] Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation(https://arxiv.org/abs/2505.02075)
Keywords: foundation model
Abstract: Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks. As VFMs' popularity grows, there is an increasing interest in understanding their effectiveness for dense prediction tasks. However, VFMs typically produce low-resolution features, limiting their direct applicability in this context. One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines VFM features resolution. To assess the effectiveness of this approach, we investigate Interactive Segmentation (IS) as a novel benchmark for evaluating feature upsampling methods on VFMs. Due to its inherent multimodal input, consisting of an image and a set of user-defined clicks, as well as its dense mask output, IS creates a challenging environment that demands comprehensive visual scene understanding. Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves VFM features quality. The code is released at this https URL

Title: Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

Authors: Alexey Nekrasov, Malcolm Burdorf, Stewart Worrall, Bastian Leibe, Julie Stephany Berrio Perez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02148
Pdf URL: https://arxiv.org/pdf/2505.02148
Copy Paste: [[2505.02148]] Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving(https://arxiv.org/abs/2505.02148)
Keywords: anomaly
Abstract: To operate safely, autonomous vehicles (AVs) need to detect and handle unexpected objects or anomalies on the road. While significant research exists for anomaly detection and segmentation in 2D, research progress in 3D is underexplored. Existing datasets lack high-quality multimodal data that are typically found in AVs. This paper presents a novel dataset for anomaly segmentation in driving scenarios. To the best of our knowledge, it is the first publicly available dataset focused on road anomaly segmentation with dense 3D semantic labeling, incorporating both LiDAR and camera data, as well as sequential information to enable anomaly detection across various ranges. This capability is critical for the safe navigation of autonomous vehicles. We adapted and evaluated several baseline models for 3D segmentation, highlighting the challenges of 3D anomaly detection in driving environments. Our dataset and evaluation code will be openly available, facilitating the testing and performance comparison of different approaches.

Title: Sparfels: Fast Reconstruction from Sparse Unposed Imagery

Authors: Shubhendu Jena, Amine Ouasfi, Mae Younes, Adnane Boukhayma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02178
Pdf URL: https://arxiv.org/pdf/2505.02178
Copy Paste: [[2505.02178]] Sparfels: Fast Reconstruction from Sparse Unposed Imagery(https://arxiv.org/abs/2505.02178)
Keywords: foundation model
Abstract: We present a method for Sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations to instantiate a bundle adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization midst 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performances in the sparse uncalibrated setting in reconstruction and novel view benchmarks based on established multi-view datasets.

Title: ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications

Authors: Tao Zhu, Qi Yu, Xinru Dong, Shiyu Li, Yue Liu, Jinlong Jiang, Lei Shu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02179
Pdf URL: https://arxiv.org/pdf/2505.02179
Copy Paste: [[2505.02179]] ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications(https://arxiv.org/abs/2505.02179)
Keywords: anomaly
Abstract: Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at this https URL.

Title: Robust AI-Generated Face Detection with Imbalanced Data

Authors: Yamini Sri Krubha, Aryana Hou, Braden Vester, Web Walker, Xin Wang, Li Lin, Shu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02182
Pdf URL: https://arxiv.org/pdf/2505.02182
Copy Paste: [[2505.02182]] Robust AI-Generated Face Detection with Imbalanced Data(https://arxiv.org/abs/2505.02182)
Keywords: generative
Abstract: Deepfakes, created using advanced AI techniques such as Variational Autoencoder and Generative Adversarial Networks, have evolved from research and entertainment applications into tools for malicious activities, posing significant threats to digital trust. Current deepfake detection techniques have evolved from CNN-based methods focused on local artifacts to more advanced approaches using vision transformers and multimodal models like CLIP, which capture global anomalies and improve cross-domain generalization. Despite recent progress, state-of-the-art deepfake detectors still face major challenges in handling distribution shifts from emerging generative models and addressing severe class imbalance between authentic and fake samples in deepfake datasets, which limits their robustness and detection accuracy. To address these challenges, we propose a framework that combines dynamic loss reweighting and ranking-based optimization, which achieves superior generalization and performance under imbalanced dataset conditions. The code is available at this https URL.

Title: DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

Authors: Wenchuan Wang, Mengqi Huang, Yijing Tu, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02192
Pdf URL: https://arxiv.org/pdf/2505.02192
Copy Paste: [[2505.02192]] DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization(https://arxiv.org/abs/2505.02192)
Keywords: diffusion
Abstract: Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention through focusing on identity and motion consistency. Existing works typically follow the isolated customized paradigm, where the subject identity or motion dynamics are customized exclusively. However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrades. To address this, we introduce DualReal, a novel framework that, employs adaptive joint training to collaboratively construct interdependencies between dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically selects a training phase (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive benchmark than existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion quality metrics.

Title: Improving Physical Object State Representation in Text-to-Image Generative Systems

Authors: Tianle Chen, Chaitanya Chakka, Deepti Ghadiyaram
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02236
Pdf URL: https://arxiv.org/pdf/2505.02236
Copy Paste: [[2505.02236]] Improving Physical Object State Representation in Text-to-Image Generative Systems(https://arxiv.org/abs/2505.02236)
Keywords: generative
Abstract: Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully-automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the performance of the fine-tuned models by quantifying the alignment of the generated images to their prompts using GPT4o-mini, and achieve an average absolute improvement of 8+% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts with a specific focus on common objects in various physical states. We demonstrate a significant improvement of an average of 24+% over the baseline on this dataset. We release all evaluation prompts and code.

Title: Quantizing Diffusion Models from a Sampling-Aware Perspective

Authors: Qian Zeng, Jie Song, Yuanyu Wan, Huiqiong Wang, Mingli Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02242
Pdf URL: https://arxiv.org/pdf/2505.02242
Copy Paste: [[2505.02242]] Quantizing Diffusion Models from a Sampling-Aware Perspective(https://arxiv.org/abs/2505.02242)
Keywords: diffusion
Abstract: Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.

Title: Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset

Authors: Jakub Wąsala, Bartłomiej Wrzalski, Kornelia Noculak, Yuliia Tarasenko, Oliwer Krupa, Jan Kocoń, Grzegorz Chodak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02255
Pdf URL: https://arxiv.org/pdf/2505.02255
Copy Paste: [[2505.02255]] Enhancing AI Face Realism: Cost-Efficient Quality Improvement in Distilled Diffusion Models with a Fully Synthetic Dataset(https://arxiv.org/abs/2505.02255)
Keywords: diffusion, generative
Abstract: This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.

Title: Entropy-Guided Sampling of Flat Modes in Discrete Spaces

Authors: Pinaki Mohanty, Riddhiman Bhattacharya, Ruqi Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02296
Pdf URL: https://arxiv.org/pdf/2505.02296
Copy Paste: [[2505.02296]] Entropy-Guided Sampling of Flat Modes in Discrete Spaces(https://arxiv.org/abs/2505.02296)
Keywords: generative
Abstract: Sampling from flat modes in discrete spaces is a crucial yet underexplored problem. Flat modes represent robust solutions and have broad applications in combinatorial optimization and discrete generative modeling. However, existing sampling algorithms often overlook the mode volume and struggle to capture flat modes effectively. To address this limitation, we propose \emph{Entropic Discrete Langevin Proposal} (EDLP), which incorporates local entropy into the sampling process through a continuous auxiliary variable under a joint distribution. The local entropy term guides the discrete sampler toward flat modes with a small overhead. We provide non-asymptotic convergence guarantees for EDLP in locally log-concave discrete distributions. Empirically, our method consistently outperforms traditional approaches across tasks that require sampling from flat basins, including Bernoulli distribution, restricted Boltzmann machines, combinatorial optimization, and binary neural networks.

Title: Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition

Authors: Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu, Qiguang Miao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.02304
Pdf URL: https://arxiv.org/pdf/2505.02304
Copy Paste: [[2505.02304]] Generative Sign-description Prompts with Multi-positive Contrastive Learning for Sign Language Recognition(https://arxiv.org/abs/2505.02304)
Keywords: generative
Abstract: Sign language recognition (SLR) faces fundamental challenges in creating accurate annotations due to the inherent complexity of simultaneous manual and non-manual signals. To the best of our knowledge, this is the first work to integrate generative large language models (LLMs) into SLR tasks. We propose a novel Generative Sign-description Prompts Multi-positive Contrastive learning (GSP-MC) method that leverages retrieval-augmented generation (RAG) with domain-specific LLMs, incorporating multi-step prompt engineering and expert-validated sign language corpora to produce precise multipart descriptions. The GSP-MC method also employs a dual-encoder architecture to bidirectionally align hierarchical skeleton features with multiple text descriptions (global, synonym, and part level) through probabilistic matching. Our approach combines global and part-level losses, optimizing KL divergence to ensure robust alignment across all relevant text-skeleton pairs while capturing both sign-level semantics and detailed part dynamics. Experiments demonstrate state-of-the-art performance against existing methods on the Chinese SLR500 (reaching 97.1%) and Turkish AUTSL datasets (97.07% accuracy). The method's cross-lingual effectiveness highlight its potential for developing inclusive communication technologies.

Title: VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Authors: Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2505.02331
Pdf URL: https://arxiv.org/pdf/2505.02331
Copy Paste: [[2505.02331]] VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection(https://arxiv.org/abs/2505.02331)
Keywords: self-supervised
Abstract: Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.

Title: RM-R1: Reward Modeling as Reasoning

Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02387
Pdf URL: https://arxiv.org/pdf/2505.02387
Copy Paste: [[2505.02387]] RM-R1: Reward Modeling as Reasoning(https://arxiv.org/abs/2505.02387)
Keywords: generative
Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance of generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at this https URL.

Title: Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection

Authors: Sungheon Jeong, Jihong Park, Mohsen Imani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02393
Pdf URL: https://arxiv.org/pdf/2505.02393
Copy Paste: [[2505.02393]] Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection(https://arxiv.org/abs/2505.02393)
Keywords: anomaly
Abstract: Most existing video anomaly detectors rely solely on RGB frames, which lack the temporal resolution needed to capture abrupt or transient motion cues, key indicators of anomalous events. To address this limitation, we propose Image-Event Fusion for Video Anomaly Detection (IEF-VAD), a framework that synthesizes event representations directly from RGB videos and fuses them with image features through a principled, uncertainty-aware process. The system (i) models heavy-tailed sensor noise with a Student`s-t likelihood, deriving value-level inverse-variance weights via a Laplace approximation; (ii) applies Kalman-style frame-wise updates to balance modalities over time; and (iii) iteratively refines the fused latent state to erase residual cross-modal noise. Without any dedicated event sensor or frame-level labels, IEF-VAD sets a new state of the art across multiple real-world anomaly detection benchmarks. These findings highlight the utility of synthetic event representations in emphasizing motion cues that are often underrepresented in RGB frames, enabling accurate and robust video understanding across diverse applications without requiring dedicated event sensors. Code and models are available at this https URL.

Title: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models

Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.02417
Pdf URL: https://arxiv.org/pdf/2505.02417
Copy Paste: [[2505.02417]] T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models(https://arxiv.org/abs/2505.02417)
Keywords: diffusion
Abstract: Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-proposed time series captions, which are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.

Title: Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

Authors: Biao Gong, Cheng Zou, Dandan Zheng, Hu Yu, Jingdong Chen, Jianxin Sun, Junbo Zhao, Jun Zhou, Kaixiang Ji, Lixiang Ru, Libin Wang, Qingpei Guo, Rui Liu, Weilong Chai, Xinyu Xiao, Ziyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02471
Pdf URL: https://arxiv.org/pdf/2505.02471
Copy Paste: [[2505.02471]] Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction(https://arxiv.org/abs/2505.02471)
Keywords: diffusion
Abstract: We introduce Ming-Lite-Uni, an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel multi-scale learnable tokens and multi-scale representation alignment strategy. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results demonstrate the strong performance of Ming-Lite-Uni and illustrate the impressive fluid nature of its interactive process. All code and model weights are open-sourced to foster further exploration within the community. Notably, this work aligns with concurrent multimodal AI milestones - such as ChatGPT-4o with native image generation updated in March 25, 2025 - underscoring the broader significance of unified models like Ming-Lite-Uni on the path toward AGI. Ming-Lite-Uni is in alpha stage and will soon be further refined.

Title: Text to Image Generation and Editing: A Survey

Authors: Pengfei Yang, Ngai-Man Cheung, Xinda Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02527
Pdf URL: https://arxiv.org/pdf/2505.02527
Copy Paste: [[2505.02527]] Text to Image Generation and Editing: A Survey(https://arxiv.org/abs/2505.02527)
Keywords: diffusion, foundation model
Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce four foundation model architectures of T2I (autoregression, non-autoregression, GAN and diffusion) and the commonly used key technologies (autoencoder, attention and classifier-free guidance). Secondly, we systematically compare the methods of these studies in two directions, T2I generation and T2I editing, including the encoders and the key technologies they use. In addition, we also compare the performance of these researches side by side in terms of datasets, evaluation metrics, training resources, and inference speed. In addition to the four foundation models, we survey other works on T2I, such as energy-based models and recent Mamba and multimodality. We also investigate the potential social impact of T2I and provide some solutions. Finally, we propose unique insights of improving the performance of T2I models and possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and stimulate continued progress in this field.

Title: Bielik v3 Small: Technical Report

Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.02550
Pdf URL: https://arxiv.org/pdf/2505.02550
Copy Paste: [[2505.02550]] Bielik v3 Small: Technical Report(https://arxiv.org/abs/2505.02550)
Keywords: generative
Abstract: We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.

Title: Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

Authors: Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02567
Pdf URL: https://arxiv.org/pdf/2505.02567
Copy Paste: [[2505.02567]] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities(https://arxiv.org/abs/2505.02567)
Keywords: diffusion
Abstract: Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey will be available on GitHub soon.

Title: RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet

Authors: Eliraz Orfaig, Inna Stainvas, Igal Bilik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02586
Pdf URL: https://arxiv.org/pdf/2505.02586
Copy Paste: [[2505.02586]] RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet(https://arxiv.org/abs/2505.02586)
Keywords: diffusion
Abstract: This work introduces RGBX-DiffusionDet, an object detection framework extending the DiffusionDet model to fuse the heterogeneous 2D data (X) with RGB imagery via an adaptive multimodal encoder. To enable cross-modal interaction, we design the dynamic channel reduction within a convolutional block attention module (DCR-CBAM), which facilitates cross-talk between subnetworks by dynamically highlighting salient channel features. Furthermore, the dynamic multi-level aggregation block (DMLAB) is proposed to refine spatial feature representations through adaptive multiscale fusion. Finally, novel regularization losses that enforce channel saliency and spatial selectivity are introduced, leading to compact and discriminative feature embeddings. Extensive experiments using RGB-Depth (KITTI), a novel annotated RGB-Polarimetric dataset, and RGB-Infrared (M$^3$FD) benchmark dataset were conducted. We demonstrate consistent superiority of the proposed approach over the baseline RGB-only DiffusionDet. The modular architecture maintains the original decoding complexity, ensuring efficiency. These results establish the proposed RGBX-DiffusionDet as a flexible multimodal object detection approach, providing new insights into integrating diverse 2D sensing modalities into diffusion-based detection pipelines.

Title: Mirror Mean-Field Langevin Dynamics

Authors: Anming Gu, Juno Kim
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2505.02621
Pdf URL: https://arxiv.org/pdf/2505.02621
Copy Paste: [[2505.02621]] Mirror Mean-Field Langevin Dynamics(https://arxiv.org/abs/2505.02621)
Keywords: diffusion
Abstract: The mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over $\mathbb{R}^d$, and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of $\mathbb{R}^d$ by proposing the \emph{mirror mean-field Langevin dynamics} (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

Title: Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models

Authors: Sassan Mokhtar, Arian Mousakhan, Silvio Galesso, Jawad Tayyub, Thomas Brox
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02626
Pdf URL: https://arxiv.org/pdf/2505.02626
Copy Paste: [[2505.02626]] Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models(https://arxiv.org/abs/2505.02626)
Keywords: anomaly
Abstract: Recent advances in visual industrial anomaly detection have demonstrated exceptional performance in identifying and segmenting anomalous regions while maintaining fast inference speeds. However, anomaly classification-distinguishing different types of anomalies-remains largely unexplored despite its critical importance in real-world inspection tasks. To address this gap, we propose VELM, a novel LLM-based pipeline for anomaly classification. Given the critical importance of inference speed, we first apply an unsupervised anomaly detection method as a vision expert to assess the normality of an observation. If an anomaly is detected, the LLM then classifies its type. A key challenge in developing and evaluating anomaly classification models is the lack of precise annotations of anomaly classes in existing datasets. To address this limitation, we introduce MVTec-AC and VisA-AC, refined versions of the widely used MVTec-AD and VisA datasets, which include accurate anomaly class labels for rigorous evaluation. Our approach achieves a state-of-the-art anomaly classification accuracy of 80.4% on MVTec-AD, exceeding the prior baselines by 5%, and 84% on MVTec-AC, demonstrating the effectiveness of VELM in understanding and categorizing anomalies. We hope our methodology and benchmark inspire further research in anomaly classification, helping bridge the gap between detection and comprehensive anomaly characterization.

Title: MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02648
Pdf URL: https://arxiv.org/pdf/2505.02648
Copy Paste: [[2505.02648]] MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation(https://arxiv.org/abs/2505.02648)
Keywords: diffusion
Abstract: Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.

Title: Sim2Real in endoscopy segmentation with a novel structure aware image translation

Authors: Clara Tomasini, Luis Riazuelo, Ana C. Murillo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02654
Pdf URL: https://arxiv.org/pdf/2505.02654
Copy Paste: [[2505.02654]] Sim2Real in endoscopy segmentation with a novel structure aware image translation(https://arxiv.org/abs/2505.02654)
Keywords: generative
Abstract: Automatic segmentation of anatomical landmarks in endoscopic images can provide assistance to doctors and surgeons for diagnosis, treatments or medical training. However, obtaining the annotations required to train commonly used supervised learning methods is a tedious and difficult task, in particular for real images. While ground truth annotations are easier to obtain for synthetic data, models trained on such data often do not generalize well to real data. Generative approaches can add realistic texture to it, but face difficulties to maintain the structure of the original scene. The main contribution in this work is a novel image translation model that adds realistic texture to simulated endoscopic images while keeping the key scene layout information. Our approach produces realistic images in different endoscopy scenarios. We demonstrate these images can effectively be used to successfully train a model for a challenging end task without any real labeled data. In particular, we demonstrate our approach for the task of fold segmentation in colonoscopy images. Folds are key anatomical landmarks that can occlude parts of the colon mucosa and possible polyps. Our approach generates realistic images maintaining the shape and location of the original folds, after the image-style-translation, better than existing methods. We run experiments both on a novel simulated dataset for fold segmentation, and real data from the EndoMapper (EM) dataset. All our new generated data and new EM metadata is being released to facilitate further research, as no public benchmark is currently available for the task of fold segmentation.

Title: Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data

Authors: Saeed Shurrab, Aadim Nepal, Terrence J. Lee-St. John, Nicola G. Ghazi, Bartlomiej Piechowski-Jozwiak, Farah E. Shamout
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02677
Pdf URL: https://arxiv.org/pdf/2505.02677
Copy Paste: [[2505.02677]] Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data(https://arxiv.org/abs/2505.02677)
Keywords: self-supervised, foundation model
Abstract: Stroke is a major public health problem, affecting millions worldwide. Deep learning has recently demonstrated promise for enhancing the diagnosis and risk prediction of stroke. However, existing methods rely on costly medical imaging modalities, such as computed tomography. Recent studies suggest that retinal imaging could offer a cost-effective alternative for cerebrovascular health assessment due to the shared clinical pathways between the retina and the brain. Hence, this study explores the impact of leveraging retinal images and clinical data for stroke detection and risk prediction. We propose a multimodal deep neural network that processes Optical Coherence Tomography (OCT) and infrared reflectance retinal scans, combined with clinical data, such as demographics, vital signs, and diagnosis codes. We pretrained our model using a self-supervised learning framework using a real-world dataset consisting of $37$ k scans, and then fine-tuned and evaluated the model using a smaller labeled subset. Our empirical findings establish the predictive ability of the considered modalities in detecting lasting effects in the retina associated with acute stroke and forecasting future risk within a specific time horizon. The experimental results demonstrate the effectiveness of our proposed framework by achieving $5$\% AUROC improvement as compared to the unimodal image-only baseline, and $8$\% improvement compared to an existing state-of-the-art foundation model. In conclusion, our study highlights the potential of retinal imaging in identifying high-risk patients and improving long-term outcomes.

Title: fastabx: A library for efficient computation of ABX discriminability

Authors: Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.02692
Pdf URL: https://arxiv.org/pdf/2505.02692
Copy Paste: [[2505.02692]] fastabx: A library for efficient computation of ABX discriminability(https://arxiv.org/abs/2505.02692)
Keywords: self-supervised
Abstract: We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at this https URL.

Title: Using Knowledge Graphs to harvest datasets for efficient CLIP model training

Authors: Simon Ging, Sebastian Walter, Jelena Bratulić, Johannes Dienert, Hannah Bast, Thomas Brox
Subjects: cs.CV, cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.02746
Pdf URL: https://arxiv.org/pdf/2505.02746
Copy Paste: [[2505.02746]] Using Knowledge Graphs to harvest datasets for efficient CLIP model training(https://arxiv.org/abs/2505.02746)
Keywords: foundation model
Abstract: Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models -- especially in areas that even the largest CLIP models do not cover well -- and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.

Title: Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models

Authors: Yankai Jiang, Peng Zhang, Donglin Yang, Yuan Tian, Hai Lin, Xiaosong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02753
Pdf URL: https://arxiv.org/pdf/2505.02753
Copy Paste: [[2505.02753]] Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models(https://arxiv.org/abs/2505.02753)
Keywords: diffusion, anomaly
Abstract: We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Codes are available at this https URL.

Title: Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models

Authors: Matthew Dahl
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.02763
Pdf URL: https://arxiv.org/pdf/2505.02763
Copy Paste: [[2505.02763]] Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models(https://arxiv.org/abs/2505.02763)
Keywords: in-context
Abstract: Legal practice requires careful adherence to procedural rules. In the United States, few are more complex than those found in The Bluebook: A Uniform System of Citation. Compliance with this system's 500+ pages of byzantine formatting instructions is the raison d'etre of thousands of student law review editors and the bete noire of lawyers everywhere. To evaluate whether large language models (LLMs) are able to adhere to the procedures of such a complicated system, we construct an original dataset of 866 Bluebook tasks and test flagship LLMs from OpenAI, Anthropic, Google, Meta, and DeepSeek. We show (1) that these models produce fully compliant Bluebook citations only 69%-74% of the time and (2) that in-context learning on the Bluebook's underlying system of rules raises accuracy only to 77%. These results caution against using off-the-shelf LLMs to automate aspects of the law where fidelity to procedure is paramount.

Title: Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Authors: Kuofeng Gao, Yufei Zhu, Yiming Li, Jiawang Bai, Yong Yang, Zhifeng Li, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.02824
Pdf URL: https://arxiv.org/pdf/2505.02824
Copy Paste: [[2505.02824]] Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.02824)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.

Title: No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.02831
Pdf URL: https://arxiv.org/pdf/2505.02831
Copy Paste: [[2505.02831]] No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves(https://arxiv.org/abs/2505.02831)
Keywords: diffusion, foundation model, generative
Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance generation quality of the diffusion transformers. However, existing approaches necessitate to either introduce an additional and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation A}lignment (SRA), a simple yet straightforward method that obtain representation guidance through a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in earlier layer with higher noise to that in later layer with lower noise to progressively enhance the overall representation learning during only generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that heavily dependent on powerful external representation priors.