2025-05-23

Title: Generative AI for Autonomous Driving: A Review

Authors: Katharina Winter, Abhishek Vivekanandan, Rupert Polley, Yinzhe Shen, Christian Schlauch, Mohamed-Khalil Bouzidi, Bojan Derajic, Natalie Grabowsky, Annajoyce Mariani, Dennis Rochau, Giovanni Lucente, Harsh Yadav, Firas Mualla, Adam Molin, Sebastian Bernhard, Christian Wirth, Ömer Şahin Taş, Nadja Klein, Fabian B. Flohr, Hanno Gottschalk
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.15863
Pdf URL: https://arxiv.org/pdf/2505.15863
Copy Paste: [[2505.15863]] Generative AI for Autonomous Driving: A Review(https://arxiv.org/abs/2505.15863)
Keywords: diffusion, generative
Abstract: Generative AI (GenAI) is rapidly advancing the field of Autonomous Driving (AD), extending beyond traditional applications in text, image, and video generation. We explore how generative models can enhance automotive tasks, such as static map creation, dynamic scenario generation, trajectory forecasting, and vehicle motion planning. By examining multiple generative approaches ranging from Variational Autoencoder (VAEs) over Generative Adversarial Networks (GANs) and Invertible Neural Networks (INNs) to Generative Transformers (GTs) and Diffusion Models (DMs), we highlight and compare their capabilities and limitations for AD-specific applications. Additionally, we discuss hybrid methods integrating conventional techniques with generative approaches, and emphasize their improved adaptability and robustness. We also identify relevant datasets and outline open research questions to guide future developments in GenAI. Finally, we discuss three core challenges: safety, interpretability, and realtime capabilities, and present recommendations for image generation, dynamic scenario generation, and planning.

Title: Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities

Authors: Can Rong, Xin Zhang, Yanxin Xi, Hongjie Sui, Jingtao Ding, Yong Li
Subjects: cs.CV, cs.CY, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15870
Pdf URL: https://arxiv.org/pdf/2505.15870
Copy Paste: [[2505.15870]] Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities(https://arxiv.org/abs/2505.15870)
Keywords: diffusion, foundation model
Abstract: Commuting Origin-destination~(OD) flows, capturing daily population mobility of citizens, are vital for sustainable development across cities around the world. However, it is challenging to obtain the data due to the high cost of travel surveys and privacy concerns. Surprisingly, we find that satellite imagery, publicly available across the globe, contains rich urban semantic signals to support high-quality OD flow generation, with over 98\% expressiveness of traditional multisource hard-to-collect urban sociodemographic, economics, land use, and point of interest data. This inspires us to design a novel data generator, GlODGen, which can generate OD flow data for any cities of interest around the world. Specifically, GlODGen first leverages Vision-Language Geo-Foundation Models to extract urban semantic signals related to human mobility from satellite imagery. These features are then combined with population data to form region-level representations, which are used to generate OD flows via graph diffusion models. Extensive experiments on 4 continents and 6 representative cities show that GlODGen has great generalizability across diverse urban environments on different continents and can generate OD flow data for global cities highly consistent with real-world mobility data. We implement GlODGen as an automated tool, seamlessly integrating data acquisition and curation, urban semantic feature extraction, and OD flow generation together. It has been released at this https URL.

Title: Challenger: Affordable Adversarial Driving Video Generation

Authors: Zhiyuan Xu, Bohan Li, Huan-ang Gao, Mingju Gao, Yong Chen, Ming Liu, Chenxu Yan, Hang Zhao, Shuo Feng, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15880
Pdf URL: https://arxiv.org/pdf/2505.15880
Copy Paste: [[2505.15880]] Challenger: Affordable Adversarial Driving Video Generation(https://arxiv.org/abs/2505.15880)
Keywords: diffusion
Abstract: Generating photorealistic driving videos has seen significant progress recently, but current methods largely focus on ordinary, non-adversarial scenarios. Meanwhile, efforts to generate adversarial driving scenarios often operate on abstract trajectory or BEV representations, falling short of delivering realistic sensor data that can truly stress-test autonomous driving (AD) systems. In this work, we introduce Challenger, a framework that produces physically plausible yet photorealistic adversarial driving videos. Generating such videos poses a fundamental challenge: it requires jointly optimizing over the space of traffic interactions and high-fidelity sensor observations. Challenger makes this affordable through two techniques: (1) a physics-aware multi-round trajectory refinement process that narrows down candidate adversarial maneuvers, and (2) a tailored trajectory scoring function that encourages realistic yet adversarial behavior while maintaining compatibility with downstream video synthesis. As tested on the nuScenes dataset, Challenger generates a diverse range of aggressive driving scenarios-including cut-ins, sudden lane changes, tailgating, and blind spot intrusions-and renders them into multiview photorealistic videos. Extensive evaluations show that these scenarios significantly increase the collision rate of state-of-the-art end-to-end AD models (UniAD, VAD, SparseDrive, and DiffusionDrive), and importantly, adversarial behaviors discovered for one model often transfer to others.

Title: MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Subjects: cs.LG, cs.AI, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.15946
Pdf URL: https://arxiv.org/pdf/2505.15946
Copy Paste: [[2505.15946]] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding(https://arxiv.org/abs/2505.15946)
Keywords: diffusion, generative
Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: this https URL.

Title: OViP: Online Vision-Language Preference Learning

Authors: Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15963
Pdf URL: https://arxiv.org/pdf/2505.15963
Copy Paste: [[2505.15963]] OViP: Online Vision-Language Preference Learning(https://arxiv.org/abs/2505.15963)
Keywords: diffusion
Abstract: Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.

Title: Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders

Authors: Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15970
Pdf URL: https://arxiv.org/pdf/2505.15970
Copy Paste: [[2505.15970]] Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders(https://arxiv.org/abs/2505.15970)
Keywords: foundation model
Abstract: The ImageNet hierarchy provides a structured taxonomy of object categories, offering a valuable lens through which to analyze the representations learned by deep vision models. In this work, we conduct a comprehensive analysis of how vision models encode the ImageNet hierarchy, leveraging Sparse Autoencoders (SAEs) to probe their internal representations. SAEs have been widely used as an explanation tool for large language models (LLMs), where they enable the discovery of semantically meaningful features. Here, we extend their use to vision models to investigate whether learned representations align with the ontological structure defined by the ImageNet taxonomy. Our results show that SAEs uncover hierarchical relationships in model activations, revealing an implicit encoding of taxonomic structure. We analyze the consistency of these representations across different layers of the popular vision foundation model DINOv2 and provide insights into how deep vision models internalize hierarchical category information by increasing information in the class token through each layer. Our study establishes a framework for systematic hierarchical analysis of vision model representations and highlights the potential of SAEs as a tool for probing semantic structure in deep networks.

Title: Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

Authors: Qiang Zhu, Kuan Lu, Menghao Huo, Yuxiao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16001
Pdf URL: https://arxiv.org/pdf/2505.16001
Copy Paste: [[2505.16001]] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning(https://arxiv.org/abs/2505.16001)
Keywords: diffusion
Abstract: Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.

Title: Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16024
Pdf URL: https://arxiv.org/pdf/2505.16024
Copy Paste: [[2505.16024]] Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging(https://arxiv.org/abs/2505.16024)
Keywords: diffusion, generative
Abstract: Diffusion trajectory distillation methods aim to accelerate sampling in diffusion models, which produce high-quality outputs but suffer from slow sampling speeds. These methods train a student model to approximate the multi-step denoising process of a pretrained teacher model in a single step, enabling one-shot generation. However, theoretical insights into the trade-off between different distillation strategies and generative quality remain limited, complicating their optimization and selection. In this work, we take a first step toward addressing this gap. Specifically, we reinterpret trajectory distillation as an operator merging problem in the linear regime, where each step of the teacher model is represented as a linear operator acting on noisy data. These operators admit a clear geometric interpretation as projections and rescalings corresponding to the noise schedule. During merging, signal shrinkage occurs as a convex combination of operators, arising from both discretization and limited optimization time of the student model. We propose a dynamic programming algorithm to compute the optimal merging strategy that maximally preserves signal fidelity. Additionally, we demonstrate the existence of a sharp phase transition in the optimal strategy, governed by data covariance structures. Our findings enhance the theoretical understanding of diffusion trajectory distillation and offer practical insights for improving distillation strategies.

Title: OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

Authors: Burak Erinç Çetin, Yıldırım Özen, Elif Naz Demiryılmaz, Kaan Engür, Cagri Toraman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16036
Pdf URL: https://arxiv.org/pdf/2505.16036
Copy Paste: [[2505.16036]] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models(https://arxiv.org/abs/2505.16036)
Keywords: generative
Abstract: Generative large language models present significant potential but also raise critical ethical concerns. Most studies focus on narrow ethical dimensions, and also limited diversity of languages and models. To address these gaps, we conduct a broad ethical evaluation of 29 recent open-source large language models using a novel data collection including four ethical aspects: Robustness, reliability, safety, and fairness. We analyze model behavior in both a commonly used language, English, and a low-resource language, Turkish. Our aim is to provide a comprehensive ethical assessment and guide safer model development by filling existing gaps in evaluation breadth, language coverage, and model diversity. Our experimental results, based on LLM-as-a-Judge, reveal that optimization efforts for many open-source models appear to have prioritized safety and fairness, and demonstrated good robustness while reliability remains a concern. We demonstrate that ethical evaluation can be effectively conducted independently of the language used. In addition, models with larger parameter counts tend to exhibit better ethical performance, with Gemma and Qwen models demonstrating the most ethical behavior among those evaluated.

Title: An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection

Authors: Shuvashis Sarker, Shamim Rahim Refat, Faika Fairuj Preotee, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16039
Pdf URL: https://arxiv.org/pdf/2505.16039
Copy Paste: [[2505.16039]] An Exploratory Approach Towards Investigating and Explaining Vision Transformer and Transfer Learning for Brain Disease Detection(https://arxiv.org/abs/2505.16039)
Keywords: generative
Abstract: The brain is a highly complex organ that manages many important tasks, including movement, memory and thinking. Brain-related conditions, like tumors and degenerative disorders, can be hard to diagnose and treat. Magnetic Resonance Imaging (MRI) serves as a key tool for identifying these conditions, offering high-resolution images of brain structures. Despite this, interpreting MRI scans can be complicated. This study tackles this challenge by conducting a comparative analysis of Vision Transformer (ViT) and Transfer Learning (TL) models such as VGG16, VGG19, Resnet50V2, MobilenetV2 for classifying brain diseases using MRI data from Bangladesh based dataset. ViT, known for their ability to capture global relationships in images, are particularly effective for medical imaging tasks. Transfer learning helps to mitigate data constraints by fine-tuning pre-trained models. Furthermore, Explainable AI (XAI) methods such as GradCAM, GradCAM++, LayerCAM, ScoreCAM, and Faster-ScoreCAM are employed to interpret model predictions. The results demonstrate that ViT surpasses transfer learning models, achieving a classification accuracy of 94.39%. The integration of XAI methods enhances model transparency, offering crucial insights to aid medical professionals in diagnosing brain diseases with greater precision.

Title: Mesh-free sparse identification of nonlinear dynamics

Authors: Mars Liyao Gao, J. Nathan Kutz, Bernat Font
Subjects: cs.LG, cs.AI, physics.data-an
Abstract URL: https://arxiv.org/abs/2505.16058
Pdf URL: https://arxiv.org/pdf/2505.16058
Copy Paste: [[2505.16058]] Mesh-free sparse identification of nonlinear dynamics(https://arxiv.org/abs/2505.16058)
Keywords: diffusion
Abstract: Identifying the governing equations of a dynamical system is one of the most important tasks for scientific modeling. However, this procedure often requires high-quality spatio-temporal data uniformly sampled on structured grids. In this paper, we propose mesh-free SINDy, a novel algorithm which leverages the power of neural network approximation as well as auto-differentiation to identify governing equations from arbitrary sensor placements and non-uniform temporal data sampling. We show that mesh-free SINDy is robust to high noise levels and limited data while remaining computationally efficient. In our implementation, the training procedure is straight-forward and nearly free of hyperparameter tuning, making mesh-free SINDy widely applicable to many scientific and engineering problems. In the experiments, we demonstrate its effectiveness on a series of PDEs including the Burgers' equation, the heat equation, the Korteweg-De Vries equation and the 2D advection-diffusion equation. We conduct detailed numerical experiments on all datasets, varying the noise levels and number of samples, and we also compare our approach to previous state-of-the-art methods. It is noteworthy that, even in high-noise and low-data scenarios, mesh-free SINDy demonstrates robust PDE discovery, achieving successful identification with up to 75% noise for the Burgers' equation using 5,000 samples and with as few as 100 samples and 1% noise. All of this is achieved within a training time of under one minute.

Title: Scalable Graph Generative Modeling via Substructure Sequences

Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.16130
Pdf URL: https://arxiv.org/pdf/2505.16130
Copy Paste: [[2505.16130]] Scalable Graph Generative Modeling via Substructure Sequences(https://arxiv.org/abs/2505.16130)
Keywords: foundation model, generative
Abstract: Graph neural networks (GNNs) has been predominantly driven by message-passing, where node representations are iteratively updated via local neighborhood aggregation. Despite their success, message-passing suffers from fundamental limitations -- including constrained expressiveness, over-smoothing, over-squashing, and limited capacity to model long-range dependencies. These issues hinder scalability: increasing data size or model size often fails to yield improved performance, limiting the viability of GNNs as backbones for graph foundation models. In this work, we explore pathways beyond message-passing and introduce Generative Graph Pattern Machine (G$^2$PM), a generative Transformer pre-training framework for graphs. G$^2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures, and employs generative pre-training over the sequences to learn generalizable, transferable representations. Empirically, G$^2$PM demonstrates strong scalability: on the ogbn-arxiv benchmark, it continues to improve with model sizes up to 60M parameters, outperforming prior generative approaches that plateau at significantly smaller scales (e.g., 3M). In addition, we systematically analyze the model design space, highlighting key architectural choices that contribute to its scalability and generalization. Across diverse tasks -- including node classification, graph classification, and transfer learning -- G$^2$PM consistently outperforms strong baselines, establishing a compelling foundation for scalable graph learning. The code and dataset are available at this https URL.

Title: Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16142
Pdf URL: https://arxiv.org/pdf/2505.16142
Copy Paste: [[2505.16142]] Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.16142)
Keywords: generative
Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Title: TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Authors: Yuhao Xue, Zhifei Zhang, Xinyang Jiang, Yifei Shen, Junyao Gao, Wentao Gu, Jiale Zhao, Miaojing Shi, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16166
Pdf URL: https://arxiv.org/pdf/2505.16166
Copy Paste: [[2505.16166]] TRAIL: Transferable Robust Adversarial Images via Latent diffusion(https://arxiv.org/abs/2505.16166)
Keywords: diffusion
Abstract: Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the distribution shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address the challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images from a distribution of images with adversarial features and closely resembles the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.

Title: Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss

Authors: Abhay Kumara Sri Krishna Nandiraju, Gondy Leroy, David Kauchak, Arif Ahmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16172
Pdf URL: https://arxiv.org/pdf/2505.16172
Copy Paste: [[2505.16172]] Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss(https://arxiv.org/abs/2505.16172)
Keywords: generative
Abstract: Understanding health information is essential in achieving and maintaining a healthy life. We focus on simplifying health information for better understanding. With the availability of generative AI, the simplification process has become efficient and of reasonable quality, however, the algorithms remove information that may be crucial for comprehension. In this study, we compare generative AI to detect missing information in simplified text, evaluate its importance, and fix the text with the missing information. We collected 50 health information texts and simplified them using gpt-4-0613. We compare five approaches to identify missing elements and regenerate the text by inserting the missing elements. These five approaches involve adding missing entities and missing words in various ways: 1) adding all the missing entities, 2) adding all missing words, 3) adding the top-3 entities ranked by gpt-4-0613, and 4, 5) serving as controls for comparison, adding randomly chosen entities. We use cosine similarity and ROUGE scores to evaluate the semantic similarity and content overlap between the original, simplified, and reconstructed simplified text. We do this for both summaries and full text. Overall, we find that adding missing entities improves the text. Adding all the missing entities resulted in better text regeneration, which was better than adding the top-ranked entities or words, or random words. Current tools can identify these entities, but are not valuable in ranking them.

Title: Erased or Dormant? Rethinking Concept Erasure Through Reversibility

Authors: Ping Liu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16174
Pdf URL: https://arxiv.org/pdf/2505.16174
Copy Paste: [[2505.16174]] Erased or Dormant? Rethinking Concept Erasure Through Reversibility(https://arxiv.org/abs/2505.16174)
Keywords: diffusion, generative
Abstract: To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.

Title: Understanding Generative AI Capabilities in Everyday Image Editing Tasks

Authors: Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, Anh Totti Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16181
Pdf URL: https://arxiv.org/pdf/2505.16181
Copy Paste: [[2505.16181]] Understanding Generative AI Capabilities in Everyday Image Editing Tasks(https://arxiv.org/abs/2505.16181)
Keywords: generative
Abstract: Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit community, which collected 305k PSR-wizard edits. According to human ratings, approximately only 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the other side of the table, VLM judges (e.g., o1) perform differently from human judges and may prefer AI edits more than human edits. Code and qualitative examples are available at: this https URL

Title: An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16193
Pdf URL: https://arxiv.org/pdf/2505.16193
Copy Paste: [[2505.16193]] An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability(https://arxiv.org/abs/2505.16193)
Keywords: in-context
Abstract: The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competent as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.

Title: DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution

Authors: Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16239
Pdf URL: https://arxiv.org/pdf/2505.16239
Copy Paste: [[2505.16239]] DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution(https://arxiv.org/abs/2505.16239)
Keywords: diffusion
Abstract: Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require, make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.

Title: IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection

Authors: Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16258
Pdf URL: https://arxiv.org/pdf/2505.16258
Copy Paste: [[2505.16258]] IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection(https://arxiv.org/abs/2505.16258)
Keywords: in-context
Abstract: Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: this https URL

Title: Interpretable Anomaly Detection in Encrypted Traffic Using SHAP with Machine Learning Models

Authors: Kalindi Singh, Aayush Kashyap, Aswani Kumar Cherukuri
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16261
Pdf URL: https://arxiv.org/pdf/2505.16261
Copy Paste: [[2505.16261]] Interpretable Anomaly Detection in Encrypted Traffic Using SHAP with Machine Learning Models(https://arxiv.org/abs/2505.16261)
Keywords: anomaly
Abstract: The widespread adoption of encrypted communication protocols such as HTTPS and TLS has enhanced data privacy but also rendered traditional anomaly detection techniques less effective, as they often rely on inspecting unencrypted payloads. This study aims to develop an interpretable machine learning-based framework for anomaly detection in encrypted network traffic. This study proposes a model-agnostic framework that integrates multiple machine learning classifiers, with SHapley Additive exPlanations SHAP to ensure post-hoc model interpretability. The models are trained and evaluated on three benchmark encrypted traffic datasets. Performance is assessed using standard classification metrics, and SHAP is used to explain model predictions by attributing importance to individual input features. SHAP visualizations successfully revealed the most influential traffic features contributing to anomaly predictions, enhancing the transparency and trustworthiness of the models. Unlike conventional approaches that treat machine learning as a black box, this work combines robust classification techniques with explainability through SHAP, offering a novel interpretable anomaly detection system tailored for encrypted traffic environments. While the framework is generalizable, real-time deployment and performance under adversarial conditions require further investigation. Future work may explore adaptive models and real-time interpretability in operational network environments. This interpretable anomaly detection framework can be integrated into modern security operations for encrypted environments, allowing analysts not only to detect anomalies with high precision but also to understand why a model made a particular decision a crucial capability in compliance-driven and mission-critical settings.

Title: Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

Authors: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16265
Pdf URL: https://arxiv.org/pdf/2505.16265
Copy Paste: [[2505.16265]] Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models(https://arxiv.org/abs/2505.16265)
Keywords: generative
Abstract: Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm-up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model's long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.

Title: SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

Authors: Guohao Huo, Ruiting Dai, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16304
Pdf URL: https://arxiv.org/pdf/2505.16304
Copy Paste: [[2505.16304]] SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation(https://arxiv.org/abs/2505.16304)
Keywords: foundation model
Abstract: To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet. The framework achieves cross-modal feature collaborative learning by integrating the vision foundation model SAM2, the state-space model Mamba, and the classical UNet. To mitigate domain discrepancies between medical and natural images, a Dynamic Feature Fusion Refiner is designed, which enhances small lesion feature extraction through multi-scale pooling and a dual-path calibration mechanism across channel and spatial dimensions. Furthermore, a Heterogeneous Omni-Attention Convergence Module (HOACM) is introduced, combining global contextual attention with branch-selective emphasis mechanisms to effectively fuse SAM2's local positional semantics and Mamba's long-range dependency modeling capabilities. Experiments on the ACDC cardiac MRI dataset demonstrate that the proposed model achieves a Dice coefficient of 0.9103 and an HD95 boundary error of 1.0859 mm, significantly outperforming existing methods, particularly in boundary localization for complex pathological structures such as right ventricular anomalies. This work provides an efficient and reliable solution for automated cardiac disease diagnosis, and the code will be open-sourced.

Title: Paired and Unpaired Image to Image Translation using Generative Adversarial Networks

Authors: Gaurav Kumar, Soham Satyadharma, Harpreet Singh
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16310
Pdf URL: https://arxiv.org/pdf/2505.16310
Copy Paste: [[2505.16310]] Paired and Unpaired Image to Image Translation using Generative Adversarial Networks(https://arxiv.org/abs/2505.16310)
Keywords: generative
Abstract: Image to image translation is an active area of research in the field of computer vision, enabling the generation of new images with different styles, textures, or resolutions while preserving their characteristic properties. Recent architectures leverage Generative Adversarial Networks (GANs) to transform input images from one domain to another. In this work, we focus on the study of both paired and unpaired image translation across multiple image domains. For the paired task, we used a conditional GAN model, and for the unpaired task, we trained it using cycle consistency loss. We experimented with different types of loss functions, multiple Patch-GAN sizes, and model architectures. New quantitative metrics - precision, recall, and FID score - were used for analysis. In addition, a qualitative study of the results of different experiments was conducted.

Title: NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Authors: Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Jianhui Sun, Xinli Yue, Tianyi Wang, Huan Hou, Junda Lu, Xinyang Huang, Zitang Zhou, Zijian Zhang, Xuhui Zheng, Xuecheng Wu, Chong Peng, Xuezhi Cao, Trong-Hieu Nguyen-Mau, Minh-Hoang Le, Minh-Khoa Le-Phan, Duy-Nam Ly, Hai-Dang Nguyen, Minh-Triet Tran, Yukang Lin, Yan Hong, Chuanbiao Song, Siyuan Li, Jun Lan, Zhichao Zhang, Xinyue Li, Wei Sun, Zicheng Zhang, Yunhao Li, Xiaohong Liu, Guangtao Zhai, Zitong Xu, Huiyu Duan, Jiarui Wang, Guangji Ma, Liu Yang, Lu Liu, Qiang Hu, Xiongkuo Min, Zichuan Wang, Zhenchen Tang, Bo Peng, Jing Dong, Fengbin Guan, Zihao Yu, Yiting Lu, Wei Luo, Xin Li, Minhao Lin, Haofeng Chen, Xuanxuan He, Kele Xu, Qisheng Xu, Zijian Gao, Tianjiao Wan, Bo-Cheng Qiu, Chih-Chung Hsu, Chia-ming Lee, Yu-Fan Lin, Bo Yu, Zehao Wang, Da Mu, Mingxiu Chen, Junkang Fang, Huamei Sun, Wending Zhao, Zhiyu Wang, Wang Liu, Weikang Yu, Puhong Duan, Bin Sun, Xudong Kang, Shutao Li, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Jiarong He, Zhishan Qiao, Yongqing Huang, Zewen Chen, Zhe Pang, Juan Wang, Jian Guo, Zhizhuo Shao, Ziyu Feng, Bing Li, Weiming Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16314
Pdf URL: https://arxiv.org/pdf/2505.16314
Copy Paste: [[2505.16314]] NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment(https://arxiv.org/abs/2505.16314)
Keywords: generative
Abstract: This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions are received in the development phase, and 507 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion mask. A total of 211 participants have registered in the structure track. A total of 1155 submissions are received in the development phase, and 487 submissions are received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.

Title: TensorAR: Refinement is All You Need in Autoregressive Image Generation

Authors: Cheng Cheng, Lin Song, Yicheng Xiao, Yuxin Chen, Xuchong Zhang, Hongbin Sun, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16324
Pdf URL: https://arxiv.org/pdf/2505.16324
Copy Paste: [[2505.16324]] TensorAR: Refinement is All You Need in Autoregressive Image Generation(https://arxiv.org/abs/2505.16324)
Keywords: diffusion
Abstract: Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.

Title: FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design

Authors: Renjie Wei, Songqiang Xu, Qingyu Guo, Meng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16335
Pdf URL: https://arxiv.org/pdf/2505.16335
Copy Paste: [[2505.16335]] FPQVAR: Floating Point Quantization for Visual Autoregressive Model with FPGA Hardware Co-design(https://arxiv.org/abs/2505.16335)
Keywords: diffusion
Abstract: Visual autoregressive (VAR) modeling has marked a paradigm shift in image generation from next-token prediction to next-scale prediction. VAR predicts a set of tokens at each step from coarse to fine scale, leading to better image quality and faster inference speed compared to existing diffusion models. However, the large parameter size and computation cost hinder its deployment on edge devices. To reduce the memory and computation cost, we propose FPQVAR, an efficient post-training floating-point (FP) quantization framework for VAR featuring algorithm and hardware co-design. At the algorithm level, we first identify the challenges of quantizing VAR. To address them, we propose Dual Format Quantization for the highly imbalanced input activation. We further propose Group-wise Hadamard Transformation and GHT-Aware Learnable Transformation to address the time-varying outlier channels. At the hardware level, we design the first low-bit FP quantizer and multiplier with lookup tables on FPGA and propose the first FPGA-based VAR accelerator featuring low-bit FP computation and an elaborate two-level pipeline. Extensive experiments show that compared to the state-of-the-art quantization method, our proposed FPQVAR significantly improves Fréchet Inception Distance (FID) from 10.83 to 3.58, Inception Score (IS) from 175.9 to 241.5 under 4-bit quantization. FPQVAR also significantly improves the performance of 6-bit quantized VAR, bringing it on par with the FP16 model. Our accelerator on AMD-Xilinx VCK190 FPGA achieves a throughput of 1.1 image/s, which is 3.1x higher than the integer-based accelerator. It also demonstrates 3.6x and 2.8x higher energy efficiency compared to the integer-based accelerator and GPU baseline, respectively.

Title: Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

Authors: Amirreza Mahbod, Rupert Ecker, Ramona Woitek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16338
Pdf URL: https://arxiv.org/pdf/2505.16338
Copy Paste: [[2505.16338]] Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification(https://arxiv.org/abs/2505.16338)
Keywords: foundation model
Abstract: Accurate classification of skin lesions from dermatoscopic images is essential for diagnosis and treatment of skin cancer. In this study, we investigate the utility of a dermatology-specific foundation model, PanDerm, in comparison with two Vision Transformer (ViT) architectures (ViT base and Swin Transformer V2 base) for the task of skin lesion classification. Using frozen features extracted from PanDerm, we apply non-linear probing with three different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and TabNet. For the ViT-based models, we perform full fine-tuning to optimize classification performance. Our experiments on the HAM10000 and MSKCC datasets demonstrate that the PanDerm-based MLP model performs comparably to the fine-tuned Swin transformer model, while fusion of PanDerm and Swin Transformer predictions leads to further performance improvements. Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques.

Title: Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Authors: Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16360
Pdf URL: https://arxiv.org/pdf/2505.16360
Copy Paste: [[2505.16360]] Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation(https://arxiv.org/abs/2505.16360)
Keywords: diffusion, foundation model
Abstract: Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models enable to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models when learned on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.

Title: A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules

Authors: Manuel Ruiz-Botella, Marta Sales-Pardo, Roger Guimerà
Subjects: cs.LG, cs.AI, physics.comp-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.16365
Pdf URL: https://arxiv.org/pdf/2505.16365
Copy Paste: [[2505.16365]] A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules(https://arxiv.org/abs/2505.16365)
Keywords: diffusion
Abstract: Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2M million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, and potential biases and limitations of CoCoGraph.

Title: Privacy-Aware Cyberterrorism Network Analysis using Graph Neural Networks and Federated Learning

Authors: Anas Ali, Mubashar Husain, Peter Hans
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.16371
Pdf URL: https://arxiv.org/pdf/2505.16371
Copy Paste: [[2505.16371]] Privacy-Aware Cyberterrorism Network Analysis using Graph Neural Networks and Federated Learning(https://arxiv.org/abs/2505.16371)
Keywords: anomaly
Abstract: Cyberterrorism poses a formidable threat to digital infrastructures, with increasing reliance on encrypted, decentralized platforms that obscure threat actor activity. To address the challenge of analyzing such adversarial networks while preserving the privacy of distributed intelligence data, we propose a Privacy-Aware Federated Graph Neural Network (PA-FGNN) framework. PA-FGNN integrates graph attention networks, differential privacy, and homomorphic encryption into a robust federated learning pipeline tailored for cyberterrorism network analysis. Each client trains locally on sensitive graph data and exchanges encrypted, noise-perturbed model updates with a central aggregator, which performs secure aggregation and broadcasts global updates. We implement anomaly detection for flagging high-risk nodes and incorporate defenses against gradient poisoning. Experimental evaluations on simulated dark web and cyber-intelligence graphs demonstrate that PA-FGNN achieves over 91\% classification accuracy, maintains resilience under 20\% adversarial client behavior, and incurs less than 18\% communication overhead. Our results highlight that privacy-preserving GNNs can support large-scale cyber threat detection without compromising on utility, privacy, or robustness.

Title: Pose-invariant face recognition via feature-space pose frontalization

Authors: Nikolay Stanishev, Yuhang Lu, Touradj Ebrahimi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16412
Pdf URL: https://arxiv.org/pdf/2505.16412
Copy Paste: [[2505.16412]] Pose-invariant face recognition via feature-space pose frontalization(https://arxiv.org/abs/2505.16412)
Keywords: generative
Abstract: Pose-invariant face recognition has become a challenging problem for modern AI-based face recognition systems. It aims at matching a profile face captured in the wild with a frontal face registered in a database. Existing methods perform face frontalization via either generative models or learning a pose robust feature representation. In this paper, a new method is presented to perform face frontalization and recognition within the feature space. First, a novel feature space pose frontalization module (FSPFM) is proposed to transform profile images with arbitrary angles into frontal counterparts. Second, a new training paradigm is proposed to maximize the potential of FSPFM and boost its performance. The latter consists of a pre-training and an attention-guided fine-tuning stage. Moreover, extensive experiments have been conducted on five popular face recognition benchmarks. Results show that not only our method outperforms the state-of-the-art in the pose-invariant face recognition task but also maintains superior performance in other standard scenarios.

Title: Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

Authors: Soh Takahashi, Masaru Sasaki, Ken Takeda, Masafumi Oizumi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16419
Pdf URL: https://arxiv.org/pdf/2505.16419
Copy Paste: [[2505.16419]] Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment(https://arxiv.org/abs/2505.16419)
Keywords: self-supervised
Abstract: The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional representational similarity analysis is that it estimates optimal fine-grained mappings between the representation of each object in human and model representations. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.

Title: $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion

Authors: Jing Bi, Pinxin Liu, Ali Vosoughi, Jiarui Wu, Jinxi He, Chenliang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16425
Pdf URL: https://arxiv.org/pdf/2505.16425
Copy Paste: [[2505.16425]] $I^2G$: Generating Instructional Illustrations via Text-Conditioned Diffusion(https://arxiv.org/abs/2505.16425)
Keywords: diffusion
Abstract: The effective communication of procedural knowledge remains a significant challenge in natural language processing (NLP), as purely textual instructions often fail to convey complex physical actions and spatial relationships. We address this limitation by proposing a language-driven framework that translates procedural text into coherent visual instructions. Our approach models the linguistic structure of instructional content by decomposing it into goal statements and sequential steps, then conditioning visual generation on these linguistic elements. We introduce three key innovations: (1) a constituency parser-based text encoding mechanism that preserves semantic completeness even with lengthy instructions, (2) a pairwise discourse coherence model that maintains consistency across instruction sequences, and (3) a novel evaluation protocol specifically designed for procedural language-to-image alignment. Our experiments across three instructional datasets (HTStep, CaptainCook4D, and WikiAll) demonstrate that our method significantly outperforms existing baselines in generating visuals that accurately reflect the linguistic content and sequential nature of instructions. This work contributes to the growing body of research on grounding procedural language in visual content, with applications spanning education, task guidance, and multimodal language understanding.

Title: MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM

Authors: Siwei Meng, Yawei Luo, Ping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16456
Pdf URL: https://arxiv.org/pdf/2505.16456
Copy Paste: [[2505.16456]] MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM(https://arxiv.org/abs/2505.16456)
Keywords: diffusion, generative
Abstract: Recent advances in static 3D generation have intensified the demand for physically consistent dynamic 3D content. However, existing video generation models, including diffusion-based methods, often prioritize visual realism while neglecting physical plausibility, resulting in implausible object dynamics. Prior approaches for physics-aware dynamic generation typically rely on large-scale annotated datasets or extensive model fine-tuning, which imposes significant computational and data collection burdens and limits scalability across scenarios. To address these challenges, we present MAGIC, a training-free framework for single-image physical property inference and dynamic generation, integrating pretrained image-to-video diffusion models with iterative LLM-based reasoning. Our framework generates motion-rich videos from a static image and closes the visual-to-physical gap through a confidence-driven LLM feedback loop that adaptively steers the diffusion model toward physics-relevant motion. To translate visual dynamics into controllable physical behavior, we further introduce a differentiable MPM simulator operating directly on 3D Gaussians reconstructed from the single image, enabling physically grounded, simulation-ready outputs without any supervision or model tuning. Experiments show that MAGIC outperforms existing physics-aware generative methods in inference accuracy and achieves greater temporal coherence than state-of-the-art video diffusion models.

Title: Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization

Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16467
Pdf URL: https://arxiv.org/pdf/2505.16467
Copy Paste: [[2505.16467]] Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization(https://arxiv.org/abs/2505.16467)
Keywords: generative
Abstract: Generative Large Language Models (LLMs) infer user's demographic information from subtle cues in the conversation -- a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.

Title: Consistent World Models via Foresight Diffusion

Authors: Yu Zhang, Xingzhuo Guo, Haoran Xu, Mingsheng Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16474
Pdf URL: https://arxiv.org/pdf/2505.16474
Copy Paste: [[2505.16474]] Consistent World Models via Foresight Diffusion(https://arxiv.org/abs/2505.16474)
Keywords: diffusion
Abstract: Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.

Title: Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Authors: Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Zhaofeng He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16512
Pdf URL: https://arxiv.org/pdf/2505.16512
Copy Paste: [[2505.16512]] Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection(https://arxiv.org/abs/2505.16512)
Keywords: diffusion
Abstract: In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion model-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the first large-scale multimodal digital human forgery dataset based on diffusion models. Employing five latest digital human generation methods (Sonic, Hallo, etc.) and voice cloning method, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that the confusion rate between forged and real videos reaches 68%, and existing state-of-the-art (SOTA) detection models exhibit large drops in AUC values on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we further propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves SOTA performance on both the DigiFakeAV and DF-TIMIT datasets. Experiments show that this method effectively identifies covert artifacts through fine-grained analysis of the temporal evolution of facial features in synthetic videos.

Title: Joint Relational Database Generation via Graph-Conditional Diffusion Models

Authors: Mohamed Amine Ketata, David Lüdke, Leo Schwinn, Stephan Günnemann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16527
Pdf URL: https://arxiv.org/pdf/2505.16527
Copy Paste: [[2505.16527]] Joint Relational Database Generation via Graph-Conditional Diffusion Models(https://arxiv.org/abs/2505.16527)
Keywords: diffusion, generative
Abstract: Building generative models for relational databases (RDBs) is important for applications like privacy-preserving data release and augmenting real datasets. However, most prior work either focuses on single-table generation or relies on autoregressive factorizations that impose a fixed table order and generate tables sequentially. This approach limits parallelism, restricts flexibility in downstream applications like missing value imputation, and compounds errors due to commonly made conditional independence assumptions. We propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM). GRDM leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.

Title: HOFT: Householder Orthogonal Fine-tuning

Authors: Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16531
Pdf URL: https://arxiv.org/pdf/2505.16531
Copy Paste: [[2505.16531]] HOFT: Householder Orthogonal Fine-tuning(https://arxiv.org/abs/2505.16531)
Keywords: foundation model
Abstract: Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.

Title: SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion

Authors: Asrar Alruwayqi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16535
Pdf URL: https://arxiv.org/pdf/2505.16535
Copy Paste: [[2505.16535]] SHaDe: Compact and Consistent Dynamic 3D Reconstruction via Tri-Plane Deformation and Latent Diffusion(https://arxiv.org/abs/2505.16535)
Keywords: diffusion, generative
Abstract: We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.

Title: TextureSAM: Towards a Texture Aware Foundation Model for Segmentation

Authors: Inbal Cohen, Boaz Meivar, Peihan Tu, Shai Avidan, Gal Oren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16540
Pdf URL: https://arxiv.org/pdf/2505.16540
Copy Paste: [[2505.16540]] TextureSAM: Towards a Texture Aware Foundation Model for Segmentation(https://arxiv.org/abs/2505.16540)
Keywords: foundation model
Abstract: Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-alternation of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.

Title: Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations

Authors: Trung V. Phan, George A. Kevrekidis, Soledad Villar, Yannis G. Kevrekidis, Juan M. Bello-Rivas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16549
Pdf URL: https://arxiv.org/pdf/2505.16549
Copy Paste: [[2505.16549]] Towards Coordinate- and Dimension-Agnostic Machine Learning for Partial Differential Equations(https://arxiv.org/abs/2505.16549)
Keywords: diffusion
Abstract: The machine learning methods for data-driven identification of partial differential equations (PDEs) are typically defined for a given number of spatial dimensions and a choice of coordinates the data have been collected in. This dependence prevents the learned evolution equation from generalizing to other spaces. In this work, we reformulate the problem in terms of coordinate- and dimension-independent representations, paving the way toward what we call ``spatially liberated" PDE learning. To this end, we employ a machine learning approach to predict the evolution of scalar field systems expressed in the formalism of exterior calculus, which is coordinate-free and immediately generalizes to arbitrary dimensions by construction. We demonstrate the performance of this approach in the FitzHugh-Nagumo and Barkley reaction-diffusion models, as well as the Patlak-Keller-Segel model informed by in-situ chemotactic bacteria observations. We provide extensive numerical experiments that demonstrate that our approach allows for seamless transitions across various spatial contexts. We show that the field dynamics learned in one space can be used to make accurate predictions in other spaces with different dimensions, coordinate systems, boundary conditions, and curvatures.

Title: M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Authors: Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16565
Pdf URL: https://arxiv.org/pdf/2505.16565
Copy Paste: [[2505.16565]] M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion(https://arxiv.org/abs/2505.16565)
Keywords: diffusion
Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for discoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second placed method.

Title: Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images

Authors: Michael Neri, Sara Baldoni
Subjects: cs.CV, cs.CR, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2505.16650
Pdf URL: https://arxiv.org/pdf/2505.16650
Copy Paste: [[2505.16650]] Unsupervised Network Anomaly Detection with Autoencoders and Traffic Images(https://arxiv.org/abs/2505.16650)
Keywords: anomaly
Abstract: Due to the recent increase in the number of connected devices, the need to promptly detect security issues is emerging. Moreover, the high number of communication flows creates the necessity of processing huge amounts of data. Furthermore, the connected devices are heterogeneous in nature, having different computational capacities. For this reason, in this work we propose an image-based representation of network traffic which allows to realize a compact summary of the current network conditions with 1-second time windows. The proposed representation highlights the presence of anomalies thus reducing the need for complex processing architectures. Finally, we present an unsupervised learning approach which effectively detects the presence of anomalies. The code and the dataset are available at this https URL.

Title: Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Authors: Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zelin Peng, Zhiwei Yang, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16652
Pdf URL: https://arxiv.org/pdf/2505.16652
Copy Paste: [[2505.16652]] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding(https://arxiv.org/abs/2505.16652)
Keywords: in-context
Abstract: Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.

Title: SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images

Authors: Kaiyu Guo, Tan Pan, Chen Jiang, Zijian Wang, Brian C. Lovell, Limei Han, Yuan Cheng, Mahsa Baktashmotlagh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16659
Pdf URL: https://arxiv.org/pdf/2505.16659
Copy Paste: [[2505.16659]] SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images(https://arxiv.org/abs/2505.16659)
Keywords: anomaly
Abstract: Medical anomaly detection (AD) is crucial for early clinical intervention, yet it faces challenges due to limited access to high-quality medical imaging data, caused by privacy concerns and data silos. Few-shot learning has emerged as a promising approach to alleviate these limitations by leveraging the large-scale prior knowledge embedded in vision-language models (VLMs). Recent advancements in few-shot medical AD have treated normal and abnormal cases as a one-class classification problem, often overlooking the distinction among multiple anomaly categories. Thus, in this paper, we propose a framework tailored for few-shot medical anomaly detection in the scenario where the identification of multiple anomaly categories is required. To capture the detailed radiological signs of medical anomaly categories, our framework incorporates diverse textual descriptions for each category generated by a Large-Language model, under the assumption that different anomalies in medical images may share common radiological signs in each category. Specifically, we introduce SD-MAD, a two-stage Sign-Driven few-shot Multi-Anomaly Detection framework: (i) Radiological signs are aligned with anomaly categories by amplifying inter-anomaly discrepancy; (ii) Aligned signs are selected further to mitigate the effect of the under-fitting and uncertain-sample issue caused by limited medical data, employing an automatic sign selection strategy at inference. Moreover, we propose three protocols to comprehensively quantify the performance of multi-anomaly detection. Extensive experiments illustrate the effectiveness of our method.

Title: Quantum Feature Optimization for Enhanced Clustering of Blockchain Transaction Data

Authors: Yun-Cheng Tsai, Samuel Yen-Chi Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16672
Pdf URL: https://arxiv.org/pdf/2505.16672
Copy Paste: [[2505.16672]] Quantum Feature Optimization for Enhanced Clustering of Blockchain Transaction Data(https://arxiv.org/abs/2505.16672)
Keywords: self-supervised
Abstract: Blockchain transaction data exhibits high dimensionality, noise, and intricate feature entanglement, presenting significant challenges for traditional clustering algorithms. In this study, we conduct a comparative analysis of three clustering approaches: (1) Classical K-Means Clustering, applied to pre-processed feature representations; (2) Hybrid Clustering, wherein classical features are enhanced with quantum random features extracted using randomly initialized quantum neural networks (QNNs); and (3) Fully Quantum Clustering, where a QNN is trained in a self-supervised manner leveraging a SwAV-based loss function to optimize the feature space for clustering directly. The proposed experimental framework systematically investigates the impact of quantum circuit depth and the number of learned prototypes, demonstrating that even shallow quantum circuits can effectively extract meaningful non-linear representations, significantly improving clustering performance.

Title: Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge

Authors: Marcella Astrid, Abdelrahman Shabayek, Djamila Aouada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16674
Pdf URL: https://arxiv.org/pdf/2505.16674
Copy Paste: [[2505.16674]] Zero-Shot Anomaly Detection in Battery Thermal Images Using Visual Question Answering with Prior Knowledge(https://arxiv.org/abs/2505.16674)
Keywords: anomaly
Abstract: Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and textbased prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2) analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of finetuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained with the battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.

Title: On the Out-of-Distribution Generalization of Self-Supervised Learning

Authors: Wenwen Qiang, Jingyao Wang, Zeen Song, Jiangmeng Li, Changwen Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16675
Pdf URL: https://arxiv.org/pdf/2505.16675
Copy Paste: [[2505.16675]] On the Out-of-Distribution Generalization of Self-Supervised Learning(https://arxiv.org/abs/2505.16675)
Keywords: self-supervised
Abstract: In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and label variable is mutually independent. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we demonstrate the identifiability of the latent variable model and validate the effectiveness of the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate the effectiveness of the proposed sampling strategy.

Title: Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds

Authors: Jordan Dotzel, Tony Montes, Mohamed S. Abdelfattah, Zhiru Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16679
Pdf URL: https://arxiv.org/pdf/2505.16679
Copy Paste: [[2505.16679]] Semantic Compression of 3D Objects for Open and Collaborative Virtual Worlds(https://arxiv.org/abs/2505.16679)
Keywords: generative
Abstract: Traditional methods for 3D object compression operate only on structural information within the object vertices, polygons, and textures. These methods are effective at compression rates up to 10x for standard object sizes but quickly deteriorate at higher compression rates with texture artifacts, low-polygon counts, and mesh gaps. In contrast, semantic compression ignores structural information and operates directly on the core concepts to push to extreme levels of compression. In addition, it uses natural language as its storage format, which makes it natively human-readable and a natural fit for emerging applications built around large-scale, collaborative projects within augmented and virtual reality. It deprioritizes structural information like location, size, and orientation and predicts the missing information with state-of-the-art deep generative models. In this work, we construct a pipeline for 3D semantic compression from public generative models and explore the quality-compression frontier for 3D object compression. We apply this pipeline to achieve rates as high as 105x for 3D objects taken from the Objaverse dataset and show that semantic compression can outperform traditional methods in the important quality-preserving region around 100x compression.

Title: Learning Genomic Structure from $k$-mers

Authors: Filip Thor, Carl Nettelblad
Subjects: cs.LG, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.16680
Pdf URL: https://arxiv.org/pdf/2505.16680
Copy Paste: [[2505.16680]] Learning Genomic Structure from $k$-mers(https://arxiv.org/abs/2505.16680)
Keywords: self-supervised
Abstract: Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data using contrastive learning, in which an encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The sequential nature of genomic regions is preserved in the form of trajectories through this embedding space. Trained solely to reflect the structure of the genome, the resulting model provides a general representation of $k$-mer sequences, suitable for a range of downstream tasks involving read data. We apply our framework to learn the structure of the $E.\ coli$ genome, and demonstrate its use in simulated ancient DNA (aDNA) read mapping and identification of structural variations. Furthermore, we illustrate the potential of using this type of model for metagenomic species identification. We show how incorporating a domain-specific noise model can enhance embedding robustness, and how a supervised contrastive learning setting can be adopted when a linear reference genome is available, by introducing a distance thresholding parameter $\Gamma$. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly using specialized algorithms. Small prediction heads based on a pre-trained embedding are shown to perform on par with BWA-aln, the current gold standard approach for aDNA mapping, in terms of accuracy and runtime for short genomes. Given the method's favorable scaling properties with respect to total genome size, inference using our approach is highly promising for metagenomic applications and for mapping to genomes comparable in size to the human genome.

Title: One-Step Diffusion-Based Image Compression with Semantic Distillation

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.16687
Pdf URL: https://arxiv.org/pdf/2505.16687
Copy Paste: [[2505.16687]] One-Step Diffusion-Based Image Compression with Semantic Distillation(https://arxiv.org/abs/2505.16687)
Keywords: diffusion, generative
Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasing latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec -- that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.

Title: Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16694
Pdf URL: https://arxiv.org/pdf/2505.16694
Copy Paste: [[2505.16694]] Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence(https://arxiv.org/abs/2505.16694)
Keywords: in-context
Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phases change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis lead to a deeper understanding of the source of the transformer's ICL ability.

Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

Authors: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16707
Pdf URL: https://arxiv.org/pdf/2505.16707
Copy Paste: [[2505.16707]] KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models(https://arxiv.org/abs/2505.16707)
Keywords: generative
Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

Title: Advancing Brainwave Modeling with a Codebook-Based Foundation Model

Authors: Konstantinos Barmpas, Na Lee, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.16724
Pdf URL: https://arxiv.org/pdf/2505.16724
Copy Paste: [[2505.16724]] Advancing Brainwave Modeling with a Codebook-Based Foundation Model(https://arxiv.org/abs/2505.16724)
Keywords: foundation model
Abstract: Recent advances in large-scale pre-trained Electroencephalogram (EEG) models have shown great promise, driving progress in Brain-Computer Interfaces (BCIs) and healthcare applications. However, despite their success, many existing pre-trained models have struggled to fully capture the rich information content of neural oscillations, a limitation that fundamentally constrains their performance and generalizability across diverse BCI tasks. This limitation is frequently rooted in suboptimal architectural design choices which constrain their representational capacity. In this work, we introduce LaBraM++, an enhanced Large Brainwave Foundation Model (LBM) that incorporates principled improvements grounded in robust signal processing foundations. LaBraM++ demonstrates substantial gains across a variety of tasks, consistently outperforming its originally-based architecture and achieving competitive results when compared to other open-source LBMs. Its superior performance and training efficiency highlight its potential as a strong foundation for future advancements in LBMs.

Title: Masked Conditioning for Deep Generative Models

Authors: Phillip Mueller, Jannik Wiese, Sebastian Mueller, Lars Mikelsons
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16725
Pdf URL: https://arxiv.org/pdf/2505.16725
Copy Paste: [[2505.16725]] Masked Conditioning for Deep Generative Models(https://arxiv.org/abs/2505.16725)
Keywords: diffusion, foundation model, generative
Abstract: Datasets in engineering domains are often small, sparsely labeled, and contain numerical as well as categorical conditions. Additionally. computational resources are typically limited in practical applications which hinders the adoption of generative models for engineering tasks. We introduce a novel masked-conditioning approach, that enables generative models to work with sparse, mixed-type data. We mask conditions during training to simulate sparse conditions at inference time. For this purpose, we explore the use of various sparsity schedules that show different strengths and weaknesses. In addition, we introduce a flexible embedding that deals with categorical as well as numerical conditions. We integrate our method into an efficient variational autoencoder as well as a latent diffusion model and demonstrate the applicability of our approach on two engineering-related datasets of 2D point clouds and images. Finally, we show that small models trained on limited data can be coupled with large pretrained foundation models to improve generation quality while retaining the controllability induced by our conditioning scheme.

Title: Forward-only Diffusion Probabilistic Models

Authors: Ziwei Luo, Fredrik K. Gustafsson, Jens Sjölund, Thomas B. Schön
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16733
Pdf URL: https://arxiv.org/pdf/2505.16733
Copy Paste: [[2505.16733]] Forward-only Diffusion Probabilistic Models(https://arxiv.org/abs/2505.16733)
Keywords: diffusion, generative
Abstract: This work presents a forward-only diffusion (FoD) approach for generative modelling. In contrast to traditional diffusion models that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent linear stochastic differential equation that involves a mean-reverting term in both the drift and diffusion functions. This mean-reversion property guarantees the convergence to clean data, naturally simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling a few-step non-Markov chain sampling during inference. The proposed FoD model, despite its simplicity, achieves competitive performance on various image-conditioned (e.g., image restoration) and unconditional generation tasks, demonstrating its effectiveness in generative modelling. Our code is available at this https URL.

Title: Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis

Authors: Iván Matas, Carmen Serrano, Miguel Nogales, David Moreno, Lara Ferrándiz, Teresa Ojeda, Begoña Acha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16773
Pdf URL: https://arxiv.org/pdf/2505.16773
Copy Paste: [[2505.16773]] Mitigating Overfitting in Medical Imaging: Self-Supervised Pretraining vs. ImageNet Transfer Learning for Dermatological Diagnosis(https://arxiv.org/abs/2505.16773)
Keywords: self-supervised
Abstract: Deep learning has transformed computer vision but relies heavily on large labeled datasets and computational resources. Transfer learning, particularly fine-tuning pretrained models, offers a practical alternative; however, models pretrained on natural image datasets such as ImageNet may fail to capture domain-specific characteristics in medical imaging. This study introduces an unsupervised learning framework that extracts high-value dermatological features instead of relying solely on ImageNet-based pretraining. We employ a Variational Autoencoder (VAE) trained from scratch on a proprietary dermatological dataset, allowing the model to learn a structured and clinically relevant latent space. This self-supervised feature extractor is then compared to an ImageNet-pretrained backbone under identical classification conditions, highlighting the trade-offs between general-purpose and domain-specific pretraining. Our results reveal distinct learning patterns. The self-supervised model achieves a final validation loss of 0.110 (-33.33%), while the ImageNet-pretrained model stagnates at 0.100 (-16.67%), indicating overfitting. Accuracy trends confirm this: the self-supervised model improves from 45% to 65% (+44.44%) with a near-zero overfitting gap, whereas the ImageNet-pretrained model reaches 87% (+50.00%) but plateaus at 75% (+19.05%), with its overfitting gap increasing to +0.060. These findings suggest that while ImageNet pretraining accelerates convergence, it also amplifies overfitting on non-clinically relevant features. In contrast, self-supervised learning achieves steady improvements, stronger generalization, and superior adaptability, underscoring the importance of domain-specific feature extraction in medical imaging.

Title: Learning Flexible Forward Trajectories for Masked Molecular Diffusion

Authors: Hyunjin Seo, Taewon Kim, Sihyun Yu, SungSoo Ahn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16790
Pdf URL: https://arxiv.org/pdf/2505.16790
Copy Paste: [[2505.16790]] Learning Flexible Forward Trajectories for Masked Molecular Diffusion(https://arxiv.org/abs/2505.16790)
Keywords: diffusion
Abstract: Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standards MDMs severely degrades the performance. We identify the critical cause of this issue as a state-clashing problem-where the forward diffusion of distinct molecules collapse into a common state, resulting in a mixture of reconstruction targets that cannot be learned using typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD) that orchestrates per-element corruption trajectories to avoid collision between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%, Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.

Title: Cohort-Based Active Modality Acquisition

Authors: Tillmann Rheude, Roland Eils, Benjamin Wild
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16791
Pdf URL: https://arxiv.org/pdf/2505.16791
Copy Paste: [[2505.16791]] Cohort-Based Active Modality Acquisition(https://arxiv.org/abs/2505.16791)
Keywords: generative
Abstract: Real-world machine learning applications often involve data from multiple modalities that must be integrated effectively to make robust predictions. However, in many practical settings, not all modalities are available for every sample, and acquiring additional modalities can be costly. This raises the question: which samples should be prioritized for additional modality acquisition when resources are limited? While prior work has explored individual-level acquisition strategies and training-time active learning paradigms, test-time and cohort-based acquisition remain underexplored despite their importance in many real-world settings. We introduce Cohort-based Active Modality Acquisition (CAMA), a novel test-time setting to formalize the challenge of selecting which samples should receive additional modalities. We derive acquisition strategies that leverage a combination of generative imputation and discriminative modeling to estimate the expected benefit of acquiring missing modalities based on common evaluation metrics. We also introduce upper-bound heuristics that provide performance ceilings to benchmark acquisition strategies. Experiments on common multimodal datasets demonstrate that our proposed imputation-based strategies can more effectively guide the acquisition of new samples in comparison to those relying solely on unimodal information, entropy guidance, and random selections. Our work provides an effective solution for optimizing modality acquisition at the cohort level, enabling better utilization of resources in constrained settings.

Title: REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Authors: Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16792
Pdf URL: https://arxiv.org/pdf/2505.16792
Copy Paste: [[2505.16792]] REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training(https://arxiv.org/abs/2505.16792)
Keywords: diffusion, generative
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss, once a simple trigger such as a fixed iteration is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256X256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28X reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, demonstrating to be a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at this https URL .

Title: REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16793
Pdf URL: https://arxiv.org/pdf/2505.16793
Copy Paste: [[2505.16793]] REOBench: Benchmarking Robustness of Earth Observation Foundation Models(https://arxiv.org/abs/2505.16793)
Keywords: foundation model
Abstract: Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.

Title: Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Authors: Changbing Yang, Garrett Nicolai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16800
Pdf URL: https://arxiv.org/pdf/2505.16800
Copy Paste: [[2505.16800]] Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation(https://arxiv.org/abs/2505.16800)
Keywords: in-context
Abstract: We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.

Title: LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Authors: Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16839
Pdf URL: https://arxiv.org/pdf/2505.16839
Copy Paste: [[2505.16839]] LaViDa: A Large Diffusion Language Model for Multimodal Understanding(https://arxiv.org/abs/2505.16839)
Keywords: diffusion
Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Title: Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Authors: Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16862
Pdf URL: https://arxiv.org/pdf/2505.16862
Copy Paste: [[2505.16862]] Conditional Panoramic Image Generation via Masked Autoregressive Modeling(https://arxiv.org/abs/2505.16862)
Keywords: diffusion, generative
Abstract: Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.

Title: Training-Free Efficient Video Generation via Dynamic Token Carving

Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16864
Pdf URL: https://arxiv.org/pdf/2505.16864
Copy Paste: [[2505.16864]] Training-Free Efficient Video Generation via Dynamic Token Carving(https://arxiv.org/abs/2505.16864)
Keywords: diffusion
Abstract: Despite the remarkable generation quality of video Diffusion Transformer (DiT) models, their practical deployment is severely hindered by extensive computational requirements. This inefficiency stems from two key challenges: the quadratic complexity of self-attention with respect to token length and the multi-step nature of diffusion models. To address these limitations, we present Jenga, a novel inference pipeline that combines dynamic attention carving with progressive resolution generation. Our approach leverages two key insights: (1) early denoising steps do not require high-resolution latents, and (2) later steps do not require dense attention. Jenga introduces a block-wise attention mechanism that dynamically selects relevant token interactions using 3D space-filling curves, alongside a progressive resolution strategy that gradually increases latent resolution during generation. Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83$\times$ speedup with 0.01\% performance drop on VBench). As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware by reducing inference time from minutes to seconds -- without requiring model retraining. Code: this https URL

Title: A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams

Authors: Mohammed Al-Qudah, Fadi AlMahamid
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16872
Pdf URL: https://arxiv.org/pdf/2505.16872
Copy Paste: [[2505.16872]] A Multi-Step Comparative Framework for Anomaly Detection in IoT Data Streams(https://arxiv.org/abs/2505.16872)
Keywords: anomaly
Abstract: The rapid expansion of Internet of Things (IoT) devices has introduced critical security challenges, underscoring the need for accurate anomaly detection. Although numerous studies have proposed machine learning (ML) methods for this purpose, limited research systematically examines how different preprocessing steps--normalization, transformation, and feature selection--interact with distinct model architectures. To address this gap, this paper presents a multi-step evaluation framework assessing the combined impact of preprocessing choices on three ML algorithms: RNN-LSTM, autoencoder neural networks (ANN), and Gradient Boosting (GBoosting). Experiments on the IoTID20 dataset shows that GBoosting consistently delivers superior accuracy across preprocessing configurations, while RNN-LSTM shows notable gains with z-score normalization and autoencoders excel in recall, making them well-suited for unsupervised scenarios. By offering a structured analysis of preprocessing decisions and their interplay with various ML techniques, the proposed framework provides actionable guidance to enhance anomaly detection performance in IoT environments.

Title: T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

Authors: Zhehao Huang, Yuhang Liu, Yixin Lou, Zhengbao He, Mingzhen He, Wenxing Zhou, Tao Li, Kehan Li, Zeyi Huang, Xiaolin Huang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16875
Pdf URL: https://arxiv.org/pdf/2505.16875
Copy Paste: [[2505.16875]] T2I-ConBench: Text-to-Image Benchmark for Continual Post-training(https://arxiv.org/abs/2505.16875)
Keywords: diffusion
Abstract: Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.

Title: CASTILLO: Characterizing Response Length Distributions of Large Language Models

Authors: Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16881
Pdf URL: https://arxiv.org/pdf/2505.16881
Copy Paste: [[2505.16881]] CASTILLO: Characterizing Response Length Distributions of Large Language Models(https://arxiv.org/abs/2505.16881)
Keywords: generative
Abstract: Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.

Title: Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs

Authors: Zeyu Wei, Shuo Wang, Xiaohui Rong, Xuemin Liu, He Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16894
Pdf URL: https://arxiv.org/pdf/2505.16894
Copy Paste: [[2505.16894]] Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs(https://arxiv.org/abs/2505.16894)
Keywords: diffusion
Abstract: Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.

Title: Unsupervised Prompting for Graph Neural Networks

Authors: Peyman Baghershahi, Sourav Medya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16903
Pdf URL: https://arxiv.org/pdf/2505.16903
Copy Paste: [[2505.16903]] Unsupervised Prompting for Graph Neural Networks(https://arxiv.org/abs/2505.16903)
Keywords: in-context
Abstract: Prompt tuning methods for Graph Neural Networks (GNNs) have become popular to address the semantic gap between pre-training and fine-tuning steps. However, existing GNN prompting methods rely on labeled data and involve lightweight fine-tuning for downstream tasks. Meanwhile, in-context learning methods for Large Language Models (LLMs) have shown promising performance with no parameter updating and no or minimal labeled data. Inspired by these approaches, in this work, we first introduce a challenging problem setup to evaluate GNN prompting methods. This setup encourages a prompting function to enhance a pre-trained GNN's generalization to a target dataset under covariate shift without updating the GNN's parameters and with no labeled data. Next, we propose a fully unsupervised prompting method based on consistency regularization through pseudo-labeling. We use two regularization techniques to align the prompted graphs' distribution with the original data and reduce biased predictions. Through extensive experiments under our problem setting, we demonstrate that our unsupervised approach outperforms the state-of-the-art prompting methods that have access to labels.

Title: Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Authors: Xuankun Rong, Wenke Huang, Jian Liang, Jinhe Bi, Xun Xiao, Yiming Li, Bo Du, Mang Ye
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16916
Pdf URL: https://arxiv.org/pdf/2505.16916
Copy Paste: [[2505.16916]] Backdoor Cleaning without External Guidance in MLLM Fine-tuning(https://arxiv.org/abs/2505.16916)
Keywords: self-supervised
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in fine-tuning-as-a-service (FTaaS) settings, where user-submitted datasets adapt general-purpose models to downstream tasks. This flexibility, however, introduces serious security risks, as malicious fine-tuning can implant backdoors into MLLMs with minimal effort. In this paper, we observe that backdoor triggers systematically disrupt cross-modal processing by causing abnormal attention concentration on non-semantic regions--a phenomenon we term attention collapse. Based on this insight, we propose Believe Your Eyes (BYE), a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. BYE operates via a three-stage pipeline: (1) extracting attention maps using the fine-tuned model, (2) computing entropy scores and profiling sensitive layers via bimodal separation, and (3) performing unsupervised clustering to remove suspicious samples. Unlike prior defenses, BYE equires no clean supervision, auxiliary labels, or model modifications. Extensive experiments across various datasets, models, and diverse trigger types validate BYE's effectiveness: it achieves near-zero attack success rates while maintaining clean-task performance, offering a robust and generalizable solution against backdoor threats in MLLMs.

Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Authors: Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16933
Pdf URL: https://arxiv.org/pdf/2505.16933
Copy Paste: [[2505.16933]] LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning(https://arxiv.org/abs/2505.16933)
Keywords: diffusion
Abstract: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: this https URL.

Title: In-Context Watermarks for Large Language Models

Authors: Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16934
Pdf URL: https://arxiv.org/pdf/2505.16934
Copy Paste: [[2505.16934]] In-Context Watermarks for Large Language Models(https://arxiv.org/abs/2505.16934)
Keywords: in-context
Abstract: The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

Title: SPAR: Self-supervised Placement-Aware Representation Learning for Multi-Node IoT Systems

Authors: Yizhuo Chen, Tianchen Wang, You Lyu, Yanlan Hu, Jinyang Li, Tomoyoshi Kimura, Hongjue Zhao, Yigong Hu, Denizhan Kara, Tarek Abdelzaher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.16936
Pdf URL: https://arxiv.org/pdf/2505.16936
Copy Paste: [[2505.16936]] SPAR: Self-supervised Placement-Aware Representation Learning for Multi-Node IoT Systems(https://arxiv.org/abs/2505.16936)
Keywords: self-supervised
Abstract: This work develops the underpinnings of self-supervised placement-aware representation learning given spatially-distributed (multi-view and multimodal) sensor observations, motivated by the need to represent external environmental state in multi-sensor IoT systems in a manner that correctly distills spatial phenomena from the distributed multi-vantage observations. The objective of sensing in IoT systems is, in general, to collectively represent an externally observed environment given multiple vantage points from which sensory observations occur. Pretraining of models that help interpret sensor data must therefore encode the relation between signals observed by sensors and the observers' vantage points in order to attain a representation that encodes the observed spatial phenomena in a manner informed by the specific placement of the measuring instruments, while allowing arbitrary placement. The work significantly advances self-supervised model pretraining from IoT signals beyond current solutions that often overlook the distinctive spatial nature of IoT data. Our framework explicitly learns the dependencies between measurements and geometric observer layouts and structural characteristics, guided by a core design principle: the duality between signals and observer positions. We further provide theoretical analyses from the perspectives of information theory and occlusion-invariant representation learning to offer insight into the rationale behind our design. Experiments on three real-world datasets--covering vehicle monitoring, human activity recognition, and earthquake localization--demonstrate the superior generalizability and robustness of our method across diverse modalities, sensor placements, application-level inference tasks, and spatial scales.

Title: FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records

Authors: Chao Pang, Vincent Jeanselme, Young Sang Choi, Xinzhuo Jiang, Zilin Jing, Aparajita Kashyap, Yuta Kobayashi, Yanwei Li, Florent Pollet, Karthik Natarajan, Shalmali Joshi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16941
Pdf URL: https://arxiv.org/pdf/2505.16941
Copy Paste: [[2505.16941]] FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records(https://arxiv.org/abs/2505.16941)
Keywords: foundation model
Abstract: Foundation models hold significant promise in healthcare, given their capacity to extract meaningful representations independent of downstream tasks. This property has enabled state-of-the-art performance across several clinical applications trained on structured electronic health record (EHR) data, even in settings with limited labeled data, a prevalent challenge in healthcare. However, there is little consensus on these models' potential for clinical utility due to the lack of desiderata of comprehensive and meaningful tasks and sufficiently diverse evaluations to characterize the benefit over conventional supervised learning. To address this gap, we propose a suite of clinically meaningful tasks spanning patient outcomes, early prediction of acute and chronic conditions, including desiderata for robust evaluations. We evaluate state-of-the-art foundation models on EHR data consisting of 5 million patients from Columbia University Irving Medical Center (CUMC), a large urban academic medical center in New York City, across 14 clinically relevant tasks. We measure overall accuracy, calibration, and subpopulation performance to surface tradeoffs based on the choice of pre-training, tokenization, and data representation strategies. Our study aims to advance the empirical evaluation of structured EHR foundation models and guide the development of future healthcare foundation models.

Title: Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models

Authors: Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.16959
Pdf URL: https://arxiv.org/pdf/2505.16959
Copy Paste: [[2505.16959]] Bigger Isn't Always Memorizing: Early Stopping Overparameterized Diffusion Models(https://arxiv.org/abs/2505.16959)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models have become a cornerstone of modern generative AI, yet the mechanisms underlying their generalization remain poorly understood. In fact, if these models were perfectly minimizing their training loss, they would just generate data belonging to their training set, i.e., memorize, as empirically found in the overparameterized regime. We revisit this view by showing that, in highly overparameterized diffusion models, generalization in natural data domains is progressively achieved during training before the onset of memorization. Our results, ranging from image to language diffusion models, systematically support the empirical law that memorization time is proportional to the dataset size. Generalization vs. memorization is then best understood as a competition between time scales. We show that this phenomenology is recovered in diffusion models learning a simple probabilistic context-free grammar with random rules, where generalization corresponds to the hierarchical acquisition of deeper grammar rules as training time grows, and the generalization cost of early stopping can be characterized. We summarize these results in a phase diagram. Overall, our results support that a principled early-stopping criterion - scaling with dataset size - can effectively optimize generalization while avoiding memorization, with direct implications for hyperparameter transfer and privacy-sensitive applications.

Title: Creatively Upscaling Images with Global-Regional Priors

Authors: Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16976
Pdf URL: https://arxiv.org/pdf/2505.16976
Copy Paste: [[2505.16976]] Creatively Upscaling Images with Global-Regional Priors(https://arxiv.org/abs/2505.16976)
Keywords: diffusion
Abstract: Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 X 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and more creative regional details.

Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

Authors: Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16977
Pdf URL: https://arxiv.org/pdf/2505.16977
Copy Paste: [[2505.16977]] Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On(https://arxiv.org/abs/2505.16977)
Keywords: diffusion
Abstract: Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: this https URL.

Title: Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

Authors: Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16980
Pdf URL: https://arxiv.org/pdf/2505.16980
Copy Paste: [[2505.16980]] Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction(https://arxiv.org/abs/2505.16980)
Keywords: diffusion
Abstract: Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves VFID score of 0.506 on VVT dataset, leading to 60.5% improvement over the state-of-the-art GPD-VVTO approach.

Title: Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Authors: Runpeng Yu, Xinyin Ma, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.16990
Pdf URL: https://arxiv.org/pdf/2505.16990
Copy Paste: [[2505.16990]] Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding(https://arxiv.org/abs/2505.16990)
Keywords: diffusion
Abstract: In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple is even only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at this https URL.

Title: Guided Diffusion Sampling on Function Spaces with Applications to PDEs

Authors: Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Subjects: cs.LG, cs.AI, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17004
Pdf URL: https://arxiv.org/pdf/2505.17004
Copy Paste: [[2505.17004]] Guided Diffusion Sampling on Function Spaces with Applications to PDEs(https://arxiv.org/abs/2505.17004)
Keywords: diffusion
Abstract: We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Hilbert spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at this https URL

Title: CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

Authors: Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.17006
Pdf URL: https://arxiv.org/pdf/2505.17006
Copy Paste: [[2505.17006]] CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning(https://arxiv.org/abs/2505.17006)
Keywords: diffusion
Abstract: Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs a early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning problem. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we also introduce two new metrics for more robustly and affordably evaluating motion and guiding motion learning methods development: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.

Title: Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Authors: Tim Genewein, Kevin Wenliang Li, Jordi Grau-Moya, Anian Ruoss, Laurent Orseau, Marcus Hutter
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17010
Pdf URL: https://arxiv.org/pdf/2505.17010
Copy Paste: [[2505.17010]] Understanding Prompt Tuning and In-Context Learning via Meta-Learning(https://arxiv.org/abs/2505.17010)
Keywords: in-context
Abstract: Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.

Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Authors: Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.17011
Pdf URL: https://arxiv.org/pdf/2505.17011
Copy Paste: [[2505.17011]] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space(https://arxiv.org/abs/2505.17011)
Keywords: generative
Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.

Title: When Are Concepts Erased From Diffusion Models?

Authors: Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17013
Pdf URL: https://arxiv.org/pdf/2505.17013
Copy Paste: [[2505.17013]] When Are Concepts Erased From Diffusion Models?(https://arxiv.org/abs/2505.17013)
Keywords: diffusion
Abstract: Concept erasure, the ability to selectively prevent a model from generating specific concepts, has attracted growing interest, with various approaches emerging to address the challenge. However, it remains unclear how thoroughly these methods erase the target concept. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) reducing the likelihood of generating the target concept, and (ii) interfering with the model's internal guidance mechanisms. To thoroughly assess whether a concept has been truly erased from the model, we introduce a suite of independent evaluations. Our evaluation framework includes adversarial attacks, novel probing techniques, and analysis of the model's alternative generations in place of the erased concept. Our results shed light on the tension between minimizing side effects and maintaining robustness to adversarial prompts. Broadly, our work underlines the importance of comprehensive evaluation for erasure in diffusion models.