2026-03-11

Title: Are Expressive Encoders Necessary for Discrete Graph Generation?

Authors: Jay Revolinsky, Harry Shomer, Jiliang Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08825
Pdf URL: https://arxiv.org/pdf/2603.08825
Copy Paste: [[2603.08825]] Are Expressive Encoders Necessary for Discrete Graph Generation?(https://arxiv.org/abs/2603.08825)
Keywords: diffusion
Abstract: Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% Validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph structures. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.

Title: HECTOR: Hybrid Editable Compositional Object References for Video Generation

Authors: Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08850
Pdf URL: https://arxiv.org/pdf/2603.08850
Copy Paste: [[2603.08850]] HECTOR: Hybrid Editable Compositional Object References for Video Generation(https://arxiv.org/abs/2603.08850)
Keywords: generative
Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

Title: NetDiffuser: Deceiving DNN-Based Network Attack Detection Systems with Diffusion-Generated Adversarial Traffic

Authors: Pratyay Kumar, Abu Saleh Md Tayeen, Satyajayant Misra, Huiping Cao, Jiefei Liu, Qixu Gong, Jayashree Harikumar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08901
Pdf URL: https://arxiv.org/pdf/2603.08901
Copy Paste: [[2603.08901]] NetDiffuser: Deceiving DNN-Based Network Attack Detection Systems with Diffusion-Generated Adversarial Traffic(https://arxiv.org/abs/2603.08901)
Keywords: diffusion
Abstract: Deep learning (DL)-based Network Intrusion Detection Systems (NIDS) have demonstrated great promise in detecting malicious network traffic. However, they face significant security risks due to their vulnerability to adversarial examples (AEs). Most existing adversarial attacks maliciously perturb data to maximize misclassification errors. Among AEs, natural adversarial examples (NAEs) are particularly difficult to detect because they closely resemble real data, making them challenging for both humans and machine learning models to distinguish from legitimate inputs. Creating NAEs is crucial for testing and strengthening NIDS defenses. This paper proposes NetDiffuser, a novel framework for generating NAEs capable of deceiving NIDS. NetDiffuser consists of two novel components. First, a new feature categorization algorithm is designed to identify relatively independent features in network traffic. Perturbing these features minimizes changes while preserving network flow validity. The second component is a novel application of diffusion models to inject semantically consistent perturbations for generating NAEs. NetDiffuser performance was extensively evaluated using three benchmark NIDS datasets across various model architectures and state-of-the-art adversarial detectors. Our experimental results show that NetDiffuser achieves up to a 29.93% higher attack success rate and reduces AE detection performance by at least 0.267 (in some cases up to 0.534) in the Area under the Receiver Operating Characteristic Curve (AUC-ROC) score compared to the baseline attacks.

Title: TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers

Authors: Yihua Liu, Fanjiang Ye, Bowen Lin, Rongyu Fang, Chengming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08928
Pdf URL: https://arxiv.org/pdf/2603.08928
Copy Paste: [[2603.08928]] TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers(https://arxiv.org/abs/2603.08928)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) face challenges when generating images at resolutions higher than the training resolution, in particular structural degradation caused by attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.

Title: Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

Authors: Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario, Lars Lundqvist, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08930
Pdf URL: https://arxiv.org/pdf/2603.08930
Copy Paste: [[2603.08930]] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning(https://arxiv.org/abs/2603.08930)
Keywords: foundation model, in-context
Abstract: This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs -- Gemma 3 and Qwen3-VL -- to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.

Title: Semantic Level of Detail: Multi-Scale Knowledge Representation via Heat Kernel Diffusion on Hyperbolic Manifolds

Authors: Edward Izgorodin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08965
Pdf URL: https://arxiv.org/pdf/2603.08965
Copy Paste: [[2603.08965]] Semantic Level of Detail: Multi-Scale Knowledge Representation via Heat Kernel Diffusion on Hyperbolic Manifolds(https://arxiv.org/abs/2603.08965)
Keywords: diffusion
Abstract: AI memory systems increasingly organize knowledge into graph structures -- knowledge graphs, entity relations, community hierarchies -- yet lack a principled mechanism for continuous resolution control: where do the qualitative boundaries between abstraction levels lie, and how should an agent navigate them? We introduce Semantic Level of Detail (SLoD), a framework that answers both questions by defining a continuous zoom operator via heat kernel diffusion on the Poincaré ball $\mathbb{B}^d$. At coarse scales ($\sigma \to \infty$), diffusion aggregates embeddings into high-level summaries; at fine scales ($\sigma \to 0$), local semantic detail is preserved. We prove hierarchical coherence with bounded approximation error $O(\sigma)$ and $(1+\varepsilon)$ distortion for tree-structured hierarchies under Sarkar embedding. Crucially, we show that spectral gaps in the graph Laplacian induce emergent scale boundaries -- scales where the representation undergoes qualitative transitions -- which can be detected automatically without manual resolution parameters. On synthetic hierarchies (HSBM), our boundary scanner recovers planted levels with ARI up to 1.00, with detection degrading gracefully near the information-theoretic Kesten-Stigum threshold. On the full WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth ($\tau = 0.79$), demonstrating that the method discovers meaningful abstraction levels in real-world knowledge graphs without supervision.

Title: SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

Authors: Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08982
Pdf URL: https://arxiv.org/pdf/2603.08982
Copy Paste: [[2603.08982]] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing(https://arxiv.org/abs/2603.08982)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
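
Illustrative note (not from the paper): the centroid-compensation idea in this abstract can be sketched in a few lines. The sketch below is an assumption about one plausible form of it, not the SVG-EAR implementation: a skipped block of n keys is replaced by its key centroid, weighted by n, and paired with its value centroid. All function names are hypothetical, there is no 1/sqrt(d) scaling, and the error-aware routing probe is not modeled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def exact_attention(q, keys, values):
    """Plain softmax attention for one query (no scaling, for brevity)."""
    weights = [math.exp(dot(q, k)) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def compensated_attention(q, blocks, computed):
    """Attention where blocks outside `computed` are approximated.

    blocks: list of (keys, values) pairs; computed: set of block indices
    evaluated exactly. A skipped block contributes n * exp(q . k_bar) to the
    softmax denominator and that weight times v_bar to the numerator
    (hypothetical approximation chosen for this sketch).
    """
    dim = len(blocks[0][1][0])
    num = [0.0] * dim
    den = 0.0
    for idx, (keys, values) in enumerate(blocks):
        if idx in computed:
            pairs = [(math.exp(dot(q, k)), v) for k, v in zip(keys, values)]
        else:
            w = len(keys) * math.exp(dot(q, centroid(keys)))
            pairs = [(w, centroid(values))]
        for w, v in pairs:
            den += w
            for i in range(dim):
                num[i] += w * v[i]
    return [x / den for x in num]

# Block 1 has identical keys and values, so its centroid summary is lossless.
blocks = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.5, 0.5], [0.5, 0.5]], [[5.0, 6.0], [5.0, 6.0]]),
]
q = [0.2, -0.1]
all_keys = [k for keys, _ in blocks for k in keys]
all_values = [v for _, values in blocks for v in values]
exact = exact_attention(q, all_keys, all_values)
approx = compensated_attention(q, blocks, computed={0})
```

When the keys inside a skipped block are only approximately equal, `approx` drifts from `exact`; that gap is the kind of per-block error the abstract says the routing probe estimates.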

Title: Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning

Authors: Bolutife Atoki, Iuliia Tkachenko, Bertrand Kerautret, Carlos Crispim-Junior
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08998
Pdf URL: https://arxiv.org/pdf/2603.08998
Copy Paste: [[2603.08998]] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning(https://arxiv.org/abs/2603.08998)
Keywords: diffusion, generative
Abstract: Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.

Title: Security Considerations for Multi-agent Systems

Authors: Tam Nguyen, Moses Ndebugre, Dheeraj Arremsetty
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09002
Pdf URL: https://arxiv.org/pdf/2603.09002
Copy Paste: [[2603.09002]] Security Considerations for Multi-agent Systems(https://arxiv.org/abs/2603.09002)
Keywords: generative
Abstract: Multi-agent artificial intelligence systems or MAS are systems of autonomous agents that exercise delegated tool authority, share persistent memory, and coordinate via inter-agent communication. MAS introduces qualitatively distinct security vulnerabilities from those documented for singular AI models. Existing security and governance frameworks were not designed for these emerging attack surfaces. This study systematically characterizes the threat landscape of MAS and quantitatively evaluates 16 security frameworks for AI against it. A four-phase methodology is proposed: constructing a deep technical knowledge base of production multi-agent architectures; conducting generative AI-assisted threat modeling scoped to MAS cybersecurity risks and validated by domain experts; structuring survey plans at individual-threat granularity; and scoring each framework on a three-point scale against the cybersecurity risks. The risks were organized into 193 distinct main threat items across nine risk categories. The expected minimal average score is 2. No reviewed framework achieves majority coverage of any single category. Non-Determinism (mean score 1.231 across all 16 frameworks) and Data Leakage (1.340) are the most under-addressed domains. The OWASP Agentic Security Initiative leads overall at 65.3% coverage and in the design phase; the CDAO Generative AI Responsible AI Toolkit leads in development and operational coverage. These results provide the first empirical cross-framework comparison for MAS security and offer evidence-based guidance for framework selection.

Title: Spectral-Structured Diffusion for Single-Image Rain Removal

Authors: Yucheng Xing, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09054
Pdf URL: https://arxiv.org/pdf/2603.09054
Copy Paste: [[2603.09054]] Spectral-Structured Diffusion for Single-Image Rain Removal(https://arxiv.org/abs/2603.09054)
Keywords: diffusion
Abstract: Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
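
Illustrative note (not from the paper): the abstract relies on the convolution theorem, which says that a circular convolution in the signal domain equals an element-wise product in the frequency domain. The minimal sketch below checks this on a 1D signal with a naive DFT; the function names and sample data are hypothetical, and it says nothing about how the paper's full-product U-Net is actually built.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def circular_conv(x, h):
    """Direct circular convolution: y[i] = sum_j x[j] * h[(i - j) mod n]."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]

def spectral_product(x, h):
    """Same result via an element-wise product of the two spectra."""
    spectrum = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(spectrum, inverse=True)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 0.0, 2.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
direct = circular_conv(signal, kernel)
via_spectrum = spectral_product(signal, kernel)
```

The two results agree up to floating-point error. With a fast Fourier transform in place of the naive DFT, the spectral route costs O(n log n) rather than O(n^2) for kernels as large as the signal.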

Title: Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

Authors: Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09094
Pdf URL: https://arxiv.org/pdf/2603.09094
Copy Paste: [[2603.09094]] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation(https://arxiv.org/abs/2603.09094)
Keywords: diffusion
Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.

Title: Training-free Motion Factorization for Compositional Video Generation

Authors: Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09104
Pdf URL: https://arxiv.org/pdf/2603.09104
Copy Paste: [[2603.09104]] Training-free Motion Factorization for Compositional Video Generation(https://arxiv.org/abs/2603.09104)
Keywords: diffusion
Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.

Title: QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model

Authors: Junjie Yin, Jiaju Li, Hanfa Xing
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09125
Pdf URL: https://arxiv.org/pdf/2603.09125
Copy Paste: [[2603.09125]] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model(https://arxiv.org/abs/2603.09125)
Keywords: diffusion
Abstract: Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at this https URL.

Title: Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning

Authors: Lina Berrayana, Ahmed Heakl, Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09184
Pdf URL: https://arxiv.org/pdf/2603.09184
Copy Paste: [[2603.09184]] Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning(https://arxiv.org/abs/2603.09184)
Keywords: diffusion
Abstract: Most multi-agent systems rely exclusively on autoregressive language models (ARMs) that are based on sequential generation. Although effective for fluent text, ARMs limit global reasoning and plan revision. On the other hand, Discrete Diffusion Language Models (DDLMs) enable non-sequential, globally revisable generation and have shown strong planning capabilities, but their limited text fluency hinders direct collaboration with ARMs. We introduce Latent-DARM, a latent-space communication framework bridging DDLM (planners) and ARM (executors), maximizing collaborative benefits. Across mathematical, scientific, and commonsense reasoning benchmarks, Latent-DARM outperforms text-based interfaces on average, improving accuracy from 27.0% to 36.0% on DART-5 and from 0.0% to 14.0% on AIME2024. Latent-DARM approaches the results of state-of-the-art reasoning models while using less than 2.2% of its token budget. This work advances multi-agent collaboration among agents with heterogeneous models.

Title: TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

Authors: Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09217
Pdf URL: https://arxiv.org/pdf/2603.09217
Copy Paste: [[2603.09217]] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy(https://arxiv.org/abs/2603.09217)
Keywords: foundation model
Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $\beta_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.

Title: UniField: A Unified Field-Aware MRI Enhancement Framework

Authors: Yiyang Lin, Chenhui Wang, Zhihao Peng, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09223
Pdf URL: https://arxiv.org/pdf/2603.09223
Copy Paste: [[2603.09223]] UniField: A Unified Field-Aware MRI Enhancement Framework(https://arxiv.org/abs/2603.09223)
Keywords: foundation model
Abstract: Magnetic Resonance Imaging (MRI) field-strength enhancement holds immense value for both clinical diagnostics and advanced research. However, existing methods typically focus on isolated enhancement tasks, such as specific 64mT-to-3T or 3T-to-7T transitions using limited subject cohorts, thereby failing to exploit the shared degradation patterns inherent across different field strengths and severely restricting model generalization. To address this challenge, we propose UniField, a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning by exploiting these shared degradation characteristics. Specifically, our main contributions are threefold. Firstly, to overcome MRI data scarcity and capture continuous anatomical structures, UniField departs from conventional methods that treat 3D MRI volumes as independent 2D slices. Instead, we directly exploit comprehensive 3D volumetric information by leveraging pre-trained 3D foundation models, thereby embedding generalized and robust structural representations to significantly boost enhancement performance. In addition, to mitigate the spectral bias of mainstream flow-matching models that often over-smooth high-frequency details, we explicitly incorporate the physical mechanisms of magnetic fields to introduce a Field-Aware Spectral Rectification Mechanism (FASRM), tailoring customized spectral corrections to distinct field strengths. Finally, to resolve the fundamental data bottleneck, we organize and publicly release a comprehensive paired multi-field MRI dataset, which is an order of magnitude larger than existing datasets. Extensive experiments demonstrate our method's superiority over state-of-the-art approaches, achieving an average improvement of approximately 1.81 dB in PSNR and 9.47% in SSIM. Code will be released upon acceptance.

Title: BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off

Authors: Shuang Liu, Ao Yu, Linkang Cheng, Xiwen Huang, Li Zhao, Junhui Liu, Zhiting Lin, Yu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09236
Pdf URL: https://arxiv.org/pdf/2603.09236
Copy Paste: [[2603.09236]] BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off(https://arxiv.org/abs/2603.09236)
Keywords: diffusion
Abstract: Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.

Title: RAE-NWM: Navigation World Model in Dense Visual Representation Space

Authors: Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.09241
Pdf URL: https://arxiv.org/pdf/2603.09241
Copy Paste: [[2603.09241]] RAE-NWM: Navigation World Model in Dense Visual Representation Space(https://arxiv.org/abs/2603.09241)
Keywords: diffusion
Abstract: Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.

Title: When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

Authors: Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09242
Pdf URL: https://arxiv.org/pdf/2603.09242
Copy Paste: [[2603.09242]] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2603.09242)
Keywords: foundation model, generative
Abstract: AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
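The "project them out" step in this abstract can be illustrated with a small sketch. It is not the authors' GSD module: the abstract only says that semantic directions come from batch-wise statistics, so taking the batch mean of frozen-encoder features as the single direction is an assumption made here, and `batch_mean_direction` and `project_out` are hypothetical names.

```python
import math

def batch_mean_direction(features):
    """Unit vector along the batch mean of a list of feature vectors (assumed statistic)."""
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    if norm == 0.0:
        raise ValueError("batch mean is the zero vector; no direction to remove")
    return [m / norm for m in mean]

def project_out(vector, direction):
    """Remove the component of `vector` that lies along the unit vector `direction`."""
    dot = sum(v * d for v, d in zip(vector, direction))
    return [v - dot * d for v, d in zip(vector, direction)]

# Toy features standing in for the frozen (semantic guide) encoder output.
semantic_feats = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
direction = batch_mean_direction(semantic_feats)

# Toy feature standing in for the trainable (artifact detector) encoder output.
artifact_feat = [3.0, 1.0, -2.0]
residual = project_out(artifact_feat, direction)
```

After the projection, `residual` has no component along `direction`, so anything a downstream classifier reads from it must come from the remaining dimensions.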

Title: From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Authors: Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09283
Pdf URL: https://arxiv.org/pdf/2603.09283
Copy Paste: [[2603.09283]] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions(https://arxiv.org/abs/2603.09283)
Keywords: diffusion, self-supervised
Abstract: Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, where Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
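The windowed union idea behind MUSE can be sketched as follows. This is a minimal illustration, not the authors' implementation: non-overlapping windows, the window size, and the nested-list mask layout are assumptions, and `windowed_mask_union` is a hypothetical name.

```python
def windowed_mask_union(masks, window):
    """Downsample per-frame binary masks in time by OR-ing the frames in each window.

    `masks` is a list of frames; each frame is a list of rows of 0/1 values.
    Windows are non-overlapping (an assumption); a short final window is kept.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled = []
    for start in range(0, len(masks), window):
        chunk = masks[start:start + window]
        height, width = len(chunk[0]), len(chunk[0][0])
        union = [[int(any(m[r][c] for m in chunk)) for c in range(width)]
                 for r in range(height)]
        pooled.append(union)
    return pooled

# A one-row mask whose target jumps between frames (abrupt motion).
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 0]], [[0, 0, 1]]]
pooled = windowed_mask_union(masks, window=2)
strided = masks[::2]  # naive temporal downsampling that keeps every second frame
```

In this toy case the strided version drops the target positions seen only in frames 1 and 3, while the union keeps every position observed inside each window.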

Title: Learning Convex Decomposition via Feature Fields

Authors: Yuezhi Yang, Qixing Huang, Mikaela Angelina Uy, Nicholas Sharp
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09285
Pdf URL: https://arxiv.org/pdf/2603.09285
Copy Paste: [[2603.09285]] Learning Convex Decomposition via Feature Fields(https://arxiv.org/abs/2603.09285)
Keywords: self-supervised
Abstract: This work proposes a new formulation of the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.

Title: CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation

Authors: Shengqi Dang, Jiaying Lei, Yi He, Ziqing Qian, Nan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09286
Pdf URL: https://arxiv.org/pdf/2603.09286
Copy Paste: [[2603.09286]] CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation(https://arxiv.org/abs/2603.09286)
Keywords: generative
Abstract: Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.

Title: TA-GGAD: Testing-time Adaptive Graph Model for Generalist Graph Anomaly Detection

Authors: Xiong Zhang, Hong Peng, Changlong Fu, Xin Jin, Yun Yang, Cheng Xie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09349
Pdf URL: https://arxiv.org/pdf/2603.09349
Copy Paste: [[2603.09349]] TA-GGAD: Testing-time Adaptive Graph Model for Generalist Graph Anomaly Detection(https://arxiv.org/abs/2603.09349)
Keywords: foundation model, anomaly
Abstract: A significant number of anomalous nodes in the real world, such as fake news, noncompliant users, malicious transactions, and malicious posts, severely compromise the health of the graph data ecosystem and urgently require effective identification and processing. With anomalies that span multiple data domains yet exhibit vast differences in features, cross-domain detection models face severe domain shift issues, which limit their generalizability across all domains. This study identifies and quantitatively analyzes a specific feature mismatch pattern exhibited by domain shift in graph anomaly detection, which we define as the Anomaly Disassortativity issue ($\mathcal{AD}$). Based on the modeling of the issue $\mathcal{AD}$, we introduce a novel graph foundation model for anomaly detection. It achieves cross-domain generalization in different graphs, requiring only a single training phase to perform effectively across diverse domains. The experimental findings, based on fourteen diverse real-world graphs, confirm a breakthrough in the model's cross-domain adaptation, achieving a pioneering state-of-the-art (SOTA) level in terms of detection accuracy. In summary, the proposed theory of $\mathcal{AD}$ provides a novel theoretical perspective and a practical route for future research in generalist graph anomaly detection (GGAD). The code is available.

Title: Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework

Authors: Engin Deniz Erkan, Elif Surer, Ulas Yaman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09353
Pdf URL: https://arxiv.org/pdf/2603.09353
Copy Paste: [[2603.09353]] Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework(https://arxiv.org/abs/2603.09353)
Keywords: generative
Abstract: Surface roughness in Material Extrusion Additive Manufacturing varies across a part and is difficult to anticipate during process planning because it depends on both printing parameters and local surface inclination, which governs the staircase effect. A data-driven framework is presented to predict the arithmetic mean roughness (Ra) prior to fabrication using process parameters and surface angle. A structured experimental dataset was created using a three-level Box-Behnken design: 87 specimens were printed, each with multiple planar faces spanning different inclination angles, yielding 1566 Ra measurements acquired with a contact profilometer. A multilayer perceptron regressor was trained to capture nonlinear relationships between manufacturing conditions, inclination, and Ra. To mitigate limited experimental data, a conditional generative adversarial network was used to generate additional condition-specific tabular samples, thereby improving predictive performance. Model performance was assessed on a hold-out test set. A web-based decision-support interface was also developed to enable interactive process planning by loading a 3D model, specifying printing parameters, and adjusting the part's orientation. The system computes face-wise inclination from the model geometry and visualizes predicted Ra as an interactive colormap over the surface, enabling rapid identification of regions prone to high roughness and immediate comparison of parameter and orientation choices.

Title: ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation

Authors: Wenhao Yan, Ning An, Linxu Li, Bingsheng Bi, Bo Jiang, Zhigang Lu, Baoxu Liu, Junrong Liu, Cong Dong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.09358
Pdf URL: https://arxiv.org/pdf/2603.09358
Copy Paste: [[2603.09358]] ProvAgent: Threat Detection Based on Identity-Behavior Binding and Multi-Agent Collaborative Attack Investigation(https://arxiv.org/abs/2603.09358)
Keywords: anomaly
Abstract: Advanced Persistent Threats (APTs) pose critical challenges to modern cybersecurity due to their multi-stage and stealthy nature. While provenance-based detection approaches show promise in capturing causal attack semantics, current threat provenance practices face two paradoxical issues: (1) expert skepticism, where human analysts doubt the capability of traditional detection models to identify complex attacks; and (2) expert dependence, as analysts cannot manually process large-scale raw logs to detect threats without these models. Consequently, collaboration between humans and traditional models remains the prevailing paradigm. However, this renders investigation quality contingent upon human expertise and frequently results in alert fatigue. To address these challenges, we present ProvAgent, a framework that evolves the threat provenance paradigm from human-model collaboration to a novel collaboration between multi-agent systems and traditional models. ProvAgent leverages the speed and cost-efficiency of traditional models for initial anomaly screening over large-scale logs. By enforcing fine-grained identity-behavior consistency via graph contrastive learning, it profiles entities based on specific attributes to generate high-fidelity alerts. With these alerts serving as investigation entry points, ProvAgent achieves in-depth autonomous investigation through a hypothesis-verification multi-agent framework. Evaluations with real-world datasets demonstrate that ProvAgent outperforms six state-of-the-art (SOTA) baselines in anomaly detection. Through automated investigation, ProvAgent reconstructs near-complete attack processes at a minimum cost of $0.06 per day.

Title: M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition

Authors: Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09367
Pdf URL: https://arxiv.org/pdf/2603.09367
Copy Paste: [[2603.09367]] M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition(https://arxiv.org/abs/2603.09367)
Keywords: self-supervised
Abstract: In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive experiments show that M3GCLR achieves three-stream accuracies of 82.1% and 85.8% on NTU RGB+D 60 (X-Sub, X-View) and 72.3% and 75.0% on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Part I and II, it attains 89.1% and 45.2% in the three-stream setting, respectively, with all results matching or outperforming state-of-the-art performance. Ablation studies confirm the effectiveness of each component.

Title: MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification

Authors: Nikola Jovišić, Milica Škipina, Nicola Dall'Asen, Dubravko Ćulibrk
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09374
Pdf URL: https://arxiv.org/pdf/2603.09374
Copy Paste: [[2603.09374]] MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification(https://arxiv.org/abs/2603.09374)
Keywords: foundation model
Abstract: Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
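Attention-based aggregation over precomputed instance features can be sketched generically as below. This is not the MIL-PF head: scoring each instance with a single vector `w` is a simplifying assumption, and `attention_pool` is a hypothetical name.

```python
import math

def attention_pool(instances, w):
    """Aggregate instance feature vectors into one bag vector with softmax attention.

    Each instance is scored by its dot product with `w` (an assumed scoring rule);
    the scores are normalized with a numerically stable softmax.
    """
    scores = [sum(x * wi for x, wi in zip(inst, w)) for inst in instances]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * inst[i] for a, inst in zip(weights, instances)) for i in range(dim)]
    return bag, weights

# Two toy instance features, standing in for frozen-encoder outputs.
instances = [[1.0, 0.0], [0.0, 1.0]]
uniform_bag, uniform_weights = attention_pool(instances, w=[0.0, 0.0])
focused_bag, focused_weights = attention_pool(instances, w=[10.0, 0.0])
```

With a zero scoring vector the pooling reduces to a plain average; a scoring vector aligned with one instance shifts nearly all weight onto it, which is how sparse local signals can dominate the bag representation.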

Title: EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation

Authors: Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, Hui Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09385
Pdf URL: https://arxiv.org/pdf/2603.09385
Copy Paste: [[2603.09385]] EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation(https://arxiv.org/abs/2603.09385)
Keywords: foundation model
Abstract: Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods -- reducing the absolute mean depth error at 30m by over 53% on EventScape (from 2.30 to 1.06) -- while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
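Aligning inter-frame depth changes, as the abstract describes for TCD, can be sketched as a loss on frame-to-frame differences. The L1 form, the use of consecutive frames only, and the flat-list depth layout are assumptions made for illustration; `temporal_consistency_loss` is a hypothetical name. The last lines also check the reported relative error reduction from the two figures given in the abstract.

```python
def temporal_consistency_loss(student, teacher):
    """Mean absolute gap between student and teacher frame-to-frame depth changes.

    `student` and `teacher` are lists of frames; each frame is a flat list of depths.
    """
    if len(student) != len(teacher) or len(student) < 2:
        raise ValueError("need two sequences of equal length with at least 2 frames")
    total, count = 0.0, 0
    for t in range(1, len(student)):
        for s_prev, s_cur, t_prev, t_cur in zip(student[t - 1], student[t],
                                                teacher[t - 1], teacher[t]):
            total += abs((s_cur - s_prev) - (t_cur - t_prev))
            count += 1
    return total / count

teacher = [[0.0, 0.0], [1.0, 1.0]]
offset_student = [[5.0, 5.0], [6.0, 6.0]]   # same changes, constant offset
drifting_student = [[1.0, 2.0], [2.0, 4.0]]  # second pixel changes too fast

loss_offset = temporal_consistency_loss(offset_student, teacher)
loss_drift = temporal_consistency_loss(drifting_student, teacher)

# Relative reduction implied by the reported errors (2.30 -> 1.06).
relative_reduction = (2.30 - 1.06) / 2.30
```

A loss of this shape ignores a constant depth offset and penalizes only mismatched changes over time; the relative reduction evaluates to about 0.539, consistent with the "over 53%" stated in the abstract.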

Title: Training-Free Coverless Multi-Image Steganography with Access Control

Authors: Minyeol Bae, Si-Hyeon Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09390
Pdf URL: https://arxiv.org/pdf/2603.09390
Copy Paste: [[2603.09390]] Training-Free Coverless Multi-Image Steganography with Access Control(https://arxiv.org/abs/2603.09390)
Keywords: diffusion
Abstract: Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.

Title: Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Authors: Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09408
Pdf URL: https://arxiv.org/pdf/2603.09408
Copy Paste: [[2603.09408]] Reviving ConvNeXt for Efficient Convolutional Diffusion Models(https://arxiv.org/abs/2603.09408)
Keywords: diffusion, generative
Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7× and 7.5× fewer training steps at 256×256 and 512×512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.

Title: Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

Authors: Albus Yizhuo Li, Matthew Wicker
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.09453
Pdf URL: https://arxiv.org/pdf/2603.09453
Copy Paste: [[2603.09453]] Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers(https://arxiv.org/abs/2603.09453)
Keywords: foundation model
Abstract: Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring responsible deployment. While Bayesian methods offer a principled approach to uncertainty quantification, their computational overhead renders their use impractical for training or inference at foundation model scale. State-of-the-art models achieve parameter counts in the trillions through carefully engineered sparsity including Mixture-of-Experts (MoE) layers. In this work, we demonstrate calibrated uncertainty at scale by introducing Variational Mixture-of-Experts Routing (VMoER), a structured Bayesian approach for modelling uncertainty in MoE layers. VMoER confines Bayesian inference to the expert-selection stage which is typically done by a deterministic routing network. We instantiate VMoER using two inference strategies: amortised variational inference over routing logits and inferring a temperature parameter for stochastic expert selection. Across tested foundation models, VMoER improves routing stability under noise by 38%, reduces calibration error by 94%, and increases out-of-distribution AUROC by 12%, while incurring less than 1% additional FLOPs. These results suggest VMoER offers a scalable path toward robust and uncertainty-aware foundation models.
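Temperature-controlled stochastic expert selection, the second strategy named in this abstract, can be illustrated generically. The sketch below only shows sampling an expert from a temperature-scaled softmax over routing logits; how VMoER infers the temperature is not described in the abstract and is not modelled here, and `routing_probs` and `sample_expert` are hypothetical names.

```python
import math
import random

def routing_probs(logits, temperature):
    """Softmax over routing logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_expert(logits, temperature, rng):
    """Draw one expert index from the temperature-scaled routing distribution."""
    probs = routing_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
sharp = routing_probs(logits, temperature=0.1)   # close to deterministic top-1
flat = routing_probs(logits, temperature=10.0)   # close to uniform
choice = sample_expert(logits, temperature=1.0, rng=random.Random(0))
```

A low temperature recovers near-deterministic routing, while a high temperature spreads probability across experts, so the temperature acts as a single knob on routing uncertainty.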

Title: ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models

Authors: Yuqi Qian, Yun Cao, Haocheng Fu, Meiyang Lv, Meineng Zhu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.09454
Pdf URL: https://arxiv.org/pdf/2603.09454
Copy Paste: [[2603.09454]] ShapeMark: Robust and Diversity-Preserving Watermarking for Diffusion Models(https://arxiv.org/abs/2603.09454)
Keywords: diffusion
Abstract: Diffusion models have made substantial advances in recent years, enabling high-quality image synthesis; however, the widespread dissemination and reuse of their outputs have introduced new challenges in intellectual property protection and content provenance. Image watermarking offers a solution to these challenges, and recent work has increasingly explored Noise-as-Watermark (NaW) approaches that integrate watermarking directly into the diffusion process. However, existing NaW methods fail to balance robustness and diversity. We attribute this weakness to value encoding, which encodes watermark bits into individual sampled values and is therefore extremely fragile in practical application scenarios. To address this, we encode watermark bits into the structured noise pattern, so that the watermark is preserved even when individual values are perturbed. To further ensure generation diversity, we introduce a dedicated randomization design that reshuffles the positions of noise elements without changing their values, preventing the watermark from inducing fixed noise patterns or spatial locations. Extensive experiments demonstrate that our method achieves state-of-the-art robustness while maintaining high generation quality across a wide range of lossy scenarios.
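Reshuffling positions without changing values can be illustrated with a keyed permutation. This is a generic sketch, not ShapeMark's randomization design: deriving the permutation from a seeded pseudo-random generator is an assumption, and `keyed_permutation`, `shuffle_positions`, and `restore_positions` are hypothetical names.

```python
import random

def keyed_permutation(n, key):
    """Deterministic permutation of range(n) derived from `key` (assumed seeding scheme)."""
    order = list(range(n))
    random.Random(key).shuffle(order)
    return order

def shuffle_positions(values, key):
    """Move values to new positions; the multiset of values is unchanged."""
    order = keyed_permutation(len(values), key)
    return [values[i] for i in order]

def restore_positions(shuffled, key):
    """Invert shuffle_positions for the same key."""
    order = keyed_permutation(len(shuffled), key)
    restored = [None] * len(shuffled)
    for new_pos, old_pos in enumerate(order):
        restored[old_pos] = shuffled[new_pos]
    return restored

noise = [0.3, -1.2, 0.7, 2.1, -0.4, 0.0]
shuffled = shuffle_positions(noise, key=1234)
recovered = restore_positions(shuffled, key=1234)
```

Because only positions move, the value distribution of the noise is untouched, and anyone holding the key can undo the shuffle before decoding.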

Title: Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Authors: Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09484
Pdf URL: https://arxiv.org/pdf/2603.09484
Copy Paste: [[2603.09484]] Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion(https://arxiv.org/abs/2603.09484)
Keywords: diffusion
Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.

Title: Streaming Autoregressive Video Generation via Diagonal Distillation

Authors: Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09488
Pdf URL: https://arxiv.org/pdf/2603.09488
Copy Paste: [[2603.09488]] Streaming Autoregressive Video Generation via Diagonal Distillation(https://arxiv.org/abs/2603.09488)
Keywords: diffusion
Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.

Title: Temporal-Conditioned Normalizing Flows for Multivariate Time Series Anomaly Detection

Authors: David Baumgartner, Helge Langseth, Kenth Engø-Monsen, Heri Ramampiaro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09490
Pdf URL: https://arxiv.org/pdf/2603.09490
Copy Paste: [[2603.09490]] Temporal-Conditioned Normalizing Flows for Multivariate Time Series Anomaly Detection(https://arxiv.org/abs/2603.09490)
Keywords: anomaly
Abstract: This paper introduces temporal-conditioned normalizing flows (tcNF), a novel framework that addresses anomaly detection in time series data with accurate modeling of temporal dependencies and uncertainty. By conditioning normalizing flows on previous observations, tcNF effectively captures complex temporal dynamics and generates accurate probability distributions of expected behavior. This autoregressive approach enables robust anomaly detection by identifying low-probability events within the learned distribution. We evaluate tcNF on diverse datasets, demonstrating good accuracy and robustness compared to existing methods. A comprehensive analysis of strengths and limitations and open-source code are provided to facilitate reproducibility and future research.
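Scoring anomalies as low-probability events under a flow conditioned on the previous observation can be sketched with the simplest possible case. The example below uses a single affine flow layer with a linear conditioner and fixed, hand-picked parameters; all of that is an assumption for illustration and is far simpler than a trained tcNF. `conditional_log_prob` and `anomaly_scores` are hypothetical names.

```python
import math

def conditional_log_prob(x, prev, weight, bias, log_scale):
    """Log-density of x under a one-layer affine flow conditioned on `prev`."""
    mean = weight * prev + bias                        # assumed linear conditioner
    z = (x - mean) * math.exp(-log_scale)              # flow maps x to the base variable
    base = -0.5 * (z * z + math.log(2.0 * math.pi))    # standard normal log-density
    return base - log_scale                            # change of variables: log|dz/dx|

def anomaly_scores(series, weight, bias, log_scale):
    """Negative conditional log-likelihood for each step after the first."""
    return [-conditional_log_prob(x, prev, weight, bias, log_scale)
            for prev, x in zip(series, series[1:])]

series = [1.0, 1.0, 1.0, 5.0]  # the last value jumps away from its predecessor
scores = anomaly_scores(series, weight=1.0, bias=0.0, log_scale=0.0)
```

Steps that follow the conditioner's prediction receive the minimum score, while the jump in the final step gets a much higher score and would be flagged once a threshold is chosen.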

Title: Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Authors: Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09512
Pdf URL: https://arxiv.org/pdf/2603.09512
Copy Paste: [[2603.09512]] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning(https://arxiv.org/abs/2603.09512)
Keywords: self-supervised
Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.

Title: BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Authors: Chaodong Xiao, Zhengqiang Zhang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09582
Pdf URL: https://arxiv.org/pdf/2603.09582
Copy Paste: [[2603.09582]] BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers(https://arxiv.org/abs/2603.09582)
Keywords: diffusion
Abstract: Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at this https URL.
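
Illustrative sketch (not the authors' implementation): the abstract says only the signs of queries and keys are kept and floating-point dot products are replaced with bit-wise operations. For vectors in {-1, +1}^d the dot product equals d - 2 * (Hamming distance), so it can be computed with XOR and a popcount. The NumPy example below demonstrates that identity; the function names are hypothetical, and the learnable bias, softmax, and training procedure mentioned in the abstract are not modeled.

```python
import numpy as np

def pack_signs(x):
    """Pack sign bits of each row of x (shape [n, d]) into bytes; bit is 1 where x >= 0."""
    return np.packbits(x >= 0, axis=-1)

def binary_qk_scores(q, k):
    """Compute sign(q) @ sign(k).T using XOR + popcount instead of float multiplies."""
    d = q.shape[-1]
    qb, kb = pack_signs(q), pack_signs(k)                   # [n, ceil(d/8)] uint8
    xor = np.bitwise_xor(qb[:, None, :], kb[None, :, :])    # differing sign bits
    hamming = np.unpackbits(xor, axis=-1).sum(axis=-1)      # popcount per (query, key) pair
    return d - 2 * hamming.astype(np.int64)                 # dot product of +/-1 vectors

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 20))
k = rng.standard_normal((6, 20))
scores = binary_qk_scores(q, k)
```

The zero padding that `np.packbits` adds when d is not a multiple of 8 is identical for queries and keys, so it cancels in the XOR and does not affect the result.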

Title: A saccade-inspired approach to image classification using vision transformer attention maps

Authors: Matthis Dallain, Laurent Rodriguez, Laurent Udo Perrinet, Benoît Miramond
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09613
Pdf URL: https://arxiv.org/pdf/2603.09613
Copy Paste: [[2603.09613]] A saccade-inspired approach to image classification using vision transformer attention maps(https://arxiv.org/abs/2603.09613)
Keywords: self-supervised
Abstract: Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade-inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.

Title: Grounding Synthetic Data Generation With Vision and Language Models

Authors: Ümit Mert Çağlar, Alptekin Temizel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09625
Pdf URL: https://arxiv.org/pdf/2603.09625
Copy Paste: [[2603.09625]] Grounding Synthetic Data Generation With Vision and Language Models(https://arxiv.org/abs/2603.09625)
Keywords: generative
Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at this http URL and the code base at this http URL.

Title: X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models

Authors: Yueen Ma, Irwin King
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.09632
Pdf URL: https://arxiv.org/pdf/2603.09632
Copy Paste: [[2603.09632]] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models(https://arxiv.org/abs/2603.09632)
Keywords: foundation model
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.

Title: Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs

Authors: Ali Sadeghkhani, A. Assadi, B. Bennett, A. Rabbani
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2603.09651
Pdf URL: https://arxiv.org/pdf/2603.09651
Copy Paste: [[2603.09651]] Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs(https://arxiv.org/abs/2603.09651)
Keywords: generative
Abstract: Pore-scale imaging of subsurface formations is costly and limited to discrete depths, creating significant gaps in reservoir characterization. To address this, we present a conditional Generative Adversarial Network (cGAN) framework for synthesizing realistic thin section images of carbonate rock formations, conditioned on porosity values derived from well logs. Trained on 5,000 sub-images extracted from 15 petrography samples over a depth interval of 1992-2000 m, the model generates geologically consistent images across a wide porosity range (0.004-0.745), achieving 81% accuracy within a 10% margin of target porosity values. The successful integration of well log data with the trained generator enables continuous pore-scale visualization along the wellbore, bridging gaps between discrete core sampling points and providing valuable insights for reservoir characterization and energy transition applications such as carbon capture and underground hydrogen storage.
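
Illustrative sketch (an assumption about the metric, not the authors' code): the abstract reports accuracy "within a 10% margin of target porosity values". One plausible reading is the fraction of generated images whose measured porosity lies within a relative tolerance of the target. The abstract does not say whether the margin is relative or absolute, so the relative definition below is an assumption, and the sample values are made up for illustration.

```python
def fraction_within_margin(targets, generated, margin=0.10):
    """Fraction of samples with |generated - target| <= margin * |target| (relative margin assumed)."""
    hits = sum(abs(g - t) <= margin * abs(t) for t, g in zip(targets, generated))
    return hits / len(targets)

# Hypothetical target vs. measured porosity values, for illustration only.
accuracy = fraction_within_margin([0.10, 0.20, 0.40, 0.50], [0.105, 0.25, 0.37, 0.50])
```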

Title: When to Lock Attention: Training-Free KV Control in Video Diffusion

Authors: Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang
Subjects: cs.CV, cs.AI, cs.ET, eess.IV
Abstract URL: https://arxiv.org/abs/2603.09657
Pdf URL: https://arxiv.org/pdf/2603.09657
Copy Paste: [[2603.09657]] When to Lock Attention: Training-Free KV Control in Video Diffusion(https://arxiv.org/abs/2603.09657)
Keywords: diffusion
Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches, improving foreground quality while maintaining high background fidelity across various video editing tasks.

Title: GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation

Authors: Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09675
Pdf URL: https://arxiv.org/pdf/2603.09675
Copy Paste: [[2603.09675]] GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation(https://arxiv.org/abs/2603.09675)
Keywords: anomaly
Abstract: There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, and despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred. In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.

Title: Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records

Authors: Jacopo Vitale, David Della Morte, Luca Bacco, Mario Merone, Mark de Groot, Saskia Haitjema, Leandro Pecchia, Bram van Es
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.09685
Pdf URL: https://arxiv.org/pdf/2603.09685
Copy Paste: [[2603.09685]] Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records(https://arxiv.org/abs/2603.09685)
Keywords: generative
Abstract: To overcome the limitations of manual administrative coding in geriatric Cardiovascular Risk Management, this study introduces an automated classification framework leveraging unstructured Electronic Health Records (EHRs). Using a dataset of 3,482 patients, we benchmarked three distinct modeling paradigms on longitudinal Dutch clinical narratives: classical machine learning baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative Large Language Models (LLMs) in a zero-shot setting. Additionally, we evaluated a late fusion strategy to integrate unstructured text with structured medication embeddings and anthropometric data. Our analysis reveals that the custom Transformer architecture outperforms both traditional methods and generative LLMs, achieving the highest F1-scores and Matthews Correlation Coefficients. These findings underscore the critical role of specialized hierarchical attention mechanisms in capturing long-range dependencies within medical texts, presenting a robust, automated alternative to manual workflows for clinical risk stratification.

Title: TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR

Authors: Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09702
Pdf URL: https://arxiv.org/pdf/2603.09702
Copy Paste: [[2603.09702]] TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR(https://arxiv.org/abs/2603.09702)
Keywords: diffusion
Abstract: Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusion-SR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware cross-modal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
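
Illustrative sketch (not the authors' code): the abstract describes decomposing features into frequency bands with the 2D Discrete Wavelet Transform. The example below shows a single-level 2D DWT with the orthonormal Haar wavelet in NumPy. The choice of Haar is an assumption made for brevity, since the abstract does not name the wavelet, and the function name and band names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an array with even height and width.

    Returns (low, cols_diff, rows_diff, diag), each half the size of x.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0           # low-frequency approximation
    cols_diff = (a - b + c - d) / 2.0     # difference between horizontally adjacent pixels
    rows_diff = (a + b - c - d) / 2.0     # difference between vertically adjacent pixels
    diag = (a - b - c + d) / 2.0          # diagonal detail
    return low, cols_diff, rows_diff, diag

image = np.random.default_rng(0).standard_normal((8, 8))
bands = haar_dwt2(image)
```

Because this transform is orthonormal, the total energy (sum of squares) of the four bands equals that of the input, and a constant image produces zero detail bands.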

Title: FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

Authors: Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09721
Pdf URL: https://arxiv.org/pdf/2603.09721
Copy Paste: [[2603.09721]] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation(https://arxiv.org/abs/2603.09721)
Keywords: diffusion
Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.

Title: FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis

Authors: Xiaotian Hu, Junwei Huang, Mingxuan Liu, Kasidit Anmahapong, Yifei Chen, Yitong Luo, Yiming Huang, Xuguang Bai, Zihan Li, Yi Liao, Haibo Qu, Qiyuan Tian
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2603.09733
Pdf URL: https://arxiv.org/pdf/2603.09733
Copy Paste: [[2603.09733]] FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis(https://arxiv.org/abs/2603.09733)
Keywords: foundation model
Abstract: Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.

Title: LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos

Authors: Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09743
Pdf URL: https://arxiv.org/pdf/2603.09743
Copy Paste: [[2603.09743]] LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos(https://arxiv.org/abs/2603.09743)
Keywords: diffusion
Abstract: Procedure planning requires a model to predict a sequence of actions that transform a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions and to predict actions and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, COIN, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.

Title: LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control

Authors: Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09759
Pdf URL: https://arxiv.org/pdf/2603.09759
Copy Paste: [[2603.09759]] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control(https://arxiv.org/abs/2603.09759)
Keywords: diffusion
Abstract: Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.

Title: Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

Authors: Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09803
Pdf URL: https://arxiv.org/pdf/2603.09803
Copy Paste: [[2603.09803]] Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning(https://arxiv.org/abs/2603.09803)
Keywords: in-context
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that get correct answers by chance. We observe that better reasoning traces are better teachers: high-quality solutions serve as more effective demonstrations than low-quality ones. We term this teaching ability Demonstration Utility, and show that the policy model's own in-context learning ability provides an efficient way to measure it, yielding a quality signal termed Evidence Gain. To employ this signal during training, we introduce In-Context RLVR. By Bayesian analysis, we show that this objective implicitly reweights rewards by Evidence Gain, assigning higher weights to high-quality traces and lower weights to low-quality ones, without requiring costly computation or external evaluators. Experiments on mathematical benchmarks show improvements in both accuracy and reasoning quality over standard RLVR.

Title: ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Authors: Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09819
Pdf URL: https://arxiv.org/pdf/2603.09819
Copy Paste: [[2603.09819]] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation(https://arxiv.org/abs/2603.09819)
Keywords: diffusion
Abstract: We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
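
Illustrative sketch (not the authors' method): the abstract describes a Kalman-inspired predict-update mechanism that treats the projected point cloud as a noisy measurement. The example below shows only the textbook elementwise Kalman update that this idea builds on, where a low-confidence measurement is modeled as one with a large variance. The paper's learned residual corrections are not modeled, and all names and values here are hypothetical.

```python
import numpy as np

def kalman_update(x_pred, p_pred, z, r):
    """Elementwise Kalman update.

    x_pred, p_pred: predicted state and its variance.
    z, r: measurement and its variance (large r means low confidence).
    """
    gain = p_pred / (p_pred + r)            # how much to trust the measurement
    x_new = x_pred + gain * (z - x_pred)    # blend prediction with measurement
    p_new = (1.0 - gain) * p_pred           # uncertainty shrinks after the update
    return x_new, p_new

# Two elements: one with a reliable measurement, one with a very uncertain measurement.
x_new, p_new = kalman_update(
    x_pred=np.array([0.0, 0.0]),
    p_pred=np.array([1.0, 1.0]),
    z=np.array([2.0, 2.0]),
    r=np.array([1.0, 1e6]),
)
```

With equal prediction and measurement variance the result lands halfway between the two; with a very uncertain measurement the estimate stays close to the prediction, which mirrors the down-weighting of uncertain regions described in the abstract.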

Title: Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

Authors: Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.09906
Pdf URL: https://arxiv.org/pdf/2603.09906
Copy Paste: [[2603.09906]] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs(https://arxiv.org/abs/2603.09906)
Keywords: generative
Abstract: While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.

Title: WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09921
Pdf URL: https://arxiv.org/pdf/2603.09921
Copy Paste: [[2603.09921]] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition(https://arxiv.org/abs/2603.09921)
Keywords: generative
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at this https URL

Title: Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation

Authors: Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09931
Pdf URL: https://arxiv.org/pdf/2603.09931
Copy Paste: [[2603.09931]] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation(https://arxiv.org/abs/2603.09931)
Keywords: diffusion
Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at this https URL

Title: Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective

Authors: Erkan Turan, Maks Ovsjanikov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09936
Pdf URL: https://arxiv.org/pdf/2603.09936
Copy Paste: [[2603.09936]] Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective(https://arxiv.org/abs/2603.09936)
Keywords: generative
Abstract: Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to Landau damping in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
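Illustrative sketch (not from the paper): the score-difference observation can be checked numerically in one dimension. The code assumes that the drift is a difference of two normalized Gaussian mean-shift vectors, one toward data samples and one toward generated samples. That form is an assumption made here for illustration; the abstract does not spell out the operator. Under this assumption the drift equals $\sigma^2$ times the difference of the scores of the two kernel-smoothed sample distributions. All function names are hypothetical, and the score is approximated by a central finite difference. The last function evaluates the annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ stated in the abstract.

```python
import math


def gaussian_kernel(u, sigma):
    """Unnormalized Gaussian kernel exp(-u^2 / (2 sigma^2))."""
    return math.exp(-u * u / (2.0 * sigma * sigma))


def mean_shift(x, samples, sigma):
    """Kernel-weighted mean of the samples minus x (Gaussian mean shift)."""
    weights = [gaussian_kernel(x - s, sigma) for s in samples]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, samples)) / total - x


def drift(x, data, generated, sigma):
    """ASSUMED drift form: attraction to data minus attraction to generated samples."""
    return mean_shift(x, data, sigma) - mean_shift(x, generated, sigma)


def smoothed_log_density(x, samples, sigma):
    """Log of the Gaussian-smoothed empirical density, up to an additive constant."""
    return math.log(sum(gaussian_kernel(x - s, sigma) for s in samples))


def smoothed_score(x, samples, sigma, eps=1e-5):
    """Central finite-difference estimate of d/dx log(smoothed density)."""
    upper = smoothed_log_density(x + eps, samples, sigma)
    lower = smoothed_log_density(x - eps, samples, sigma)
    return (upper - lower) / (2.0 * eps)


def annealed_sigma(t, sigma0, r):
    """Exponential bandwidth annealing: sigma(t) = sigma0 * exp(-r * t)."""
    return sigma0 * math.exp(-r * t)


if __name__ == "__main__":
    data = [-1.0, 0.5, 2.0]        # toy "data" samples (made up)
    generated = [0.0, 1.0, 3.0]    # toy "generated" samples (made up)
    sigma = 0.8
    x = 0.3
    lhs = drift(x, data, generated, sigma)
    rhs = sigma ** 2 * (
        smoothed_score(x, data, sigma) - smoothed_score(x, generated, sigma)
    )
    print("drift:", lhs)
    print("sigma^2 * score difference:", rhs)
    print("annealed bandwidth at t=1:", annealed_sigma(1.0, sigma, 0.5))
```

The two printed drift values should agree up to finite-difference error, because the gradient of the log of a Gaussian kernel density estimate is the mean-shift vector divided by $\sigma^2$.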

Title: SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG

Authors: Fredrik K. Gustafsson, Xiao Gu, Mattia Carletti, Patitapaban Palo, David W. Eyre, David A. Clifton
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09940
Pdf URL: https://arxiv.org/pdf/2603.09940
Copy Paste: [[2603.09940]] SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG(https://arxiv.org/abs/2603.09940)
Keywords: foundation model
Abstract: Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long-duration multimodal data remains limited. We introduce SignalMC-MED, a benchmark for evaluating biosignal FMs on synchronized single-lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC-MED dataset, SignalMC-MED comprises 22,256 visits with 10-minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD-10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time-series and biosignal FMs across ECG-only, PPG-only, and ECG + PPG settings. We find that domain-specific biosignal FMs consistently outperform general time-series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10-minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand-crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC-MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.

Title: From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

Authors: Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09955
Pdf URL: https://arxiv.org/pdf/2603.09955
Copy Paste: [[2603.09955]] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding(https://arxiv.org/abs/2603.09955)
Keywords: self-supervised
Abstract: Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.