2026-03-17

Title: Translational Gaps in Graph Transformers for Longitudinal EHR Prediction: A Critical Appraisal of GT-BEHRT

Authors: Krish Tadigotla
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.13231
Pdf URL: https://arxiv.org/pdf/2603.13231
Copy Paste: [[2603.13231]] Translational Gaps in Graph Transformers for Longitudinal EHR Prediction: A Critical Appraisal of GT-BEHRT(https://arxiv.org/abs/2603.13231)
Keywords: self-supervised
Abstract: Transformer-based models have improved predictive modeling on longitudinal electronic health records through large-scale self-supervised pretraining. However, most EHR transformer architectures treat each clinical encounter as an unordered collection of codes, which limits their ability to capture meaningful relationships within a visit. Graph-transformer approaches aim to address this limitation by modeling visit-level structure while retaining the ability to learn long-term temporal patterns. This paper provides a critical review of GT-BEHRT, a graph-transformer architecture evaluated on MIMIC-IV intensive care outcomes and heart failure prediction in the All of Us Research Program. We examine whether the reported performance gains reflect genuine architectural benefits and whether the evaluation methodology supports claims of robustness and clinical relevance. We analyze GT-BEHRT across seven dimensions relevant to modern machine learning systems, including representation design, pretraining strategy, cohort construction transparency, evaluation beyond discrimination, fairness assessment, reproducibility, and deployment feasibility. GT-BEHRT reports strong discrimination for heart failure prediction within 365 days, with AUROC 94.37 +/- 0.20, AUPRC 73.96 +/- 0.83, and F1 64.70 +/- 0.85. Despite these results, we identify several important gaps, including the lack of calibration analysis, incomplete fairness evaluation, sensitivity to cohort selection, limited analysis across phenotypes and prediction horizons, and limited discussion of practical deployment considerations. Overall, GT-BEHRT represents a meaningful architectural advance in EHR representation learning, but more rigorous evaluation focused on calibration, fairness, and deployment is needed before such models can reliably support clinical decision-making.
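
Illustrative sketch (not from the paper): the missing calibration analysis noted in the abstract above can be made concrete with a binned expected calibration error (ECE) for binary risk scores. The function name, the equal-width binning, and the toy inputs are assumptions chosen for illustration; nothing here reproduces the GT-BEHRT evaluation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions (illustrative, equal-width bins)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Map each probability to a bin index; p == 1.0 falls in the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        # Gap between observed event rate and mean predicted risk in this bin.
        gap = abs(labels[mask].mean() - probs[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy data (hypothetical): every patient is scored 0.9, but only half have the outcome.
toy_probs = [0.9] * 10
toy_labels = [1, 0] * 5
print(expected_calibration_error(toy_probs, toy_labels))
```

A model can rank patients well (high AUROC) and still show a large gap of this kind, which is why discrimination metrics alone do not establish that predicted risks are usable as probabilities.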

Title: Your Code Agent Can Grow Alongside You with Structured Memory

Authors: Yi-Xuan Deng, Xiaoqin Liu, Yi Zhang, Guo-Wei Yang, Shuojin Yang
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.13258
Pdf URL: https://arxiv.org/pdf/2603.13258
Copy Paste: [[2603.13258]] Your Code Agent Can Grow Alongside You with Structured Memory(https://arxiv.org/abs/2603.13258)
Keywords: foundation model
Abstract: While "Intent-oriented programming" (or "Vibe Coding") redefines software engineering, existing code agents remain tethered to static code snapshots. Consequently, they struggle to model the critical information embedded in the temporal evolution of projects, failing to leverage the "reasoning trajectories" implicit in past successful practices. This limitation results in rigid behavioral logic and a lack of autonomous adaptability, ultimately hindering their ability to tackle complex, repository-level problems. To bridge this static-dynamic mismatch, we propose MemCoder, a framework designed to enable continual human-AI co-evolution. MemCoder first structures historical human experience to distill latent intent-to-code mappings from past commits. It then employs a self-refinement mechanism driven by verification feedback to correct agent behavior in real-time. Crucially, an experience self-internalization mechanism is introduced to crystallize human-validated solutions into long-term knowledge, thereby supporting sustained evolution. Experimental results on SWE-bench Verified demonstrate that MemCoder not only achieves State-of-the-Art (SOTA) performance but also delivers a 9.4% improvement in resolved rate over the general foundation model DeepSeek-V3.2. These findings indicate that equipping agents with the capability to co-evolve with humans via project history and real-time feedback effectively unlocks the potential of general models in complex software engineering tasks.

Title: Knowledge, Rules and Their Embeddings: Two Paths towards Neuro-Symbolic JEPA

Authors: Yongchao Huang, Hassan Raza
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13265
Pdf URL: https://arxiv.org/pdf/2603.13265
Copy Paste: [[2603.13265]] Knowledge, Rules and Their Embeddings: Two Paths towards Neuro-Symbolic JEPA(https://arxiv.org/abs/2603.13265)
Keywords: diffusion, self-supervised, generative
Abstract: Modern self-supervised predictive architectures excel at capturing complex statistical correlations from high-dimensional data but lack mechanisms to internalize verifiable human logic, leaving them susceptible to spurious correlations and shortcut learning. Conversely, traditional rule-based inference systems offer rigorous, interpretable logic but suffer from discrete boundaries and NP-hard combinatorial explosion. To bridge this divide, we propose a bidirectional neuro-symbolic framework centered around Rule-informed Joint-Embedding Predictive Architectures (RiJEPA). In the first direction, we inject structured inductive biases into JEPA training via Energy-Based Constraints (EBC) and a multi-modal dual-encoder architecture. This fundamentally reshapes the representation manifold, replacing arbitrary statistical correlations with geometrically sound logical basins. In the second direction, we demonstrate that by relaxing rigid, discrete symbolic rules into a continuous, differentiable logic, we can bypass traditional combinatorial search for new rule generation. By leveraging gradient-guided Langevin diffusion within the rule energy landscape, we introduce novel paradigms for continuous rule discovery, which enable unconditional joint generation, conditional forward and abductive inference, and marginal predictive translation. Empirical evaluations on both synthetic topological simulations and a high-stakes clinical use case confirm the efficacy of our approach. Ultimately, this framework establishes a powerful foundation for robust, generative, and interpretable neuro-symbolic representation learning.

Title: CAMEL-CLIP: Channel-aware Multimodal Electroencephalography-text Alignment for Generalizable Brain Foundation Models

Authors: Hanseul Choi, Jinyeong Park, Seongwon Jin, Sungho Park, Jibum Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13272
Pdf URL: https://arxiv.org/pdf/2603.13272
Copy Paste: [[2603.13272]] CAMEL-CLIP: Channel-aware Multimodal Electroencephalography-text Alignment for Generalizable Brain Foundation Models(https://arxiv.org/abs/2603.13272)
Keywords: foundation model
Abstract: Electroencephalography (EEG) foundation models have shown promise for learning generalizable representations, yet they remain sensitive to channel heterogeneity, such as changes in channel composition or ordering. We propose channel-aware multimodal EEG-text alignment contrastive language-image pretraining (CAMEL-CLIP), a contrastive EEG-text multimodal foundation model designed to be robust to heterogeneous channel configurations and widely applicable to diverse downstream tasks. CAMEL-CLIP introduces three key components: (1) channel attribute-based positional encoding, which identifies channels through semantic information; (2) dynamic channel projection, which generates variable-length embeddings by independently projecting each channel without feature compression; and (3) dual-level contrastive learning, which jointly performs channel-level and sample-level contrastive learning to capture both channel-specific and global signal characteristics. Experimental results demonstrate that CAMEL-CLIP achieves state-of-the-art performance under linear-probing and outperforms existing foundation models that rely on full-finetuning.

Title: A Stability-Aware Frozen Euler Autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM)

Authors: Emil Hovad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13280
Pdf URL: https://arxiv.org/pdf/2603.13280
Copy Paste: [[2603.13280]] A Stability-Aware Frozen Euler Autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM)(https://arxiv.org/abs/2603.13280)
Keywords: diffusion
Abstract: We introduce a Stability-Aware Frozen Euler autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM) that recovers material parameters and temporal field evolution from videos of physical processes. The architecture is an autoencoder whose latent-space transition is governed by a frozen PDE operator: a convolutional encoder maps each frame to a latent field; the SAFE operator propagates it forward via sub-stepped finite differences; and a decoder reconstructs the video. Because the physics is embedded as a frozen, differentiable layer, backpropagation yields gradients that directly supervise an attention-based estimator for the transport coefficient alpha, requiring no ground-truth labels. The SAFE operator is the central contribution. Temporal snapshots are saved at intervals far larger than the simulation time step; a forward Euler step at the frame interval violates the von Neumann stability condition, causing alpha to collapse to an unphysical value. The SAFE operator resolves this by sub-stepping the frozen finite-difference stencil to match the original temporal resolution, restoring stability and enabling accurate parameter recovery. We demonstrate SAFE-PIT-CM on the heat equation (diffusion, alpha < 0) and the reverse heat equation (mobility, alpha > 0). SAFE-PIT-CM also supports zero-shot inference: learning alpha from a single simulation with no training data, using only the SAFE loss as supervision. The zero-shot mode achieves accuracy comparable to a pre-trained model. The architecture generalises to any PDE admitting a convolutional finite-difference discretisation. Because latent dynamics are governed by a known PDE, SAFE-PIT-CM is inherently explainable: every prediction is traceable to a physical transport coefficient and step-by-step PDE propagation.
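
Illustrative sketch (not from the paper): the stability argument in the abstract above can be reproduced on a 1D periodic grid. The sketch assumes the convention u_t = alpha * u_xx with alpha > 0 for diffusion, which may differ from the paper's sign convention, and covers only that diffusive case. The helper names (`laplacian_1d`, `min_substeps`, `advance_frame`) and all numerical values are hypothetical.

```python
import numpy as np

def laplacian_1d(u, dx):
    """Second-order central difference with periodic boundaries."""
    return (np.roll(u, 1) - 2.0 * u + np.roll(u, -1)) / dx**2

def min_substeps(alpha, dx, frame_dt):
    """Smallest sub-step count with alpha * dt / dx^2 <= 1/2 (assumes alpha > 0)."""
    return max(1, int(np.ceil(2.0 * alpha * frame_dt / dx**2)))

def advance_frame(u, alpha, dx, frame_dt, n_sub):
    """Advance one saved-frame interval with n_sub forward Euler sub-steps."""
    dt = frame_dt / n_sub
    for _ in range(n_sub):
        u = u + alpha * dt * laplacian_1d(u, dx)
    return u

n = 64
dx = 2.0 * np.pi / n
x = dx * np.arange(n)
# Smooth profile plus a small grid-scale (checkerboard) perturbation.
u0 = np.sin(x) + 0.01 * (-1.0) ** np.arange(n)
alpha, frame_dt, n_frames = 1.0, 0.1, 5

u_naive, u_safe = u0.copy(), u0.copy()
n_sub = min_substeps(alpha, dx, frame_dt)
for _ in range(n_frames):
    u_naive = advance_frame(u_naive, alpha, dx, frame_dt, 1)    # one Euler step per frame
    u_safe = advance_frame(u_safe, alpha, dx, frame_dt, n_sub)  # sub-stepped propagation

print(n_sub, np.abs(u_naive).max(), np.abs(u_safe).max())
```

With one Euler step per frame, the grid-scale mode is amplified at every frame and the field blows up; with enough sub-steps each update is a convex combination of neighbouring values, so the field stays bounded by its initial maximum.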

Title: Do Diffusion Models Dream of Electric Planes? Discrete and Continuous Simulation-Based Inference for Aircraft Design

Authors: Aurelien Ghiglino, Daniel Elenius, Anirban Roy, Ramneet Kaur, Manoj Acharya, Colin Samplawski, Brian Matejek, Susmit Jha, Juan Alonso, Adam Cobb
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.13284
Pdf URL: https://arxiv.org/pdf/2603.13284
Copy Paste: [[2603.13284]] Do Diffusion Models Dream of Electric Planes? Discrete and Continuous Simulation-Based Inference for Aircraft Design(https://arxiv.org/abs/2603.13284)
Keywords: diffusion
Abstract: In this paper, we generate conceptual engineering designs of electric vertical take-off and landing (eVTOL) aircraft. We follow the paradigm of simulation-based inference (SBI), whereby we look to learn a posterior distribution over the full eVTOL design space. To learn this distribution, we sample over discrete aircraft configurations (topologies) and their corresponding set of continuous parameters. Therefore, we introduce a hierarchical probabilistic model consisting of two diffusion models. The first model leverages recent work on Riemannian Diffusion Language Modeling (RDLM) and Unified World Models (UWMs) to enable us to sample topologies from a discrete and continuous space. For the second model we introduce a masked diffusion approach to sample the corresponding parameters conditioned on the topology. Our approach rediscovers known trends and governing physical laws in aircraft design, while significantly accelerating design generation.

Title: TAS-GNN: A Status-Aware Signed Graph Neural Network for Anomaly Detection in Bitcoin Trust Systems

Authors: Chang Xue, Fang Liu, Jiaye Wang, Jinming Xing, Chen Yang
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13290
Pdf URL: https://arxiv.org/pdf/2603.13290
Copy Paste: [[2603.13290]] TAS-GNN: A Status-Aware Signed Graph Neural Network for Anomaly Detection in Bitcoin Trust Systems(https://arxiv.org/abs/2603.13290)
Keywords: anomaly
Abstract: Decentralized financial platforms rely heavily on Web of Trust reputation systems to mitigate counterparty risk in the absence of centralized identity verification. However, these pseudonymous networks are inherently vulnerable to adversarial behaviors, such as Sybil attacks and camouflaged fraud, where malicious actors cultivate artificial reputations before executing exit scams. Traditional anomaly detection in this domain faces two critical limitations. First, reliance on naive statistical heuristics (e.g., flagging the lowest 5% of rated users) fails to distinguish between victims of bad-mouthing attacks and actual fraudsters. Second, standard Graph Neural Networks (GNNs) operate on the assumption of homophily and cannot effectively process the semantic inversion inherent in signed (trust vs. distrust) and directed (status) edges. We propose TAS-GNN (Topology-Aware Signed Graph Neural Network), a novel framework designed for feature-sparse signed networks like Bitcoin-Alpha. TAS-GNN integrates recursive Web-of-Trust labeling and a dual-channel message-passing architecture that separately models trust and distrust signals, fused through a Status-Aware Attention mechanism. Experiments demonstrate that TAS-GNN achieves state-of-the-art performance, significantly outperforming existing signed GNN baselines.

Title: ICPRL: Acquiring Physical Intuition from Interactive Control

Authors: Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Shuo Zhang, Zhiming Ding, Bo Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13295
Pdf URL: https://arxiv.org/pdf/2603.13295
Copy Paste: [[2603.13295]] ICPRL: Acquiring Physical Intuition from Interactive Control(https://arxiv.org/abs/2603.13295)
Keywords: in-context
Abstract: Vision-language models (VLMs) excel at static perception but falter at interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of potential actions. At inference, the policy proposes candidate actions, while the world model predicts outcomes to guide a root-node PUCT search to select the most promising action. Evaluated on the diverse physics-based puzzle-solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements in both its (I) policy-only and (II) world-model-augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that our framework facilitates genuine in-context acquisition of the environment's physical dynamics from interactive experience.

Title: DreamReader: An Interpretability Toolkit for Text-to-Image Models

Authors: Nirmalendu Prakash, Narmeen Oozeer, Michael Lan, Luka Samkharadze, Phillip Howard, Roy Ka-Wei Lee, Dhruv Nathawani, Shivam Raval, Amirali Abdullah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13299
Pdf URL: https://arxiv.org/pdf/2603.13299
Copy Paste: [[2603.13299]] DreamReader: An Interpretability Toolkit for Text-to-Image Models(https://arxiv.org/abs/2603.13299)
Keywords: diffusion
Abstract: Despite the rapid adoption of text-to-image (T2I) diffusion models, causal and representation-level analysis remains fragmented and largely limited to isolated probing techniques. To address this gap, we introduce DreamReader: a unified framework that formalizes diffusion interpretability as composable representation operators spanning activation extraction, causal patching, structured ablations, and activation steering across modules and timesteps. DreamReader provides a model-agnostic abstraction layer enabling systematic analysis and intervention across diffusion architectures. Beyond consolidating existing methods, DreamReader introduces three novel intervention primitives for diffusion models: (1) representation fine-tuning (LoReFT) for subspace-constrained internal adaptation; (2) classifier-guided gradient steering using MLP probes trained on activations; and (3) component-level cross-model mapping for systematic study of transferability of representations across modalities. These mechanisms allow us to perform lightweight white-box interventions on T2I models by drawing inspiration from interpretability techniques for LLMs. We demonstrate DreamReader through controlled experiments that (i) perform activation stitching between two models, and (ii) apply LoReFT to steer multiple activation units, reliably injecting a target concept into the generated images. Experiments are specified declaratively and executed in controlled batched pipelines to enable reproducible large-scale analysis. Across multiple case studies, we show that techniques adapted from language model interpretability yield promising and controllable interventions in diffusion models. DreamReader is released as an open source toolkit for advancing research on T2I interpretability.

Title: Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation

Authors: Mingyu Kim, Young-Heon Kim, Mijung Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13300
Pdf URL: https://arxiv.org/pdf/2603.13300
Copy Paste: [[2603.13300]] Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation(https://arxiv.org/abs/2603.13300)
Keywords: diffusion, generative
Abstract: Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data-driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy-based negative guidance against unsafe data samples. Furthermore, we leverage control-barrier functions analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high-quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.
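
Illustrative sketch (not from the paper): the Maximum Mean Discrepancy mentioned in the abstract above measures how far apart two sample sets are in a kernel feature space. The snippet below computes the standard biased estimate of squared MMD with a Gaussian kernel. The bandwidth, the sample sizes, and the names `generated` and `unsafe` are assumptions; the paper's actual potential and guidance term are not reproduced here.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian kernel between rows of a and rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

rng = np.random.default_rng(0)
generated = rng.normal(0.0, 1.0, size=(200, 2))  # hypothetical generated samples
unsafe = rng.normal(4.0, 1.0, size=(200, 2))     # hypothetical unsafe reference set
print(mmd_squared(generated, generated), mmd_squared(generated, unsafe))
```

A negative-guidance scheme in this spirit would push samples in the direction that increases their discrepancy from the unsafe set; how that push is scheduled over denoising time is the question the abstract addresses.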

Title: Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision

Authors: Kirill Borodin, Kirill Kondrashov, Nikita Vasiliev, Ksenia Gladkova, Inna Larina, Mikhail Gorodnichev, Grach Mkrtchian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13306
Pdf URL: https://arxiv.org/pdf/2603.13306
Copy Paste: [[2603.13306]] Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision(https://arxiv.org/abs/2603.13306)
Keywords: anomaly
Abstract: CCTV safety monitoring demands that anomaly detectors combine reliable clip-level accuracy with predictable per-clip latency despite weak supervision. This work investigates compact vision-language models (VLMs) as practical detectors for this regime. A unified evaluation protocol standardizes preprocessing, prompting, dataset splits, metrics, and runtime settings to compare parameter-efficiently adapted compact VLMs against training-free VLM pipelines and weakly supervised baselines. Evaluation spans accuracy, precision, recall, F1, ROC-AUC, and average per-clip latency to jointly quantify detection quality and efficiency. With parameter-efficient adaptation, compact VLMs achieve performance on par with, and in several cases exceeding, established approaches while retaining competitive per-clip latency. Adaptation further reduces prompt sensitivity, producing more consistent behavior across prompt regimes under the shared protocol. These results show that parameter-efficient fine-tuning enables compact VLMs to serve as dependable clip-level anomaly detectors, yielding a favorable accuracy-efficiency trade-off within a transparent and consistent experimental setup.

Title: LightningRL: Breaking the Accuracy-Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning

Authors: Yanzhe Hu, Yijie Jin, Pengfei Liu, Kai Yu, Zhijie Deng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13319
Pdf URL: https://arxiv.org/pdf/2603.13319
Copy Paste: [[2603.13319]] LightningRL: Breaking the Accuracy-Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning(https://arxiv.org/abs/2603.13319)
Keywords: diffusion
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising paradigm for parallel token generation, with block-wise variants garnering significant research interest. Despite their potential, existing dLLMs typically suffer from a rigid accuracy-parallelism trade-off: increasing the number of tokens per forward (TPF) via aggressive parallel decoding often leads to performance degradation and increased generation instability. We identify that this limitation stems from the model's inability to navigate high-parallelism regimes where approximation errors and local corruptions accumulate, ultimately undermining the reliability of parallel generation. To address this, we propose LightningRL, a post-training framework designed to directly optimize the speed-quality Pareto frontier of pre-trained dLLMs. Instead of forcing uniform parallelization, our approach leverages reinforcement learning to identify and reinforce high-parallelism trajectories that maintain generation accuracy. Built upon the Group Relative Policy Optimization (GRPO) framework, LightningRL introduces several enhancements tailored for dLLMs: (1) stabilized training via per-reward decoupled normalization; (2) token-level negative log-likelihood (NLL) regularization on correct trajectories to anchor model performance; and (3) a dynamic sampling strategy with TPF-aware filtering to enhance training efficiency. Experimental results across mathematical and coding benchmarks demonstrate that LightningRL consistently advances the Pareto frontier, achieving competitive task accuracy while significantly increasing parallelism, reaching an average TPF of 7.32 (with a peak of 11.10 on the MBPP dataset). Our code is available at this https URL.
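
Illustrative sketch (not from the paper): one plausible reading of "per-reward decoupled normalization" in a GRPO-style setup is to standardize each reward component within a rollout group separately before combining them, so that a component with a large numeric range (such as TPF) does not swamp a binary accuracy reward. This interpretation, the function names, the equal weights, and the toy rewards are all assumptions.

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def decoupled_advantages(reward_components, weights=None):
    """Normalize each reward component separately, then take a weighted sum."""
    names = sorted(reward_components)
    if weights is None:
        weights = {name: 1.0 for name in names}
    return sum(weights[name] * group_normalize(reward_components[name]) for name in names)

group = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],  # hypothetical correctness reward per rollout
    "tpf": [8.0, 2.0, 4.0, 6.0],       # hypothetical tokens-per-forward reward per rollout
}
advantages = decoupled_advantages(group)
print(advantages)
```

In this toy group, the rollout that is both correct and highly parallel receives the largest advantage, while a fast but incorrect rollout is still penalized.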

Title: RBF-Solver: A Multistep Sampler for Diffusion Probabilistic Models via Radial Basis Functions

Authors: Soochul Park, Yeon Ju Lee, SeongJin Yoon, Jiyub Shin, Juhee Lee, Seongwoon Jo
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.13330
Pdf URL: https://arxiv.org/pdf/2603.13330
Copy Paste: [[2603.13330]] RBF-Solver: A Multistep Sampler for Diffusion Probabilistic Models via Radial Basis Functions(https://arxiv.org/abs/2603.13330)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models (DPMs) are widely adopted for their outstanding generative fidelity, yet their sampling is computationally demanding. Polynomial-based multistep samplers mitigate this cost by accelerating inference; however, despite their theoretical accuracy guarantees, they generate the sampling trajectory according to a predefined scheme, providing no flexibility for further optimization. To address this limitation, we propose RBF-Solver, a multistep diffusion sampler that interpolates model evaluations with Gaussian radial basis functions (RBFs). By leveraging learnable shape parameters in Gaussian RBFs, RBF-Solver explicitly follows optimal sampling trajectories. At first order, it reduces to the Euler method (DDIM). At second order or higher, as the shape parameters approach infinity, RBF-Solver converges to the Adams method, ensuring its compatibility with existing samplers. Owing to the locality of Gaussian RBFs, RBF-Solver maintains high image fidelity even at fourth order or higher, where previous samplers deteriorate. For unconditional generation, RBF-Solver consistently outperforms polynomial-based samplers in the high-NFE regime (NFE >= 15). On CIFAR-10 with the Score-SDE model, it achieves an FID of 2.87 with 15 function evaluations and further improves to 2.48 with 40 function evaluations. For conditional ImageNet 256 x 256 generation with the Guided Diffusion model at a guidance scale 8.0, substantial gains are achieved in the low-NFE range (5-10), yielding a 16.12-33.73% reduction in FID relative to polynomial-based samplers.
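
Illustrative sketch (not from the paper): Adams-type multistep methods are built on polynomial interpolation of past evaluations, and a Gaussian RBF interpolant in one dimension is known to approach the polynomial interpolant as the basis functions become flat. The snippet below checks this numerically on three samples of f(x) = x^2. The parameterization exp(-(r / width)^2) is an assumption, and the paper's shape parameter may be defined differently; this is not the RBF-Solver update itself.

```python
import numpy as np

def gaussian_rbf_interpolate(nodes, values, query, width):
    """Interpolate (nodes, values) with Gaussian RBFs and evaluate at query."""
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    # Kernel matrix between nodes; solving it gives the interpolation weights.
    kernel = np.exp(-((nodes[:, None] - nodes[None, :]) / width) ** 2)
    weights = np.linalg.solve(kernel, values)
    basis = np.exp(-((query - nodes) / width) ** 2)
    return float(basis @ weights)

nodes = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 4.0]  # samples of f(x) = x^2
# Quadratic polynomial interpolant evaluated at x = 1.5 for comparison.
poly = float(np.polyval(np.polyfit(nodes, values, 2), 1.5))
for width in (1.0, 3.0, 10.0):
    print(width, gaussian_rbf_interpolate(nodes, values, 1.5, width), poly)
```

Narrow basis functions give a visibly different interpolant, which is the extra degree of freedom a learnable shape parameter exposes; very wide ones recover the polynomial behaviour but make the kernel matrix increasingly ill-conditioned.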

Title: Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

Authors: Xi Chen, Maojun Zhang, Yu Liu, Shen Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13352
Pdf URL: https://arxiv.org/pdf/2603.13352
Copy Paste: [[2603.13352]] Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts(https://arxiv.org/abs/2603.13352)
Keywords: foundation model
Abstract: Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
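
Illustrative sketch (not from the paper): top-k expert routing, as described in the abstract above, can be written in a few lines. Each modality gets its own gate, which picks the k highest-scoring experts and mixes their outputs. The residual update, the linear experts, the sharing of experts across modalities, and every name and dimension below are assumptions for illustration, not SpectralMoE's actual design.

```python
import numpy as np

def top_k_gate(logits, k):
    """Sparse gate weights: softmax over the k largest logits, zero elsewhere."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]
    shifted = np.exp(logits[top] - logits[top].max())
    gates = np.zeros_like(logits)
    gates[top] = shifted / shifted.sum()
    return gates

def moe_refine(feature, gate_matrix, expert_matrices, k=2):
    """Refine one feature vector with a residual mixture of its top-k experts."""
    gates = top_k_gate(gate_matrix @ feature, k)
    update = sum(g * (w @ feature) for g, w in zip(gates, expert_matrices) if g > 0)
    return feature + update, gates

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
experts = [0.1 * rng.normal(size=(dim, dim)) for _ in range(n_experts)]
visual_gate = rng.normal(size=(n_experts, dim))  # separate gate per modality
depth_gate = rng.normal(size=(n_experts, dim))
visual_out, visual_gates = moe_refine(rng.normal(size=dim), visual_gate, experts)
depth_out, depth_gates = moe_refine(rng.normal(size=dim), depth_gate, experts)
print(visual_gates, depth_gates)
```

Applied per spatial location, routing of this kind lets different regions of an image receive different refinements, in contrast to a single global adjustment.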

Title: AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Authors: Hamza Mooraj, George Pantazopoulos, Alessandro Suglia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13354
Pdf URL: https://arxiv.org/pdf/2603.13354
Copy Paste: [[2603.13354]] AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification(https://arxiv.org/abs/2603.13354)
Keywords: generative
Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.

Title: Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection

Authors: Patricia L. Suarez, Leo Thomas Ramos, Angel D. Sappa
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13357
Pdf URL: https://arxiv.org/pdf/2603.13357
Copy Paste: [[2603.13357]] Bi-CamoDiffusion: A Boundary-informed Diffusion Approach for Camouflaged Object Detection(https://arxiv.org/abs/2603.13357)
Keywords: diffusion
Abstract: We introduce Bi-CamoDiffusion, an evolution of the CamoDiffusion framework for camouflaged object detection. It integrates edge priors into early-stage embeddings via a parameter-free injection process, which enhances boundary sharpness and prevents structural ambiguity. This is governed by a unified optimization objective that balances spatial accuracy, structural constraints, and uncertainty supervision, allowing the model to capture both the object's global context and its intricate boundary transitions. Evaluations across the CAMO, COD10K, and NC4K benchmarks show that Bi-CamoDiffusion surpasses the baseline, delivering sharper delineation of thin structures and protrusions while also minimizing false positives. Moreover, our model consistently outperforms existing state-of-the-art methods across all evaluated metrics, including $S_m$, $F_{\beta}^{w}$, $E_m$, and $MAE$, demonstrating more precise object-background separation and sharper boundary recovery.

Title: Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution

Authors: Hua Liu, Yanbin Wei, Fei Xing, Tyler Derr, Haoyu Han, Yu Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13360
Pdf URL: https://arxiv.org/pdf/2603.13360
Copy Paste: [[2603.13360]] Graph2Video: Leveraging Video Models to Model Dynamic Graph Evolution(https://arxiv.org/abs/2603.13360)
Keywords: foundation model
Abstract: Dynamic graphs are common in real-world systems such as social media, recommender systems, and traffic networks. Existing dynamic graph models for link prediction often fall short in capturing the complexity of temporal evolution. They tend to overlook fine-grained variations in temporal interaction order, struggle with dependencies that span long time horizons, and offer limited capability to model pair-specific relational dynamics. To address these challenges, we propose Graph2Video, a video-inspired framework that views the temporal neighborhood of a target link as a sequence of "graph frames". By stacking temporally ordered subgraph frames into a "graph video", Graph2Video leverages the inductive biases of video foundation models to capture both fine-grained local variations and long-range temporal dynamics. It generates a link-level embedding that serves as a lightweight and plug-and-play link-centric memory unit. This embedding integrates seamlessly into existing dynamic graph encoders, effectively addressing the limitations of prior approaches. Extensive experiments on benchmark datasets show that Graph2Video outperforms state-of-the-art baselines on the link prediction task in most cases. The results highlight the potential of borrowing spatio-temporal modeling techniques from computer vision as a promising and effective approach for advancing dynamic graph learning.

Title: Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Authors: Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13366
Pdf URL: https://arxiv.org/pdf/2603.13366
Copy Paste: [[2603.13366]] Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding(https://arxiv.org/abs/2603.13366)
Keywords: in-context
Abstract: Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.

Title: Real-Time Monocular Scene Analysis for UAV in Outdoor Environments

Authors: Yara AlaaEldin
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.13368
Pdf URL: https://arxiv.org/pdf/2603.13368
Copy Paste: [[2603.13368]] Real-Time Monocular Scene Analysis for UAV in Outdoor Environments(https://arxiv.org/abs/2603.13368)
Keywords: diffusion
Abstract: In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture, named Co-SemDepth, that can perform the two tasks accurately and rapidly, and validate its effectiveness on a variety of datasets. The training of neural networks requires an abundance of annotated data, and in the UAV field, the availability of such data is limited. We introduce a new synthetic dataset in this thesis, TopAir, which contains images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap. While using synthetic data for training is convenient, it raises issues when shifting to the real domain for testing. We conduct an extensive analytical study to assess the effect of several factors on synthetic-to-real generalization. Co-SemDepth and TaskPrompter models are used for comparison in this study. The results reveal a superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Also, our analysis allows us to determine which training datasets lead to better generalization. Moreover, to help attenuate the gap between the synthetic and real domains, image style transfer techniques are explored on aerial images to convert from the synthetic to the realistic style. Cycle-GAN and Diffusion models are employed. The results reveal that diffusion models are better at synthetic-to-real style transfer. In the end, we focus on the marine domain and address its challenges. Co-SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. The results reveal good generalization performance of Co-SemDepth when tested on real data from the SMD dataset, while further enhancement is needed on the MIT dataset.

Title: Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection

Authors: Ali Zia, Usman Ali, Muhammad Umer Ramzan, Hamza Abid, Abdul Rehman, Wei Xiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13374
Pdf URL: https://arxiv.org/pdf/2603.13374
Copy Paste: [[2603.13374]] Geometry-Aware Semantic Reasoning for Training Free Video Anomaly Detection(https://arxiv.org/abs/2603.13374)
Keywords: anomaly
Abstract: Training-free video anomaly detection (VAD) has recently emerged as a scalable alternative to supervised approaches, yet existing methods largely rely on static prompting and geometry-agnostic feature fusion. As a result, anomaly inference is often reduced to shallow similarity matching over Euclidean embeddings, leading to unstable predictions and limited interpretability, especially in complex or hierarchically structured scenes. We introduce MM-VAD, a geometry-aware semantic reasoning framework for training-free VAD that reframes anomaly detection as adaptive test-time inference rather than fixed feature comparison. Our approach projects caption-derived scene representations into hyperbolic space to better preserve hierarchical structure and performs anomaly assessment through an adaptive question answering process over a frozen large language model. A lightweight, learnable prompt is optimised at test time using an unsupervised confidence-sparsity objective, enabling context-specific calibration without updating any backbone parameters. To further ground semantic predictions in visual evidence, we incorporate a covariance-aware Mahalanobis refinement that stabilises cross-modal alignment. Across four benchmarks, MM-VAD consistently improves over prior training-free methods, achieving 90.03% AUC on XD-Violence and 83.24%, 96.95%, and 98.81% on UCF-Crime, ShanghaiTech, and UCSD Ped2, respectively. Our results demonstrate that geometry-aware representation and adaptive semantic calibration provide a principled and effective alternative to static Euclidean matching in training-free VAD.

Title: InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization

Authors: Ronghui Li, Zhongyuan Hu, Li Siyao, Youliang Zhang, Haozhe Xie, Mingyuan Zhang, Jie Guo, Xiu Li, Ziwei Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13375
Pdf URL: https://arxiv.org/pdf/2603.13375
Copy Paste: [[2603.13375]] InfiniteDance: Scalable 3D Dance Generation Towards in-the-wild Generalization(https://arxiv.org/abs/2603.13375)
Keywords: diffusion
Abstract: Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dance, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. (1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints that enforce physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset totaling 100.69 hours. (2) On model design, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythms across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation. Code, models, and data will be released.

Title: DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1

Authors: Zhenpeng Zhang, Jinwei Lu, Yurui Dong, Bo Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13382
Pdf URL: https://arxiv.org/pdf/2603.13382
Copy Paste: [[2603.13382]] DINOv3 with Test-Time Calibration for Automated Carotid Intima-Media Thickness Measurement on CUBS v1(https://arxiv.org/abs/2603.13382)
Keywords: foundation model
Abstract: Carotid intima-media thickness (CIMT) measured from B-mode ultrasound is an established vascular biomarker for atherosclerosis and cardiovascular risk stratification. Although a wide range of computerized methods have been proposed for carotid boundary delineation and CIMT estimation, robust and transferable deep models that jointly address segmentation and measurement remain underexplored, particularly in the era of vision foundation models. Motivated by recent advances in adapting DINOv3 to medical segmentation and exploiting DINOv3 in test-time optimization pipelines, we investigate a DINOv3-based framework for carotid intima-media complex segmentation and subsequent CIMT measurement on the Carotid Ultrasound Boundary Study (CUBS) v1 dataset. Our pipeline predicts the intima-media band at a fixed image resolution, extracts upper and lower boundaries column-wise, corrects for image resizing using the per-image calibration factor provided by CUBS, and reports CIMT in physical units. Across three patient-level test splits, our method achieved a mean test Dice of 0.7739 $\pm$ 0.0037 and IoU of 0.6384 $\pm$ 0.0044. The mean CIMT absolute error was 181.16 $\pm$ 11.57 $\mu$m, with a mean Pearson correlation of 0.480 $\pm$ 0.259. In a held-out validation subset ($n=28$), test-time threshold calibration reduced the mean absolute CIMT error from 141.0 $\mu$m at the default threshold to 101.1 $\mu$m at the measurement-optimized threshold, while simultaneously reducing systematic bias toward zero. Relative to the error ranges reported in the original CUBS benchmark for classical computerized methods, these results place a DINOv3-based approach within the clinically relevant $\sim$0.1 mm measurement regime. Together, our findings support the feasibility of using vision foundation models for interpretable, calibration-aware CIMT measurement.

Title: Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Authors: Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13386
Pdf URL: https://arxiv.org/pdf/2603.13386
Copy Paste: [[2603.13386]] Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers(https://arxiv.org/abs/2603.13386)
Keywords: diffusion, generative, in-context
Abstract: Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.

Title: High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Authors: Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13389
Pdf URL: https://arxiv.org/pdf/2603.13389
Copy Paste: [[2603.13389]] High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding(https://arxiv.org/abs/2603.13389)
Keywords: diffusion
Abstract: Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby keeping the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.

Title: Colony Grounded SAM2: Zero-shot detection and segmentation of bacterial colonies using foundation models

Authors: Daan Korporaal, Patrick de Kruijf, Ralph H.G.M. Litjens, Bas H.M. van der Velden
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13393
Pdf URL: https://arxiv.org/pdf/2603.13393
Copy Paste: [[2603.13393]] Colony Grounded SAM2: Zero-shot detection and segmentation of bacterial colonies using foundation models(https://arxiv.org/abs/2603.13393)
Keywords: foundation model
Abstract: The detection and classification of bacterial colonies in images of agar plates is important in microbiology, but is hindered by the lack of labeled datasets. Therefore, we propose Colony Grounded SAM2, a zero-shot inference pipeline to detect and segment bacterial colonies in multiple settings without any further training. By utilizing the pre-trained foundation models Grounding DINO and Segment Anything Model 2, fine-tuned to the microbiological domain, we developed a model that is robust to data changes. Results showed a mean Average Precision of 93.1% and a $Dice@detection$ score of 0.85, showing excellent detection and segmentation capabilities on out-of-distribution datasets. The entire pipeline, including model weights, is shared open access to aid annotation and classification in microbiology.

Title: Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models

Authors: Sihan Cao, Jianwei Zhang, Pengcheng Zheng, Jiaxin Yan, Caiyan Qin, Yalan Ye, Wei Dong, Peng Wang, Yang Yang, Chaoning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13394
Pdf URL: https://arxiv.org/pdf/2603.13394
Copy Paste: [[2603.13394]] Language-Guided Token Compression with Reinforcement Learning in Large Vision-Language Models(https://arxiv.org/abs/2603.13394)
Keywords: self-supervised
Abstract: Large Vision-Language Models (LVLMs) incur substantial inference costs due to the processing of a vast number of visual tokens. Existing methods typically struggle to model progressive visual token reduction as a multi-step decision process with sequential dependencies and often rely on hand-engineered scoring rules that lack adaptive optimization for complex reasoning trajectories. To overcome these limitations, we propose TPRL, a reinforcement learning framework that learns adaptive pruning trajectories through language-guided sequential optimization tied directly to end-task performance. We formulate visual token pruning as a sequential decision process with explicit state transitions and employ a self-supervised autoencoder to compress visual tokens into a compact state representation for efficient policy learning. The pruning policy is initialized through learning from demonstrations and subsequently fine-tuned using Proximal Policy Optimization (PPO) to jointly optimize task accuracy and computational efficiency. Our experimental results demonstrate that TPRL removes up to 66.7% of visual tokens and achieves up to a 54.2% reduction in FLOPs during inference while maintaining a near-lossless average accuracy drop of only 0.7%. Code is released at this https URL.

Title: SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation

Authors: Jan Kociszewski, Hubert Jastrzębski, Tymoteusz Stępkowski, Filip Manijak, Krzysztof Rojek, Franziska Boenisch, Adam Dziedzic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13396
Pdf URL: https://arxiv.org/pdf/2603.13396
Copy Paste: [[2603.13396]] SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation(https://arxiv.org/abs/2603.13396)
Keywords: diffusion
Abstract: We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.

Title: MAD: Microenvironment-Aware Distillation -- A Pretraining Strategy for Virtual Spatial Omics from Microscopy

Authors: Jiashu Han, Kunzan Liu, Yeojin Kim, Saurabh Sinha, Sixian You
Subjects: cs.CV, cs.AI, physics.optics
Abstract URL: https://arxiv.org/abs/2603.13401
Pdf URL: https://arxiv.org/pdf/2603.13401
Copy Paste: [[2603.13401]] MAD: Microenvironment-Aware Distillation -- A Pretraining Strategy for Virtual Spatial Omics from Microscopy(https://arxiv.org/abs/2603.13401)
Keywords: self-supervised, foundation model
Abstract: Bridging microscopy and omics would allow us to read molecular states from images, at single-cell resolution and tissue scale, without the cost and throughput limits of omics technologies. Self-supervised pretraining offers a scalable approach with minimal labels, yet how to encode single-cell identity within tissue environments, and the extent of biological information such models can capture, remains an open question. Here, we introduce MAD (microenvironment-aware distillation), a pretraining strategy that learns cell-centric embeddings by jointly self-distilling the morphology view and the microenvironment view of the same indexed cell into a unified embedding space. Across diverse tissues and imaging modalities, MAD achieves state-of-the-art prediction performance on downstream tasks including cell subtyping, transcriptomic prediction, and bioinformatic inference. MAD even outperforms foundation models with a similar number of model parameters that have been trained on substantially larger datasets. These results demonstrate that MAD's dual-view joint self-distillation effectively captures the complexity and diversity of cells within tissues. Together, this establishes MAD as a general tool for representation learning in microscopy, enabling virtual spatial omics and biological insights from vast microscopy datasets.

Title: Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

Authors: Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2603.13405
Pdf URL: https://arxiv.org/pdf/2603.13405
Copy Paste: [[2603.13405]] Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion(https://arxiv.org/abs/2603.13405)
Keywords: diffusion
Abstract: Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: this https URL

Title: Diffusion Models Generalize but Not in the Way You Might Think

Authors: Tim Kaiser, Markus Kollmann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13419
Pdf URL: https://arxiv.org/pdf/2603.13419
Copy Paste: [[2603.13419]] Diffusion Models Generalize but Not in the Way You Might Think(https://arxiv.org/abs/2603.13419)
Keywords: diffusion
Abstract: Standard evaluation metrics suggest that Denoising Diffusion Models based on U-Net or Transformer architectures generalize well in practice. However, as it can be shown that an optimal Diffusion Model fully memorizes the training data, the model error determines generalization. Here, we show that although sufficiently large denoiser models show increasing memorization of the training set with increasing training time, the resulting denoising trajectories do not follow this trend. Our experiments indicate that the reason for this observation is rooted in the fact that overfitting occurs at intermediate noise levels, but the distribution of noisy training data at these noise levels has little overlap with denoising trajectories during inference. To gain more insight, we make use of a 2D toy diffusion model to show that overfitting at intermediate noise levels is largely determined by model error and the density of the data support. While the optimal denoising flow field localizes sharply around training samples, sufficient model error or dense support on the data manifold suppresses exact recall, yielding a smooth, generalizing flow field. To further support our results, we investigate how several factors, such as training time, model size, dataset size, condition granularity, and diffusion guidance, influence generalization behavior.

Title: Generalization and Memorization in Rectified Flow

Authors: Mingxing Rao, Daniel Moyer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.13421
Pdf URL: https://arxiv.org/pdf/2603.13421
Copy Paste: [[2603.13421]] Generalization and Memorization in Rectified Flow(https://arxiv.org/abs/2603.13421)
Keywords: generative
Abstract: Generative models based on the Flow Matching objective, particularly Rectified Flow, have emerged as a dominant paradigm for efficient, high-fidelity image synthesis. However, while existing research heavily prioritizes generation quality and architectural scaling, the underlying dynamics of how RF models memorize training data remain largely underexplored. In this paper, we systematically investigate the memorization behaviors of RF through the test statistics of Membership Inference Attacks (MIA). We progressively formulate three test statistics, culminating in a complexity-calibrated metric ($T_\text{mc\_cal}$) that successfully decouples intrinsic image spatial complexity from genuine memorization signals. This calibration yields a significant performance surge, boosting attack AUC by up to 15% and the privacy-critical TPR@1%FPR metric by up to 45%, and establishes the first non-trivial MIA specifically tailored for RF. Leveraging these refined metrics, we uncover a distinct temporal pattern: under standard uniform temporal training, a model's susceptibility to MIA strictly peaks at the integration midpoint, a phenomenon we justify via the network's forced deviation from linear approximations. Finally, we demonstrate that substituting uniform timestep sampling with a Symmetric Exponential (U-shaped) distribution effectively minimizes exposure to vulnerable intermediate timesteps. Extensive evaluations across three datasets confirm that this temporal regularization suppresses memorization while preserving generative fidelity.

Title: Self-Flow-Matching assisted Full Waveform Inversion

Authors: Xinquan Huang, Paris Perdikaris
Subjects: cs.LG, cs.AI, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2603.13425
Pdf URL: https://arxiv.org/pdf/2603.13425
Copy Paste: [[2603.13425]] Self-Flow-Matching assisted Full Waveform Inversion(https://arxiv.org/abs/2603.13425)
Keywords: diffusion, generative
Abstract: Full-waveform inversion (FWI) is a high-resolution seismic imaging method that estimates subsurface velocity by matching simulated and recorded waveforms. However, FWI is highly nonlinear, prone to cycle skipping, and sensitive to noise, particularly when low frequencies are missing or the initial model is poor, leading to failures under imperfect acquisition. Diffusion-regularized FWI introduces generative priors to encourage geologically realistic models, but these priors typically require costly offline pretraining and can deteriorate under distribution shift. Moreover, they assume Gaussian initialization and a fixed noise schedule, in which it is unclear how to map a deterministic FWI iterate and its starting model to a well-defined diffusion time or noise level. To address these limitations, we introduce Self-Flow-Matching assisted Full-Waveform Inversion (SFM-FWI), a physics-driven framework that eliminates the need for large-scale offline pretraining while avoiding the noise-level alignment ambiguity. SFM-FWI leverages flow matching to learn a transport field without assuming Gaussian initialization or a predefined noise schedule, so the initial model can be used directly as the starting point of the dynamics. Our approach trains a single flow network online using the governing physics and observed data. At each outer iteration, we build an interpolated model and update the flow by backpropagating the FWI data misfit, providing self-supervision without external training pairs. Experiments on challenging synthetic benchmarks show that SFM-FWI delivers more accurate reconstructions, greater noise robustness, and more stable convergence than standard FWI and pretraining-free regularization methods.

Title: CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design

Authors: Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13431
Pdf URL: https://arxiv.org/pdf/2603.13431
Copy Paste: [[2603.13431]] CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design(https://arxiv.org/abs/2603.13431)
Keywords: generative
Abstract: Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce Chimera-Bench (CDR Modeling with Epitope-guided Redesign), a unified benchmark built around a single canonical task: epitope-conditioned CDR sequence-structure co-design. Chimera-Bench provides (1) a curated, deduplicated dataset of 2,922 antibody-antigen complexes with epitope and paratope annotations; (2) three biologically motivated splits testing generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets; and (3) a comprehensive evaluation protocol with five metric groups including novel epitope-specificity measures. We benchmark representative methods spanning different generative paradigms and report results across all splits. Chimera-Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability. The source code and data are available at: this https URL

Title: Modality-free Graph In-context Alignment

Authors: Wei Zhuo, Siqiang Luo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13434
Pdf URL: https://arxiv.org/pdf/2603.13434
Copy Paste: [[2603.13434]] Modality-free Graph In-context Alignment(https://arxiv.org/abs/2603.13434)
Keywords: foundation model, in-context
Abstract: In-context learning (ICL) converts static encoders into task-conditioned reasoners, enabling adaptation to new data from just a few examples without updating pretrained parameters. This capability is essential for graph foundation models (GFMs) to approach LLM-level generality. Yet current GFMs struggle with cross-domain alignment, typically relying on modality-specific encoders that fail when graphs are pre-vectorized or raw data is inaccessible. In this paper, we introduce Modality-Free Graph In-context Alignment (MF-GIA), a framework that makes a pretrained graph encoder promptable for few-shot prediction across heterogeneous domains without modality assumptions. MF-GIA captures domain characteristics through gradient fingerprints, which parameterize lightweight transformations that align pre-encoded features and indexed labels into unified semantic spaces. During pretraining, a dual prompt-aware attention mechanism with episodic objective learns to match queries against aligned support examples to establish prompt-based reasoning capabilities. At inference, MF-GIA performs parameter-update-free adaptation using only a few-shot support set to trigger cross-domain alignment and enable immediate prediction on unseen domains. Experiments demonstrate that MF-GIA achieves superior few-shot performance across diverse graph domains and strong generalization to unseen domains.

Title: CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models

Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Han Hu, Lefei Zhang, Dacheng Tao
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.13435
Pdf URL: https://arxiv.org/pdf/2603.13435
Copy Paste: [[2603.13435]] CtrlAttack: A Unified Attack on World-Model Control in Diffusion Models(https://arxiv.org/abs/2603.13435)
Keywords: diffusion
Abstract: Diffusion-based image-to-video (I2V) models increasingly exhibit world-model-like properties by implicitly capturing temporal dynamics. However, existing studies have mainly focused on visual quality and controllability, and the robustness of the state transition learned by the model remains understudied. To fill this gap, we are the first to analyze the vulnerability of I2V models, find that temporal control mechanisms constitute a new attack surface, and reveal the challenge of modeling them uniformly under different attack settings. Based on this, we propose a trajectory-control attack, called CtrlAttack, to interfere with state evolution during the generation process. Specifically, we represent the perturbation as a low-dimensional velocity field and construct a continuous displacement field via temporal integration, thereby affecting the model's state transitions while maintaining temporal consistency; meanwhile, we map the perturbation to the observation space, making the method applicable to both white-box and black-box attack settings. Experimental results show that even under low-dimensional and strongly regularized perturbation constraints, our method can still significantly disrupt temporal consistency by increasing the attack success rate (ASR) to over 90% in the white-box setting and over 80% in the black-box setting, while keeping the variation of the FID and FVD within 6 and 130, respectively, thus revealing the potential security risk of I2V models at the level of state dynamics.

Title: Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection

Authors: Eman Ouda, Mohammed Salah, Arsenii O. Chulkov, Gianfranco Gargiulo, Gian Luca Tartaglia, Stefano Sfarra, Yusra Abdulrahman
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2603.13437
Pdf URL: https://arxiv.org/pdf/2603.13437
Copy Paste: [[2603.13437]] Vision-Language Based Expert Reporting for Painting Authentication and Defect Detection(https://arxiv.org/abs/2603.13437)
Keywords: anomaly
Abstract: Authenticity and condition assessment are central to conservation decision-making, yet interpretation and reporting of thermographic output remain largely bespoke and expert-dependent, complicating comparison across collections and limiting systematic integration into conservation documentation. Pulsed Active Infrared Thermography (AIRT) is sensitive to subsurface features such as material heterogeneity, voids, and past interventions; however, its broader adoption is constrained by artifact misinterpretation, inter-laboratory variability, and the absence of standardized, explainable reporting frameworks. Although multi-modal thermographic processing techniques are established, their integration with structured natural-language interpretation has not been explored in cultural heritage. A fully automated thermography-vision-language model (VLM) framework is presented. It combines multi-modal AIRT analysis with modality-aware textual reporting, without human intervention during inference. Thermal sequences are processed using Principal Component Thermography (PCT), Thermographic Signal Reconstruction (TSR), and Pulsed Phase Thermography (PPT), and the resulting anomaly masks are fused into a consensus segmentation that emphasizes regions supported by multiple thermal indicators while mitigating boundary artifacts. The fused evidence is provided to a VLM, which generates structured reports describing the location of the anomaly, thermal behavior, and plausible physical interpretations while explicitly acknowledging the uncertainty and diagnostic limitations. Evaluation on two marquetries demonstrates consistent anomaly detection and stable structured interpretations, indicating reproducibility and generalizability across samples.

Title: Draft-and-Target Sampling for Video Generation Policy

Authors: Qikang Zhang, Yingjie Lei, Wei Liu, Daochang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13438
Pdf URL: https://arxiv.org/pdf/2603.13438
Copy Paste: [[2603.13438]] Draft-and-Target Sampling for Video Generation Policy(https://arxiv.org/abs/2603.13438)
Keywords: diffusion
Abstract: Video generation models have been used as robot policies that predict the future states of a task's execution, conditioned on the task description and observation. Previous works ignore their high computational cost and long inference time. To address this challenge, we propose Draft-and-Target Sampling, a novel diffusion inference paradigm for video generation policy that is training-free and can improve inference efficiency. We introduce a self-play denoising approach that uses two complementary denoising trajectories in a single model: draft sampling takes large steps to quickly generate a global trajectory, and target sampling takes small steps to verify it. To further speed up generation, we introduce token chunking and a progressive acceptance strategy to reduce redundant computation. Experiments on three benchmarks show that our method can achieve up to 2.1x speedup and improve the efficiency of current state-of-the-art methods with minimal compromise to the success rate. Our code is available.

Title: Improving Channel Estimation via Multimodal Diffusion Models with Flow Matching

Authors: Xiaotian Fan, Xingyu Zhou, Le Liang, Xiao Li, Shi Jin
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2603.13440
Pdf URL: https://arxiv.org/pdf/2603.13440
Copy Paste: [[2603.13440]] Improving Channel Estimation via Multimodal Diffusion Models with Flow Matching(https://arxiv.org/abs/2603.13440)
Keywords: diffusion, generative
Abstract: Deep generative models offer a powerful alternative to conventional channel estimation by learning complex channel distributions. By integrating the rich environmental information available in modern sensing-aided networks, this paper proposes MultiCE-Flow, a multimodal channel estimation framework based on flow matching and diffusion transformer (DiT). We design a specialized multimodal perception module that fuses LiDAR, camera, and location data into a semantic condition, while treating sparse pilots as a structural condition. These conditions guide a DiT backbone to reconstruct high-fidelity channels. Unlike standard diffusion models, we employ flow matching to learn a linear trajectory from noise to data, enabling efficient one-step sampling. By leveraging environmental semantics, our method mitigates the ill-posed nature of estimation with sparse pilots. Extensive experiments demonstrate that MultiCE-Flow consistently outperforms traditional baselines and existing generative models. Notably, it exhibits superior robustness to out-of-distribution scenarios and varying pilot densities, making it suitable for environment-aware communication systems.
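The "linear trajectory from noise to data" and one-step sampling mentioned in this abstract can be illustrated with a minimal NumPy sketch. This is not the MultiCE-Flow implementation: the function names, the toy 4-dimensional vectors, and the oracle velocity field (which stands in for a trained, condition-guided network) are all assumptions made for illustration.

```python
import numpy as np

def linear_path(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path.

    The path's time derivative is constant, d/dt x_t = x1 - x0.
    """
    return x1 - x0

def one_step_sample(x0, velocity_fn):
    """Single Euler step from t=0 to t=1 using a velocity field."""
    return x0 + 1.0 * velocity_fn(x0, 0.0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)  # noise sample
x1 = rng.standard_normal(4)  # hypothetical stand-in for a data sample

# Oracle velocity field that already knows x1; a trained network
# would only approximate this from its conditioning inputs.
def oracle(x, t):
    return target_velocity(x0, x1)

x_hat = one_step_sample(x0, oracle)
x_mid = linear_path(x0, x1, 0.5)
```

Because the velocity along a straight path is constant, a single Euler step is exact when the velocity field is exact, which is the intuition behind efficient one-step sampling; with a learned field the step is only as good as the network's approximation.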

Title: LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models

Authors: Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.13450
Pdf URL: https://arxiv.org/pdf/2603.13450
Copy Paste: [[2603.13450]] LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models(https://arxiv.org/abs/2603.13450)
Keywords: diffusion, generative
Abstract: Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that LADR achieves an approximately 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.

Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Authors: Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13459
Pdf URL: https://arxiv.org/pdf/2603.13459
Copy Paste: [[2603.13459]] Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding(https://arxiv.org/abs/2603.13459)
Keywords: in-context
Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model's inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. Code: this https URL

Title: Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference

Authors: Jianwei Li, Jung-Eun Kim
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13461
Pdf URL: https://arxiv.org/pdf/2603.13461
Copy Paste: [[2603.13461]] Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference(https://arxiv.org/abs/2603.13461)
Keywords: generative
Abstract: Backdoor attacks pose severe security threats to large language models (LLMs), where a model behaves normally under benign inputs but produces malicious outputs when a hidden trigger appears. Existing backdoor removal methods typically assume prior knowledge of triggers or access to a clean reference model, or rely on aggressive finetuning configurations, and are often limited to classification tasks. However, such assumptions fall apart in real-world instruction-tuned LLM settings. In this work, we propose a new framework for purifying instruction-tuned LLMs without any prior trigger knowledge or clean references. Through systematic sanity checks, we find that backdoor associations are redundantly encoded across MLP layers, while attention modules primarily amplify trigger signals without establishing the behavior. Leveraging this insight, we shift the focus from isolating specific backdoor triggers to cutting off the trigger-behavior associations, and design an immunization-inspired elimination approach: we construct multiple synthetic backdoored variants of the given suspicious model, each trained with different malicious trigger-behavior pairs, and contrast them with their clean counterparts. The recurring modifications across variants reveal a shared "backdoor signature", analogous to antigens in a virus. Guided by this signature, we neutralize highly suspicious components in the LLM and apply lightweight finetuning to restore its fluency, producing purified models that withstand diverse backdoor attacks and threat models while preserving generative capability.

Title: Synthetic Melanoma Image Generation and Evaluation Using Generative Adversarial Networks

Authors: Pei-Yu Lin, Yidan Shen, Neville Mathew, Renjie Hu, Siyu Huang, Courtney M. Queen, Cameron E. West, Ana Ciurea, George Zouridakis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13497
Pdf URL: https://arxiv.org/pdf/2603.13497
Copy Paste: [[2603.13497]] Synthetic Melanoma Image Generation and Evaluation Using Generative Adversarial Networks(https://arxiv.org/abs/2603.13497)
Keywords: generative
Abstract: Melanoma is the most lethal form of skin cancer, and early detection is critical for improving patient outcomes. Although dermoscopy combined with deep learning has advanced automated skin-lesion analysis, progress is hindered by limited access to large, well-annotated datasets and by severe class imbalance, where melanoma images are substantially underrepresented. To address these challenges, we present the first systematic benchmarking study comparing four GAN architectures, namely DCGAN, StyleGAN2, and two StyleGAN3 variants (T/R), for high-resolution melanoma-specific synthesis. We train and optimize all models on two expert-annotated benchmarks (ISIC 2018 and ISIC 2020) under unified preprocessing and hyperparameter exploration, with particular attention to R1 regularization tuning. Image quality is assessed through a multi-faceted protocol combining distribution-level metrics (FID), sample-level representativeness (FMD), qualitative dermoscopic inspection, downstream classification with a frozen EfficientNet-based melanoma detector, and independent evaluation by two board-certified dermatologists. StyleGAN2 achieves the best balance of quantitative performance and perceptual quality, attaining FID scores of 24.8 (ISIC 2018) and 7.96 (ISIC 2020) at gamma=0.8. The frozen classifier recognizes 83% of StyleGAN2-generated images as melanoma, while dermatologists distinguish synthetic from real images at only 66.5% accuracy (chance = 50%), with low inter-rater agreement (kappa = 0.17). In a controlled augmentation experiment, adding synthetic melanoma images to address class imbalance improved melanoma detection AUC from 0.925 to 0.945 on a held-out real-image test set. These findings demonstrate that StyleGAN2-generated melanoma images preserve diagnostically relevant features and can provide a measurable benefit for mitigating class imbalance in melanoma-focused machine learning pipelines.

Title: ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning

Authors: Eric Nazarenus, Chuqiao Li, Yannan He, Xianghui Xie, Jan Eric Lenssen, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13500
Pdf URL: https://arxiv.org/pdf/2603.13500
Copy Paste: [[2603.13500]] ActionPlan: Future-Aware Streaming Motion Synthesis via Frame-Level Action Planning(https://arxiv.org/abs/2603.13500)
Keywords: diffusion
Abstract: We present ActionPlan, a unified motion diffusion framework that bridges real-time streaming with high-quality offline generation within a single model. The core idea is to introduce a per-frame action plan: the model predicts frame-level text latents that act as dense semantic anchors throughout denoising, and uses them to denoise the full motion sequence with combined semantic and motion cues. To support this structured workflow, we design latent-specific diffusion steps, allowing each motion latent to be denoised independently and sampled in flexible orders at inference. As a result, ActionPlan can run in a history-conditioned, future-aware mode for real-time streaming, while also supporting high-quality offline generation. The same mechanism further enables zero-shot motion editing and in-betweening without additional models. Experiments demonstrate that our real-time streaming is 5.25x faster while also achieving 18% motion quality improvement over the best previous method in terms of FID.

Title: LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Authors: Jiahao Zhu, Shanshan Lao, Lijie Liu, Gen Li, Tianhao Qi, Wei Han, Bingchuan Li, Fangfang Liu, Zhuowei Chen, Tianxiang Ma, Qian HE, Yi Zhou, Xiaohua Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13506
Pdf URL: https://arxiv.org/pdf/2603.13506
Copy Paste: [[2603.13506]] LibraGen: Playing a Balance Game in Subject-Driven Video Generation(https://arxiv.org/abs/2603.13506)
Keywords: foundation model
Abstract: With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.

Title: MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection

Authors: Jinwei Hu, Francesco Borsatti, Arianna Stropeni, Davide Dalle Pezze, Manuel Barusco, Gian Antonio Susto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13507
Pdf URL: https://arxiv.org/pdf/2603.13507
Copy Paste: [[2603.13507]] MIRAGE: Model-agnostic Industrial Realistic Anomaly Generation and Evaluation for Visual Anomaly Detection(https://arxiv.org/abs/2603.13507)
Keywords: generative, anomaly
Abstract: Industrial visual anomaly detection (VAD) methods are typically trained on normal samples only, yet performance improves substantially when even limited anomalous data is available. Existing anomaly generation approaches either require real anomalous examples, demand expensive hardware, or produce synthetic defects that lack realism. We present MIRAGE (Model-agnostic Industrial Realistic Anomaly Generation and Evaluation), a fully automated pipeline for realistic anomalous image generation and pixel-level mask creation that requires no training and no anomalous images. Our pipeline accesses any generative model as a black box via API calls, uses a VLM for automatic defect prompt generation, and includes a CLIP-based quality filter to retain only well-aligned generated images. For mask generation at scale, we introduce a lightweight, training-free dual-branch semantic change detection module combining text-conditioned Grounding DINO features with fine-grained YOLOv26-Seg structural features. We benchmark four generation methods using Gemini 2.5 Flash Image (Nano Banana) as the generative backbone, evaluating performance on MVTec AD and VisA across two distinct tasks: (i) downstream anomaly segmentation and (ii) visual quality of the generated images, assessed via standard metrics (IS, IC-LPIPS) and a human perceptual study involving 31 participants and 1,550 pairwise votes. The results demonstrate that MIRAGE offers a scalable, accessible foundation for anomaly-aware industrial inspection that requires no real defect data. As a final contribution, we publicly release a large-scale dataset comprising 500 image-mask pairs per category for every MVTec AD and VisA class, over 13,000 pairs in total, alongside all generation prompts and pipeline code.

Title: A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis

Authors: Alessandro Pesci, Valerio Guarrasi, Marco Alì, Isabella Castiglioni, Paolo Soda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13520
Pdf URL: https://arxiv.org/pdf/2603.13520
Copy Paste: [[2603.13520]] A Systematic Benchmark of GAN Architectures for MRI-to-CT Synthesis(https://arxiv.org/abs/2603.13520)
Keywords: generative
Abstract: The translation from magnetic resonance imaging (MRI) to computed tomography (CT) has been proposed as an effective solution to facilitate MRI-only clinical workflows while limiting exposure to ionizing radiation. Although numerous Generative Adversarial Network (GAN) architectures have been proposed for MRI-to-CT translation, systematic and fair comparisons across heterogeneous models remain limited. We present a comprehensive benchmark of ten GAN architectures evaluated on the SynthRAD2025 dataset across three anatomical districts (abdomen, thorax, head-and-neck). All models were trained under a unified validation protocol with identical preprocessing and optimization settings. Performance was assessed using complementary metrics capturing voxel-wise accuracy, structural fidelity, perceptual quality, and distribution-level realism, alongside an analysis of computational complexity. Supervised Paired models consistently outperformed Unpaired approaches, confirming the importance of voxel-wise supervision. Pix2Pix achieved the most balanced performance across districts while maintaining a favorable quality-to-complexity trade-off. Multi-district training improved structural robustness, whereas intra-district training maximized voxel-wise fidelity. This benchmark provides quantitative and computational guidance for model selection in MRI-only radiotherapy workflows and establishes a reproducible framework for future comparative studies. To ensure the reproducibility of our experiments, we make our code public, together with the overall results, at the following link: this https URL

Title: Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization

Authors: Eshed Gal, Samy Wu Fung, Eldad Haber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13546
Pdf URL: https://arxiv.org/pdf/2603.13546
Copy Paste: [[2603.13546]] Probabilistic Gaussian Homotopy: A Probability-Space Continuation Framework for Nonconvex Optimization(https://arxiv.org/abs/2603.13546)
Keywords: diffusion
Abstract: We introduce Probabilistic Gaussian Homotopy (PGH), a probability-space continuation framework for nonconvex optimization. Unlike classical Gaussian homotopy, which smooths the objective and uniformly averages gradients, PGH deforms the associated Boltzmann distribution and induces Boltzmann-weighted aggregation of perturbed gradients, which exponentially biases descent directions toward low-energy regions. We show that PGH corresponds to a log-sum-exp (soft-min) homotopy that smooths a nonconvex objective at scale $\lambda>0$ and recovers the original objective as $\lambda\to 0$, yielding a posterior-mean generalization of the Moreau envelope, and we derive a dynamical system governing minimizer evolution along an annealed homotopy path. This establishes a principled connection between Gaussian continuation, Bayesian denoising, and diffusion-style smoothing. We further propose Probabilistic Gaussian Homotopy Optimization (PGHO), a practical stochastic algorithm based on Monte Carlo gradient estimation, and demonstrate strong performance on high-dimensional nonconvex benchmarks and sparse recovery problems where classical gradient methods and objective-space smoothing frequently fail.
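The contrast this abstract draws between uniform averaging and Boltzmann-weighted aggregation of perturbed gradients can be sketched as follows. This is an illustrative toy, not the PGHO algorithm: the objective `f`, the perturbation scale `sigma`, the `temperature` parameter, and the specific weighting `exp(-f(y)/temperature)` are assumptions chosen for the example, and the paper's actual estimator may differ.

```python
import numpy as np

def f(y):
    """Toy nonconvex objective (hypothetical), summed over the last axis."""
    return np.sum(y**2 + 2.0 * np.sin(3.0 * y), axis=-1)

def grad_f(y):
    """Analytic gradient of the toy objective."""
    return 2.0 * y + 6.0 * np.cos(3.0 * y)

def perturbed_gradients(x, sigma, n, rng):
    """Gradients evaluated at n Gaussian perturbations of x."""
    ys = x + sigma * rng.standard_normal((n, x.size))
    return ys, grad_f(ys)

def uniform_average(grads):
    """Classical Gaussian-smoothing estimate: plain mean of the gradients."""
    return grads.mean(axis=0)

def boltzmann_weights(values, temperature):
    """Normalized weights proportional to exp(-value / temperature)."""
    z = -(values - values.min()) / temperature  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def boltzmann_average(ys, grads, temperature):
    """Gradient estimate that upweights low-energy perturbations."""
    w = boltzmann_weights(f(ys), temperature)
    return w @ grads

rng = np.random.default_rng(0)
x = np.array([1.5, -0.5])
ys, grads = perturbed_gradients(x, sigma=0.5, n=256, rng=rng)
g_uniform = uniform_average(grads)
g_boltzmann = boltzmann_average(ys, grads, temperature=0.5)
```

In this sketch, a very large temperature makes the weights nearly uniform and recovers the plain average, while a small temperature concentrates the estimate on perturbations with low objective values.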

Title: NumColor: Precise Numeric Color Control in Text-to-Image Generation

Authors: Muhammad Atif Butt, Diego Hernandez, Alexandra Gomez-Villa, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13547
Pdf URL: https://arxiv.org/pdf/2603.13547
Copy Paste: [[2603.13547]] NumColor: Precise Numeric Color Control in Text-to-Image Generation(https://arxiv.org/abs/2603.13547)
Keywords: diffusion
Abstract: Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, which enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook containing 6,707 learnable embeddings that map colors, represented in the perceptually uniform CIE Lab space, to the embedding space of the text encoder. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation. NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.

Title: Scalable Classification of Course Information Sheets Using Large Language Models: A Reusable Institutional Method for Academic Quality Assurance

Authors: Brecht Verbeken, Joke Van den Broeck, Inge De Cleyn, Steven Van Luchene, Nadine Engels, Andres Algaba, Vincent Ginis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13562
Pdf URL: https://arxiv.org/pdf/2603.13562
Copy Paste: [[2603.13562]] Scalable Classification of Course Information Sheets Using Large Language Models: A Reusable Institutional Method for Academic Quality Assurance(https://arxiv.org/abs/2603.13562)
Keywords: generative
Abstract: Purpose: Higher education institutions face increasing pressure to audit course designs for generative AI (GenAI) integration. This paper presents an end-to-end method for using large language models (LLMs) to scan course information sheets at scale, identify where assessments may be vulnerable to student use of GenAI tools, validate system performance through iterative refinement, and operationalise results through direct stakeholder communication and effort. Method: We developed a four-phase pipeline: (0) manual pilot sampling, (1) iterative prompt engineering with multi-model comparison, (2) full production scan of 4,684 Bachelor and Master course information sheets (Academic Year 2024-2025) from the Vrije Universiteit Brussel (VUB) with automated report generation and email distribution to teaching teams (91.4% address-matched) using a three-tier risk taxonomy (Clear risk, Potential risk, Low risk), and (3) longitudinal re-scan of 4,675 sheets after the next catalogue release. Results: Five iterations of prompt refinement achieved 87% agreement with expert labels. GPT-4o was selected for production based on superior handling of ambiguous cases involving internships and practical components. The Year 1 scan classified 60.3% of courses as Clear risk, 15.2% as Potential risk, and 24.5% as Low risk. Year 2 comparison revealed substantial shifts in risk distributions, with improvements most pronounced in practice-oriented programmes. Implications: The method enables institutions to rapidly transform heterogeneous catalogue data into structured and actionable intelligence. The approach is transferable to other audit domains (sustainability, accessibility, pedagogical alignment) and provides a template for responsible LLM deployment in higher education governance.

Title: Privacy-Preserving Machine Learning for IoT: A Cross-Paradigm Survey and Future Roadmap

Authors: Zakia Zaman, Praveen Gauravaram, Mahbub Hassan, Sanjay Jha, Wen Hu
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2603.13570
Pdf URL: https://arxiv.org/pdf/2603.13570
Copy Paste: [[2603.13570]] Privacy-Preserving Machine Learning for IoT: A Cross-Paradigm Survey and Future Roadmap(https://arxiv.org/abs/2603.13570)
Keywords: generative
Abstract: The rapid proliferation of the Internet of Things has intensified demand for robust privacy-preserving machine learning mechanisms to safeguard sensitive data generated by large-scale, heterogeneous, and resource-constrained devices. Unlike centralized environments, IoT ecosystems are inherently decentralized, bandwidth-limited, and latency-sensitive, exposing privacy risks across sensing, communication, and distributed training pipelines. These characteristics render conventional anonymization and centralized protection strategies insufficient for practical deployments. This survey presents a comprehensive IoT-centric, cross-paradigm analysis of privacy-preserving machine learning. We introduce a structured taxonomy spanning perturbation-based mechanisms such as differential privacy, distributed paradigms such as federated learning, cryptographic approaches including homomorphic encryption and secure multiparty computation, and generative synthesis techniques based on generative adversarial networks. For each paradigm, we examine formal privacy guarantees, computational and communication complexity, scalability under heterogeneous device participation, and resilience against threats including membership inference, model inversion, gradient leakage, and adversarial manipulation. We further analyze deployment constraints in wireless IoT environments, highlighting trade-offs between privacy, communication overhead, model convergence, and system efficiency within next-generation mobile architectures. We also consolidate evaluation methodologies, summarize representative datasets and open-source frameworks, and identify open challenges including hybrid privacy integration, energy-aware learning, privacy-preserving large language models, and quantum-resilient machine learning.

Title: DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models

Authors: Xiaoqiong Liu, Heng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13571
Pdf URL: https://arxiv.org/pdf/2603.13571
Copy Paste: [[2603.13571]] DiveUp: Learning Feature Upsampling from Diverse Vision Foundation Models(https://arxiv.org/abs/2603.13571)
Keywords: foundation model
Abstract: Recently, feature upsampling has gained increasing attention owing to its effectiveness in enhancing vision foundation models (VFMs) for pixel-level understanding tasks. Existing methods typically rely on high-resolution features from the same foundation model to achieve upsampling via self-reconstruction. However, relying solely on intra-model features forces the upsampler to overfit to the source model's inherent location misalignment and high-norm artifacts. To address this fundamental limitation, we propose DiveUp, a novel framework that breaks away from single-model dependency by introducing multi-VFM relational guidance. Instead of naive feature fusion, DiveUp leverages diverse VFMs as a panel of experts, utilizing their structural consensus to regularize the upsampler's learning process, effectively preventing the propagation of inaccurate spatial structures from the source model. To reconcile the unaligned feature spaces across different VFMs, we propose a universal relational feature representation, formulated as a local center-of-mass (COM) field, that extracts intrinsic geometric structures, enabling seamless cross-model interaction. Furthermore, we introduce a spikiness-aware selection strategy that evaluates the spatial reliability of each VFM, effectively filtering out high-norm artifacts to aggregate guidance from only the most reliable expert at each local region. DiveUp is a unified, encoder-agnostic framework; a jointly-trained model can universally upsample features from diverse VFMs without requiring per-model retraining. Extensive experiments demonstrate that DiveUp achieves state-of-the-art performance across various downstream dense prediction tasks, validating the efficacy of multi-expert relational guidance. Our code and models are available at: this https URL

Title: Privacy-Preserving Federated Fraud Detection in Payment Transactions with NVIDIA FLARE

Authors: Holger R. Roth, Sarthak Tickoo, Mayank Kumar, Isaac Yang, Andrew Liu, Amit Varshney, Sayani Kundu, Iustina Vintila, Peter Madsgaard, Juraj Milcak, Chester Chen, Yan Cheng, Andrew Feng, Jeff Savio, Vikram Singh, Craig Stancill, Gloria Wan, Evan Powell, Anwar Ul Haq, Sudhir Upadhyay, Jisoo Lee
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2603.13617
Pdf URL: https://arxiv.org/pdf/2603.13617
Copy Paste: [[2603.13617]] Privacy-Preserving Federated Fraud Detection in Payment Transactions with NVIDIA FLARE(https://arxiv.org/abs/2603.13617)
Keywords: anomaly
Abstract: Fraud-related financial losses continue to rise, while regulatory, privacy, and data-sovereignty constraints increasingly limit the feasibility of centralized fraud detection systems. Federated Learning (FL) has emerged as a promising paradigm for enabling collaborative model training across institutions without sharing raw transaction data. Yet, its practical effectiveness under realistic, non-IID financial data distributions remains insufficiently validated. In this work, we present a multi-institution, industry-oriented proof-of-concept study evaluating federated anomaly detection for payment transactions using the NVIDIA FLARE framework. We simulate a realistic federation of heterogeneous financial institutions, each observing distinct fraud typologies and operating under strict data isolation. Using a deep neural network trained via federated averaging (FedAvg), we demonstrate that federated models achieve a mean F1-score of 0.903, substantially outperforming locally trained models (0.643) and closely approaching centralized training performance (0.925), while preserving full data sovereignty. We further analyze convergence behavior, showing that strong performance is achieved within 10 federated communication rounds, highlighting the operational viability of FL in latency- and cost-sensitive financial environments. To support deployment in regulated settings, we evaluate model interpretability using Shapley-based feature attribution and confirm that federated models rely on semantically coherent, domain-relevant decision signals. Finally, we incorporate sample-level differential privacy via DP-SGD and demonstrate favorable privacy-utility trade-offs...
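
The federated averaging (FedAvg) step named in the abstract can be illustrated with a minimal sketch. It assumes the standard sample-size-weighted average of client parameters; the function name `fedavg`, the flat-list parameter layout, and the example numbers are hypothetical, and nothing here uses or reflects the NVIDIA FLARE API or the authors' training setup.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts.

    Illustrative only: parameters are flat lists of floats, one per client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n_samples in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share one parameter shape")
        for i, w in enumerate(weights):
            # Each client contributes in proportion to its data volume.
            averaged[i] += w * n_samples / total
    return averaged


# Two hypothetical institutions: one with 1 sample, one with 3.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)  # [2.5, 3.5]
```

Only the parameter vectors and sample counts cross institutional boundaries in this sketch; raw transactions never leave a client, which is the data-isolation property the abstract emphasizes.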

Title: SemRep: Generative Code Representation Learning with Code Transformations

Authors: Weichen Li, Jiamin Song, Bogdan Alexandru Stoica, Arav Dhoot, Gabriel Ryan, Shengyu Fu, Kexin Pei
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2603.13640
Pdf URL: https://arxiv.org/pdf/2603.13640
Copy Paste: [[2603.13640]] SemRep: Generative Code Representation Learning with Code Transformations(https://arxiv.org/abs/2603.13640)
Keywords: generative
Abstract: Code transformation is a foundational capability in the software development process, where its effectiveness relies on constructing a high-quality code representation to characterize the input code semantics and guide the transformation. Existing approaches treat code transformation as an end-to-end learning task, leaving the construction of the representation needed for semantic reasoning implicit in model weights or relying on rigid compiler-level abstractions. We present SemRep, a framework that improves code transformation through generative code representation learning. Our key insight is to employ the semantics-preserving transformations as the intermediate representation, which serves as both a generative mid-training task and the guidance for subsequent instruction-specific code transformations. Across general code editing and optimization tasks (e.g., GPU kernel optimization), SemRep outperforms the extensively finetuned baselines with strictly the same training budget by 6.9% in correctness, 1.1x in performance, 13.9% in generalization, and 6.7% in robustness. With the improved exploration of diverse code transformations, SemRep is particularly amenable to evolutionary search. Combined with an evolutionary coding agent, SemRep finds optimizations that 685B larger-weight baselines fail to discover while achieving the same performance with 25% less inference compute.

Title: PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization

Authors: Swadhin Pradhan, Shazal Irshad, Jerome Henry
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2603.13647
Pdf URL: https://arxiv.org/pdf/2603.13647
Copy Paste: [[2603.13647]] PLUME: Building a Network-Native Foundation Model for Wireless Traces via Protocol-Aware Tokenization(https://arxiv.org/abs/2603.13647)
Keywords: foundation model, anomaly
Abstract: Foundation models succeed when they learn in the native structure of a modality, whether morphology-respecting tokens in language or pixels in vision. Wireless packet traces deserve the same treatment: meaning emerges from layered headers, typed fields, timing gaps, and cross-packet state machines, not flat strings. We present Plume (Protocol Language Understanding Model for Exchanges), a compact 140M-parameter foundation model for 802.11 traces that learns from structured PDML dissections. A protocol-aware tokenizer splits along the dissector field tree, emits gap tokens for timing, and normalizes identifiers, yielding 6.2x shorter sequences than BPE with higher per-token information density. Trained on a curated corpus, Plume achieves 74-97% next-packet token accuracy across five real-world failure categories and AUROC >= 0.99 for zero-shot anomaly detection. On the same prediction task, frontier LLMs (Claude Opus 4.6, GPT-5.4) score comparably despite receiving identical protocol context, yet Plume does so with > 600x fewer parameters, fitting on a single GPU at effectively zero marginal cost vs. cloud API pricing, enabling on-prem, privacy-preserving root cause analysis.

Title: FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures

Authors: Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Viraj Shah, Ramez Hajj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13659
Pdf URL: https://arxiv.org/pdf/2603.13659
Copy Paste: [[2603.13659]] FMS$^2$: Unified Flow Matching for Segmentation and Synthesis of Thin Structures(https://arxiv.org/abs/2603.13659)
Keywords: generative
Abstract: Segmenting thin structures like infrastructure cracks and anatomical vessels is a task hampered by topology-sensitive geometry, high annotation costs, and poor generalization across domains. Existing methods address these challenges in isolation. We propose FMS$^2$, a flow-matching framework with two modules. (1) SegFlow is a 2.96M-parameter segmentation model built on a standard encoder-decoder backbone that recasts prediction as continuous image $\rightarrow$ mask transport. It learns a time-indexed velocity field with a flow-matching regression loss and outputs the mask via ODE integration, rather than supervising only end-state logits. This trajectory-level supervision improves thin-structure continuity and sharpness, compared with tuned topology-aware loss baselines, without auxiliary topology heads, post-processing, or multi-term loss engineering. (2) SynFlow is a mask-conditioned mask $\rightarrow$ image generator that produces pixel-aligned synthetic image-mask pairs. It injects mask geometry at multiple scales and emphasizes boundary bands via edge-aware gating, while a controllable mask generator expands sparsity, width, and branching regimes. On five crack and vessel benchmarks, SegFlow alone outperforms strong CNN, Transformer, Mamba, and generative baselines, improving the volumetric metric (mean IoU) from 0.511 to 0.599 (+17.2%) and reducing the topological metric (Betti matching error) from 82.145 to 51.524 (-37.3%). When training with limited labels, augmenting SegFlow with SynFlow-generated pairs recovers near-full performance using 25% of real annotations and improves cross-domain IoU by 0.11 on average. Unlike classical data augmentation that promotes invariance via label-preserving transforms, SynFlow provides pixel-aligned paired supervision with controllable structural shifts (e.g., sparsity, width, branching), which is particularly effective under domain shift.
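
The relative improvements quoted in the abstract follow from the reported before/after values. A short sketch of that arithmetic, with a hypothetical helper name `relative_change_percent`:

```python
def relative_change_percent(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values as reported in the abstract.
miou_gain = relative_change_percent(0.511, 0.599)
betti_drop = relative_change_percent(82.145, 51.524)

print(f"mean IoU: {miou_gain:+.1f}%")               # mean IoU: +17.2%
print(f"Betti matching error: {betti_drop:+.1f}%")  # Betti matching error: -37.3%
```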

Title: Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

Authors: Yunhe Gao, Yabin Zhang, Chong Wang, Jiaming Liu, Maya Varma, Jean-Benoit Delbrouck, Akshay Chaudhari, Curtis Langlotz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13660
Pdf URL: https://arxiv.org/pdf/2603.13660
Copy Paste: [[2603.13660]] Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision(https://arxiv.org/abs/2603.13660)
Keywords: self-supervised, foundation model, in-context
Abstract: Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available: this https URL.

Title: PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers

Authors: Eshed Gal, Moshe Eliasof, Siddharth Rout, Eldad Haber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13663
Pdf URL: https://arxiv.org/pdf/2603.13663
Copy Paste: [[2603.13663]] PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers(https://arxiv.org/abs/2603.13663)
Keywords: diffusion, generative
Abstract: The success of vision transformers, especially for generative modeling, is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation. This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear complexity of $O(N \log N)$, delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain the PDE-based Diffusion Transformer PDE-SSM-DiT. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show that, analogous to 1D settings where SSMs supplant attention, multi-dimensional PDE operators provide an efficient, inductive-bias-rich foundation for next-generation vision models.
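
To illustrate why solving such a PDE in the Fourier domain costs $O(N \log N)$, here is a minimal sketch for a 1D periodic grid. It is not the authors' operator: it assumes fixed scalar coefficients (`velocity`, `diffusivity`, `reaction`, all hypothetical names) for the equation u_t = -velocity * u_x + diffusivity * u_xx - reaction * u, whereas the paper's PDE is learnable and multi-dimensional. Each Fourier mode evolves independently, so the whole update is one FFT, a pointwise multiply, and one inverse FFT.

```python
import cmath
import math
from typing import List


def fft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized, power-of-two length)."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cdr_step(u: List[float], velocity: float, diffusivity: float,
             reaction: float, t: float,
             length: float = 2 * math.pi) -> List[float]:
    """Advance u by time t under the constant-coefficient PDE described above."""
    n = len(u)
    u_hat = fft([complex(v) for v in u])
    for j in range(n):
        mode = j if j < n // 2 else j - n      # signed mode index
        k = 2 * math.pi * mode / length        # angular wavenumber
        # Fourier symbol: d/dx -> i*k, d^2/dx^2 -> -k^2.
        symbol = -1j * velocity * k - diffusivity * k * k - reaction
        u_hat[j] *= cmath.exp(symbol * t)
    # Inverse transform; divide by n because fft() is unnormalized.
    return [v.real / n for v in fft(u_hat, inverse=True)]


n = 16
grid = [2 * math.pi * j / n for j in range(n)]
u0 = [math.sin(x) for x in grid]
# Pure diffusion damps the sin(x) mode by exp(-diffusivity * t).
u1 = cdr_step(u0, velocity=0.0, diffusivity=0.5, reaction=0.0, t=1.0)
```

Every output sample depends on every input sample through the transform, which is the "global coupling" the abstract refers to, without forming an N-by-N interaction matrix.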

Title: SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment

Authors: Mahdi Naseri, Zhou Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13669
Pdf URL: https://arxiv.org/pdf/2603.13669
Copy Paste: [[2603.13669]] SHAMISA: SHAped Modeling of Implicit Structural Associations for Self-supervised No-Reference Image Quality Assessment(https://arxiv.org/abs/2603.13669)
Keywords: self-supervised
Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.

Title: RSEdit: Text-Guided Image Editing for Remote Sensing

Authors: Chen Zhenyuan, Zhang Zechuan, Zhang Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13708
Pdf URL: https://arxiv.org/pdf/2603.13708
Copy Paste: [[2603.13708]] RSEdit: Text-Guided Image Editing for Remote Sensing(https://arxiv.org/abs/2603.13708)
Keywords: diffusion, in-context
Abstract: General-domain text-guided image editors achieve strong photorealism but introduce artifacts, hallucinate objects, and break the orthographic constraints of remote sensing (RS) imagery. We trace this gap to two high-level causes: (i) limited RS world knowledge in pre-trained models, and (ii) conditioning schemes that misalign with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models (both U-Net and DiT) into instruction-following RS editors via channel concatenation and in-context token concatenation. Trained on over 60,000 semantically rich bi-temporal remote sensing image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general and commercial baselines, demonstrating strong generalizability across diverse scenarios including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: this https URL

Title: Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality

Authors: Taiqiang Wu, Yuxin Cheng, Chenchen Ding, Runming Yang, Xincheng Feng, Wenyong Zhou, Zhengwu Liu, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13725
Pdf URL: https://arxiv.org/pdf/2603.13725
Copy Paste: [[2603.13725]] Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality(https://arxiv.org/abs/2603.13725)
Keywords: in-context
Abstract: Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.

Title: Ransomware and Artificial Intelligence: A Comprehensive Systematic Review of Reviews

Authors: Therdpong Daengsi, Phisit Pornpongtechavanich, Paradorn Boonpoor, Kathawut Wattanachukul, Korn Puangnak, Kritphon Phanrattanachai, Pongpisit Wuttidittachotti, Paramate Horkaew
Subjects: cs.CR, eess.SP
Abstract URL: https://arxiv.org/abs/2603.13734
Pdf URL: https://arxiv.org/pdf/2603.13734
Copy Paste: [[2603.13734]] Ransomware and Artificial Intelligence: A Comprehensive Systematic Review of Reviews(https://arxiv.org/abs/2603.13734)
Keywords: anomaly
Abstract: This study provides a comprehensive synthesis of Artificial Intelligence (AI), especially Machine Learning (ML) and Deep Learning (DL), in ransomware defense. Using a "review of reviews" methodology based on PRISMA, this paper gathers insights on how AI is transforming ransomware detection, prevention, and mitigation strategies during the past five years (2020-2024). The findings highlight the effectiveness of hybrid models that combine multiple analysis techniques such as code inspection (static analysis) and behavior monitoring during execution (dynamic analysis). The study also explores anomaly detection and early warning mechanisms before encryption to address the increasing complexity of ransomware. In addition, it examines key challenges in ransomware defense, including techniques designed to deceive AI-driven detection systems and the lack of strong and diverse datasets. The results highlight the role of AI in early detection and real-time response systems, improving scalability and resilience. Using a systematic review-of-reviews approach, this study consolidates insights from multiple review articles, identifies effective AI models, and bridges theory with practice to support collaboration among academia, industry, and policymakers. Future research directions and practical recommendations for cybersecurity practitioners are also discussed. Finally, this paper proposes a roadmap for advancing AI-driven countermeasures to protect critical systems and infrastructures against evolving ransomware threats.

Title: UniVid: Pyramid Diffusion Model for High Quality Video Generation

Authors: Xinyu Xiao, Binbin Yang, Tingtian Li, Yipeng Yu, Sen Lei
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2603.13739
Pdf URL: https://arxiv.org/pdf/2603.13739
Copy Paste: [[2603.13739]] UniVid: Pyramid Diffusion Model for High Quality Video Generation(https://arxiv.org/abs/2603.13739)
Keywords: diffusion, generative
Abstract: Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, integrating the two generative paradigms into a unified model remains a challenge. In this paper, we present a unified video generation model (UniVid) with hybrid conditions of the text prompt and reference image. Given these two available controls, our model can extract objects' appearance and their motion descriptions from textual prompts, while obtaining texture details and structural information from image cues to guide the video generation process. Specifically, we scale up the pre-trained text-to-image diffusion model to generate temporally coherent frames by introducing our temporal-pyramid cross-frame spatial-temporal attention modules and convolutions. To support bimodal control, we introduce a dual-stream cross-attention mechanism, whose attention scores can be freely re-weighted to interpolate between single-modality and two-modality control during inference. Extensive experiments show that UniVid achieves superior temporal coherence on T2V, I2V and (T+I)2V tasks.

Title: Multi-Object Advertisement Creative Generation

Authors: Jialu Gao, Mithun Das Gupta, Qun Li, Raveena Kshatriya, Andrew D. Wilson, Keng-hao Chang, Balasaravanan Thoravi Kumaravel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13745
Pdf URL: https://arxiv.org/pdf/2603.13745
Copy Paste: [[2603.13745]] Multi-Object Advertisement Creative Generation(https://arxiv.org/abs/2603.13745)
Keywords: generative
Abstract: Lifestyle images are photographs that capture environments and objects in everyday settings. In furniture product marketing, advertisers often create lifestyle images containing products to resonate with potential buyers, allowing buyers to visualize how the products fit into their daily lives. While recent advances in Generative Artificial Intelligence (GenAI) have given rise to realistic image content creation, their application in e-commerce advertising is challenging because high-quality ads must authentically represent the products in realistic scenarios. Therefore, manual intervention is usually required for individual generations, making it difficult to scale to larger product catalogs. To understand the challenges faced by advertisers using GenAI to create lifestyle images at scale, we conducted evaluations on ad images generated using state-of-the-art image generation models and identified the major challenges. Based on our findings, we present CreativeAds, a multi-product ad creation system that supports scalable automated generation with customized parameter adjustment for individual generation. To ensure automated high-quality ad generation, CreativeAds introduces a pipeline that consists of three modules to address challenges in product pairing, layout generation, and background generation separately. Furthermore, CreativeAds contains an intuitive user interface to allow users to oversee generation at scale, and it also supports detailed controls on individual generation for user-customized adjustments. We performed a user study on CreativeAds and extensive evaluations of the generated images, demonstrating CreativeAds's ability to create a large number of high-quality images at scale for advertisers without requiring expertise in GenAI tools.

Title: Manifold-Orthogonal Dual-spectrum Extrapolation for Parameterized Physics-Informed Neural Networks

Authors: Zhangyong Liang, Ji Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13751
Pdf URL: https://arxiv.org/pdf/2603.13751
Copy Paste: [[2603.13751]] Manifold-Orthogonal Dual-spectrum Extrapolation for Parameterized Physics-Informed Neural Networks(https://arxiv.org/abs/2603.13751)
Keywords: diffusion
Abstract: Physics-informed neural networks (PINNs) have achieved notable success in modeling dynamical systems governed by partial differential equations (PDEs). To avoid computationally expensive retraining under new physical conditions, parameterized PINNs (P$^2$INNs) commonly adapt pre-trained operators using singular value decomposition (SVD) for out-of-distribution (OOD) regimes. However, SVD-based fine-tuning often suffers from rigid subspace locking and truncation of important high-frequency spectral modes, limiting its ability to capture complex physical transitions. While parameter-efficient fine-tuning (PEFT) methods appear to be promising alternatives, applying conventional adapters such as LoRA to P$^2$INNs introduces a severe Pareto trade-off, as additive updates increase parameter overhead and disrupt the structured physical manifolds inherent in operator representations. To address these limitations, we propose Manifold-Orthogonal Dual-spectrum Extrapolation (MODE), a lightweight micro-architecture designed for physics operator adaptation. MODE decomposes physical evolution into complementary mechanisms including principal-spectrum dense mixing that enables cross-modal energy transfer within frozen orthogonal bases, residual-spectrum awakening that activates high-frequency spectral components through a single trainable scalar, and affine Galilean unlocking that explicitly isolates spatial translation dynamics. Experiments on challenging PDE benchmarks including the 1D Convection-Diffusion-Reaction equation and the 2D Helmholtz equation demonstrate that MODE achieves strong out-of-distribution generalization while preserving the minimal parameter complexity of native SVD and outperforming existing PEFT-based baselines.

Title: PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment

Authors: Zhexiao Xiong, Yizhi Song, Liu He, Wei Xiong, Yu Yuan, Feng Qiao, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13770
Pdf URL: https://arxiv.org/pdf/2603.13770
Copy Paste: [[2603.13770]] PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment(https://arxiv.org/abs/2603.13770)
Keywords: diffusion, foundation model
Abstract: Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational alignment that extracts kinematic priors from video foundation models. Extensive experiments demonstrate that PhysAlign significantly outperforms existing VDMs on tasks requiring complex physical reasoning and temporal stability, without compromising zero-shot visual quality. PhysAlign shows the potential to bridge the gap between raw visual synthesis and rigid-body kinematics, establishing a practical paradigm for genuinely physics-grounded video generation. The project page is available at this https URL.

Title: AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Authors: Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13779
Pdf URL: https://arxiv.org/pdf/2603.13779
Copy Paste: [[2603.13779]] AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison(https://arxiv.org/abs/2603.13779)
Keywords: anomaly, in-context
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.

Title: Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery

Authors: Bohan Zhang, Weidong Tang, Zhixiang Chi, Yi Jin, Zhenbo Li, Yang Wang, Yanan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13858
Pdf URL: https://arxiv.org/pdf/2603.13858
Copy Paste: [[2603.13858]] Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery(https://arxiv.org/abs/2603.13858)
Keywords: diffusion
Abstract: On-the-Fly Category Discovery (OCD) aims to recognize known classes while simultaneously discovering emerging novel categories during inference, using supervision only from known classes during offline training. Existing approaches rely either on fixed label supervision or on diffusion-based augmentations to enhance the backbone, yet none of them explicitly train the model to perform the discovery task required at test time. It is fundamentally unreasonable to expect a model optimized on limited labeled data to carry out a qualitatively different discovery objective during inference. This mismatch creates a clear optimization misalignment between the offline learning stage and the online discovery stage. In addition, prior methods often depend on hash-based encodings or severe feature compression, which further limits representational capacity. To address these issues, we propose Learning through Creation (LTC), a fully feature-based and hash-free framework that injects novel-category awareness directly into offline learning. At its core is a lightweight, online pseudo-unknown generator driven by kernel-energy minimization and entropy maximization (MKEE). Unlike previous methods that generate synthetic samples once before training, our generator evolves jointly with the model dynamics and synthesizes pseudo-novel instances on the fly at negligible cost. These samples are incorporated through a dual max-margin objective with adaptive thresholding, strengthening the model's ability to delineate and detect unknown regions through explicit creation. Extensive experiments across seven benchmarks show that LTC consistently outperforms prior work, achieving improvements ranging from 1.5 percent to 13.1 percent in all-class accuracy. The code is available at this https URL

Title: On Interpolation Formulas Describing Neural Network Generalization

Authors: Jin Guo, Roy Y. He, Jean-Michel Morel
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2603.13872
Pdf URL: https://arxiv.org/pdf/2603.13872
Copy Paste: [[2603.13872]] On Interpolation Formulas Describing Neural Network Generalization(https://arxiv.org/abs/2603.13872)
Keywords: diffusion
Abstract: In 2020 Domingos introduced an interpolation formula valid for "every model trained by gradient descent". He concluded that such models behave approximately as kernel machines. In this work, we extend the Domingos formula to stochastic training. We introduce a stochastic gradient kernel that extends the deterministic version via a continuous-time diffusion approximation. We prove stochastic Domingos theorems and show that the expected network output admits a kernel-machine representation with optimizer-specific weighting. It reveals that training samples contribute through loss-dependent weights and gradient alignment along the training trajectory. We then link the generalization error to the null space of the integral operator induced by the stochastic gradient kernel. The same path-kernel viewpoint provides a unified interpretation of diffusion models and GANs: diffusion induces stage-wise, noise-localized corrections, whereas GANs induce distribution-guided corrections shaped by discriminator geometry. We visualize the evolution of implicit kernels during optimization and quantify out-of-distribution behaviors through a series of numerical experiments. Our results support a feature-space memory view of learning: training stores data-dependent information in an evolving tangent feature geometry, and predictions at test time arise from kernel-weighted retrieval and aggregation of these stored features, with generalization governed by alignment between test points and the learned feature memory.
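
For orientation, the deterministic result that the abstract builds on can be sketched as follows. This is the standard gradient-flow argument behind the Domingos formula, not the stochastic extension proposed in the paper. The symbols (f_w for the network, m training pairs (x_i, y_i^*), L' for the derivative of the per-sample loss with respect to the prediction, T for the training time) are notation assumed here for illustration.

```latex
% Gradient flow on the total loss \sum_i L(y_i^*, y_i), with y_i = f_w(x_i):
\frac{dw}{dt} = -\sum_{i=1}^{m} L'(y_i^*, y_i)\, \nabla_w f_w(x_i)

% Chain rule for the output at a query point x, with the tangent kernel K^g:
\frac{d f_w(x)}{dt} = \nabla_w f_w(x) \cdot \frac{dw}{dt}
  = -\sum_{i=1}^{m} L'(y_i^*, y_i)\, K^{g}_{w(t)}(x, x_i),
\qquad
K^{g}_{w}(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x')

% Integrating over the training trajectory t \in [0, T]:
f_{w(T)}(x) = f_{w(0)}(x)
  - \sum_{i=1}^{m} \int_0^T L'(y_i^*, y_i)\, K^{g}_{w(t)}(x, x_i)\, dt

% Kernel-machine form with the path kernel K^p:
f_{w(T)}(x) = \sum_{i=1}^{m} a_i\, K^{p}(x, x_i) + b,
\qquad
K^{p}(x, x_i) = \int_0^T K^{g}_{w(t)}(x, x_i)\, dt,

a_i = -\frac{\int_0^T L'(y_i^*, y_i)\, K^{g}_{w(t)}(x, x_i)\, dt}
            {\int_0^T K^{g}_{w(t)}(x, x_i)\, dt},
\qquad
b = f_{w(0)}(x)
```

The coefficients a_i are loss derivatives averaged along the trajectory with tangent-kernel weights, which matches the abstract's description of samples contributing "through loss-dependent weights and gradient alignment". Because a_i still depends on the query point x, the representation is a kernel machine only approximately.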

Title: GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13875
Pdf URL: https://arxiv.org/pdf/2603.13875
Copy Paste: [[2603.13875]] GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent(https://arxiv.org/abs/2603.13875)
Keywords: self-supervised
Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key--value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

Title: CT-Conditioned Diffusion Prior with Physics-Constrained Sampling for PET Super-Resolution

Authors: Liutao Yang, Zi Wang, Peiyuan Jing, Xiaowen Wang, Javier A. Montoya-Zegarra, Kuangyu Shi, Daoqiang Zhang, Guang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13901
Pdf URL: https://arxiv.org/pdf/2603.13901
Copy Paste: [[2603.13901]] CT-Conditioned Diffusion Prior with Physics-Constrained Sampling for PET Super-Resolution(https://arxiv.org/abs/2603.13901)
Keywords: diffusion, generative
Abstract: PET super-resolution is highly under-constrained because paired multi-resolution scans from the same subject are rarely available, and effective resolution is determined by scanner-specific physics (e.g., PSF, detector geometry, and acquisition settings). This limits supervised end-to-end training and makes purely image-domain generative restoration prone to hallucinated structures when anatomical and physical constraints are weak. We formulate PET super-resolution as posterior inference under heterogeneous system configurations and propose a CT-conditioned diffusion framework with physics-constrained sampling. During training, a conditional diffusion prior is learned from high-quality PET/CT pairs using cross-attention for anatomical guidance, without requiring paired LR--HR PET data. During inference, measurement consistency is enforced through a scanner-aware forward model with explicit PSF effects and gradient-based data-consistency refinement. Under both standard and OOD settings, the proposed method consistently improves experimental metrics and lesion-level clinical relevance indicators over strong baselines, while reducing hallucination artifacts and improving structural fidelity.

Title: Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

Authors: Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2603.13904
Pdf URL: https://arxiv.org/pdf/2603.13904
Copy Paste: [[2603.13904]] Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition(https://arxiv.org/abs/2603.13904)
Keywords: self-supervised
Abstract: For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.

Title: Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation

Authors: Stefan Ainetter, Thomas Deixelberger, Edoardo A. Dominici, Philipp Drescher, Konstantinos Vardis, Markus Steinberger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13910
Pdf URL: https://arxiv.org/pdf/2603.13910
Copy Paste: [[2603.13910]] Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation(https://arxiv.org/abs/2603.13910)
Keywords: diffusion
Abstract: We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balances coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.

Title: Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video

Authors: Yuting Tan, Xilong Cheng, Yunxiao Qin, Zhengnan Li, Jingjing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13912
Pdf URL: https://arxiv.org/pdf/2603.13912
Copy Paste: [[2603.13912]] Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video(https://arxiv.org/abs/2603.13912)
Keywords: self-supervised
Abstract: Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.

Title: Discriminative Flow Matching Via Local Generative Predictors

Authors: Om Govind Jha, Manoj Bamniya, Ayon Borthakur
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13928
Pdf URL: https://arxiv.org/pdf/2603.13928
Copy Paste: [[2603.13928]] Discriminative Flow Matching Via Local Generative Predictors(https://arxiv.org/abs/2603.13928)
Keywords: generative
Abstract: Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold -- such as class embeddings or bounding box coordinates -- we are at the interface between generative and discriminative learning. Our method attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.

Title: Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing

Authors: Kursat Komurcu, Linas Petkevicius
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13943
Pdf URL: https://arxiv.org/pdf/2603.13943
Copy Paste: [[2603.13943]] Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing(https://arxiv.org/abs/2603.13943)
Keywords: diffusion, self-supervised, generative
Abstract: Predicting satellite imagery requires a balance between structural accuracy and textural detail. Standard deterministic methods like PredRNN or SimVP minimize pixel-based errors but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geographic-spatial features. Generative models provide realistic textures but often misleadingly reveal structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines Self-Supervised Learning (SSL) with Latent Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then route a frozen Stable Diffusion backbone via a lightweight cross-attention adapter. This ensures that the synthesized high-accuracy textures are based on absolutely accurate structural predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite standard autoregressive stability limits. The code and dataset are publicly available at this https URL.

Title: DCP-CLIP:A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction

Authors: Jing Wang, Huimin Shi, Quan Zhou, Qibo Liu, Suofei Zhang, Huimin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13951
Pdf URL: https://arxiv.org/pdf/2603.13951
Copy Paste: [[2603.13951]] DCP-CLIP:A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction(https://arxiv.org/abs/2603.13951)
Keywords: foundation model
Abstract: Recent years have witnessed remarkable progress in open-vocabulary semantic segmentation (OVSS) using visual-language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross-modal communication between textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse-to-fine framework, called DCP-CLIP, for OVSS. Unlike prior efforts that mainly relied on pre-established category content and the inherent spatial-class interaction capability of CLIP, we dynamically construct category-relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open-vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross-modally integrating semantic information from textual guidance into the visual representations and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine-grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP-CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.

Title: IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation

Authors: Chenru Wang, Yunyi Chen, Zijun Yang, Joey Tianyi Zhou, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13960
Pdf URL: https://arxiv.org/pdf/2603.13960
Copy Paste: [[2603.13960]] IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation(https://arxiv.org/abs/2603.13960)
Keywords: diffusion, generative
Abstract: Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling (S^3) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.

Title: VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction

Authors: Hiroto Nakata, Yawen Zou, Shunsuke Sakai, Shun Maeda, Chunzhi Gu, Yijin Wei, Shangce Gao, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13964
Pdf URL: https://arxiv.org/pdf/2603.13964
Copy Paste: [[2603.13964]] VID-AD: A Dataset for Image-Level Logical Anomaly Detection under Vision-Induced Distraction(https://arxiv.org/abs/2603.13964)
Keywords: anomaly
Abstract: Logical anomaly detection in industrial inspection remains challenging due to variations in visual appearance (e.g., background clutter, illumination shift, and blur), which often distract vision-centric detectors from identifying rule-level violations. However, existing benchmarks rarely provide controlled settings where logical states are fixed while such nuisance factors vary. To address this gap, we introduce VID-AD, a dataset for logical anomaly detection under vision-induced distraction. It comprises 10 manufacturing scenarios and five capture conditions, totaling 50 one-class tasks and 10,395 images. Each scenario is defined by two logical constraints selected from quantity, length, type, placement, and relation, with anomalies including both single-constraint and combined violations. We further propose a language-based anomaly detection framework that relies solely on text descriptions generated from normal images. Using contrastive learning with positive texts and contradiction-based negative texts synthesized from these descriptions, our method learns embeddings that capture logical attributes rather than low-level features. Extensive experiments demonstrate consistent improvements over baselines across the evaluated settings. The dataset is available at: this https URL.

Title: VAD4Space: Visual Anomaly Detection for Planetary Surface Imagery

Authors: Fabrizio Genilotti, Arianna Stropeni, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13993
Pdf URL: https://arxiv.org/pdf/2603.13993
Copy Paste: [[2603.13993]] VAD4Space: Visual Anomaly Detection for Planetary Surface Imagery(https://arxiv.org/abs/2603.13993)
Keywords: anomaly
Abstract: Space missions generate massive volumes of high-resolution orbital and surface imagery that far exceed the capacity for manual inspection. Detecting rare phenomena is scientifically critical, yet traditional supervised learning struggles due to scarce labeled examples and closed-world assumptions that prevent discovery of genuinely novel observations. In this work, we investigate Visual Anomaly Detection (VAD) as a framework for automated discovery in planetary exploration. We present the first empirical evaluation of state-of-the-art feature-based VAD methods on real planetary imagery, encompassing both orbital lunar data and Mars rover surface imagery. To support this evaluation, we introduce two benchmarks: (i) a lunar dataset derived from Lunar Reconnaissance Orbiter Camera Narrow Angle imagery, comprising fresh and degraded craters as anomalies alongside normal terrain; and (ii) a Mars surface dataset designed to reflect the characteristics of rover-acquired imagery. We evaluate multiple VAD approaches with a focus on computationally efficient, edge-oriented solutions suitable for onboard deployment, applicable to both orbital platforms surveying the lunar surface and surface rovers operating on Mars. Our results demonstrate that feature-based VAD methods can effectively identify rare planetary surface phenomena while remaining feasible for resource-constrained environments. By grounding anomaly detection in planetary science, this work establishes practical benchmarks and highlights the potential of open-world perception systems to support a range of mission-critical applications, including tactical planning, landing site selection, hazard detection, bandwidth-aware data prioritization, and the discovery of unanticipated geological processes.

Title: Human-like Object Grouping in Self-supervised Vision Transformers

Authors: Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.13994
Pdf URL: https://arxiv.org/pdf/2603.13994
Copy Paste: [[2603.13994]] Human-like Object Grouping in Self-supervised Vision Transformers(https://arxiv.org/abs/2603.13994)
Keywords: self-supervised, foundation model
Abstract: Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object-centric component of representations by measuring patch similarity within and between objects. Across models, stronger object-centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self-supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self-supervised vision models capture object structure in a behaviorally human-like manner, and that Gram matrix structure plays a role in driving perceptual alignment.

Title: Benchmarking Open-Source PPG Foundation Models for Biological Age Prediction

Authors: N. Brag
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14030
Pdf URL: https://arxiv.org/pdf/2603.14030
Copy Paste: [[2603.14030]] Benchmarking Open-Source PPG Foundation Models for Biological Age Prediction(https://arxiv.org/abs/2603.14030)
Keywords: foundation model
Abstract: A task-specific model trained on 212,231 UK Biobank subjects to predict vascular age from PPG (AI-PPG Age) fails on a different clinical population: predictions collapse to a narrow 38-67 year range regardless of true age. Meanwhile, a general-purpose foundation model with no age-related training objective achieves lower error on the same data. We investigate why this happens and what it means for PPG-based biological age prediction. We evaluate three open-source PPG models (Pulse-PPG, PaPaGei-S, AI-PPG Age) on 906 surgical patients from PulseDB, using frozen embeddings with Ridge regression and 5-fold cross-validation. Pulse-PPG reaches MAE = 9.28 years, beating both AI-PPG Age in linear probe mode (9.72) and HR/HRV combined with demographics (9.59). Adding demographic features brings the best result down to MAE = 8.22 years (R2 = 0.517, r = 0.725). The predicted age gap correlates with diastolic blood pressure after adjusting for chronological age (r = -0.188, p = 1.2e-8), consistent with what Apple reported for their proprietary PpgAge model. The remaining gap with Apple (MAE 2.43) appears driven by dataset size (906 vs 213,593 subjects) and population differences rather than model architecture, as our learning curve shows no plateau. Code is publicly available.

Title: EyeWorld: A Generative World Model of Ocular State and Dynamics

Authors: Ziyu Gao, Xinyuan Wu, Xiaolan Chen, Zhuoran Liu, Ruoyu Chen, Bowen Liu, Bingjie Yan, Zhenhan Wang, Kai Jin, Jiancheng Yang, Yih Chung Tham, Mingguang He, Danli Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14039
Pdf URL: https://arxiv.org/pdf/2603.14039
Copy Paste: [[2603.14039]] EyeWorld: A Generative World Model of Ocular State and Dynamics(https://arxiv.org/abs/2603.14039)
Keywords: foundation model, generative
Abstract: Ophthalmic decision-making depends on subtle lesion-scale cues interpreted across multimodal imaging and over time, yet most medical foundation models remain static and degrade under modality and acquisition shifts. Here we introduce EyeWorld, a generative world model that conceptualizes the eye as a partially observed dynamical system grounded in clinical imaging. EyeWorld learns an observation-stable latent ocular state shared across modalities, unifying fine-grained parsing, structure-preserving cross-modality translation and quality-robust enhancement within a single framework. Longitudinal supervision further enables time-conditioned state transitions, supporting forecasting of clinically meaningful progression while preserving stable anatomy. By moving from static representation learning to explicit dynamical modeling, EyeWorld provides a unified approach to robust multimodal interpretation and prognosis-oriented simulation in medicine.

Title: TMPDiff: Temporal Mixed-Precision for Diffusion Models

Authors: Basile Lewandowski, Simon Kurz, Aditya Shankar, Robert Birke, Jian-Jia Chen, Lydia Y. Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14062
Pdf URL: https://arxiv.org/pdf/2603.14062
Copy Paste: [[2603.14062]] TMPDiff: Temporal Mixed-Precision for Diffusion Models(https://arxiv.org/abs/2603.14062)
Keywords: diffusion
Abstract: Diffusion models are the go-to method for Text-to-Image generation, but their iterative denoising process has high inference latency. Quantization reduces compute time by using lower bitwidths, but applies a fixed precision across all denoising timesteps, leaving an entire optimization axis unexplored. We propose TMPDiff, a temporal mixed-precision framework for diffusion models that assigns different numeric precision to different denoising timesteps. We hypothesize that quantization errors accumulate additively across timesteps, which we then validate experimentally. Based on our observations, we develop an adaptive bisectioning-based algorithm, which assigns per-step precisions with linear evaluation complexity, reducing an otherwise exponential search problem. Across four state-of-the-art diffusion models and three datasets, TMPDiff consistently outperforms uniform-precision baselines at matched speedup, achieving 10 to 20% improvement in perceptual quality. On FLUX.1-dev, TMPDiff achieves 90% SSIM relative to the full-precision model at a speedup of 2.5x over 16-bit inference.

Title: Self-Supervised Uncertainty Estimation For Super-Resolution of Satellite Images

Authors: Zhe Zheng, Valéry Dewil, Pablo Arias
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14074
Pdf URL: https://arxiv.org/pdf/2603.14074
Copy Paste: [[2603.14074]] Self-Supervised Uncertainty Estimation For Super-Resolution of Satellite Images(https://arxiv.org/abs/2603.14074)
Keywords: self-supervised
Abstract: Super-resolution (SR) of satellite imagery is challenging due to the lack of paired low-/high-resolution data. Recent self-supervised SR methods overcome this limitation by exploiting the temporal redundancy in burst observations, but they lack a mechanism to quantify uncertainty in the reconstruction. In this work, we introduce a novel self-supervised loss that allows uncertainty in image super-resolution to be estimated without ever accessing the ground-truth high-resolution data. We adopt a decision-theoretic perspective and show that minimizing the corresponding Bayesian risk yields the posterior mean and variance as optimal estimators. We validate our approach on a synthetic SkySat L1B dataset and demonstrate that it produces calibrated uncertainty estimates comparable to supervised methods. Our work bridges self-supervised restoration with uncertainty quantification, yielding a practical framework for uncertainty-aware image reconstruction.

Title: Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining

Authors: Eytan Kats, Mattias P. Heinrich
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14086
Pdf URL: https://arxiv.org/pdf/2603.14086
Copy Paste: [[2603.14086]] Effective Feature Learning for 3D Medical Registration via Domain-Specialized DINO Pretraining(https://arxiv.org/abs/2603.14086)
Keywords: self-supervised
Abstract: Medical image registration is a critical component of clinical imaging workflows, enabling accurate longitudinal assessment, multi-modal data fusion, and image-guided interventions. Intensity-based approaches often struggle with interscanner variability and complex anatomical deformations, whereas feature-based methods offer improved robustness by leveraging semantically informed representations. In this work, we investigate DINO-style self-supervised pretraining directly on 3D medical imaging data, aiming to learn dense volumetric features well suited for deformable registration. We assess the resulting representations on a challenging interpatient abdominal registration task across both MRI and CT modalities. Our domain-specialized pretraining outperforms the DINOv2 model trained on a large-scale collection of natural images, while requiring substantially lower computational resources at inference time. Moreover, it surpasses established registration models under out-of-domain evaluation, demonstrating the value of task-agnostic yet medical imaging-focused pretraining for robust and efficient 3D image registration.

Title: Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels

Authors: Michael Leznik
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2603.14092
Pdf URL: https://arxiv.org/pdf/2603.14092
Copy Paste: [[2603.14092]] Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels(https://arxiv.org/abs/2603.14092)
Keywords: generative
Abstract: The Expected Calibration Error (ECE), the dominant calibration metric in machine learning, compares predicted probabilities against empirical frequencies of binary outcomes. This is appropriate when labels are binary events. However, many modern settings produce labels that are themselves probabilities rather than binary outcomes: a radiologist's stated confidence, a teacher model's soft output in knowledge distillation, a class posterior derived from a generative model, or an annotator agreement fraction. In these settings, ECE commits a category error: it discards the probabilistic information in the label by forcing it into a binary comparison. The result is not a noisy approximation that more data will correct. It is a structural misalignment that persists and converges to the wrong answer with increasing precision as sample size grows. We introduce the Soft Mean Expected Calibration Error (SMECE), a calibration metric for settings where labels are probabilistic in nature. The modification to the ECE formula is one line: replace the empirical hard-label fraction in each prediction bin with the mean probability label of the samples in that bin. SMECE reduces exactly to ECE when labels are binary, making it a strict generalisation.
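
The one-line change described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the equal-width binning, the default of 10 bins, and the function name `smece` are assumptions made here for illustration. The abstract specifies only that the per-bin hard-label fraction is replaced by the mean probability label.

```python
import numpy as np

def smece(pred, soft_labels, n_bins=10):
    """Binned calibration error against probabilistic labels (illustrative sketch).

    pred        : predicted probabilities in [0, 1]
    soft_labels : labels in [0, 1]; binary labels are a special case
    n_bins      : number of equal-width bins (an assumption, not from the abstract)
    """
    pred = np.asarray(pred, dtype=float)
    soft_labels = np.asarray(soft_labels, dtype=float)
    # Assign each prediction to an equal-width bin; a prediction of 1.0 goes in the last bin.
    bin_ids = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # The one-line change: mean soft label instead of hard-label fraction.
        gap = abs(pred[mask].mean() - soft_labels[mask].mean())
        total += mask.mean() * gap  # weight by the share of samples in the bin
    return total
```

With binary labels the mean label in a bin equals the hard-label fraction, so under these assumptions the function computes the usual binned ECE.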

Title: Not All Latent Spaces Are Flat: Hyperbolic Concept Control

Authors: Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà, Guido Maria D'Amely di Melendugno, Luca Franco, Fabio Galasso, Iacopo Masi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14093
Pdf URL: https://arxiv.org/pdf/2603.14093
Copy Paste: [[2603.14093]] Not All Latent Spaces Are Flat: Hyperbolic Concept Control(https://arxiv.org/abs/2603.14093)
Keywords: generative
Abstract: As modern text-to-image (T2I) models draw closer to synthesizing highly realistic content, the threat of unsafe content generation grows, and it becomes paramount to exercise control. Existing approaches steer these models by applying Euclidean adjustments to text embeddings, redirecting the generation away from unsafe concepts. In this work, we introduce hyperbolic control (HyCon): a novel control mechanism based on parallel transport that leverages a semantically aligned hyperbolic representation space to yield more expressive and stable manipulation of concepts. HyCon reuses off-the-shelf generative models and a state-of-the-art hyperbolic text encoder, linked via a lightweight adapter. HyCon achieves state-of-the-art results across four safety benchmarks and four T2I backbones, showing that hyperbolic steering is a practical and flexible approach for more reliable T2I generation.

Title: Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution

Authors: Dan Wang, Haiyan Sun, Shan Du, Z. Jane Wang, Zhaochong An, Serge Belongie, Xinrui Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14112
Pdf URL: https://arxiv.org/pdf/2603.14112
Copy Paste: [[2603.14112]] Revisiting the Perception-Distortion Trade-off with Spatial-Semantic Guided Super-Resolution(https://arxiv.org/abs/2603.14112)
Keywords: diffusion, generative
Abstract: Image super-resolution (SR) aims to reconstruct high-resolution images with both high perceptual quality and low distortion, but is fundamentally limited by the perception-distortion trade-off. GAN-based SR methods reduce distortion but still struggle with realistic fine-grained textures, whereas diffusion-based approaches synthesize rich details but often deviate from the input, hallucinating structures and degrading fidelity. This tension raises a key challenge: how to exploit the powerful generative priors of diffusion models without sacrificing fidelity. To address this, we propose SpaSemSR, a spatial-semantic guided diffusion framework with two complementary guidances. First, spatial-grounded textual guidance integrates object-level spatial cues with semantic prompts, aligning textual and visual structures to reduce distortion. Second, semantic-enhanced visual guidance with a multi-encoder design and semantic degradation constraints unifies multimodal semantic priors, improving perceptual realism under severe degradations. These complementary guidances are adaptively fused into the diffusion process via spatial-semantic attention, suppressing distortion and hallucination while retaining the strengths of diffusion models. Extensive experiments on multiple benchmarks show that SpaSemSR achieves a superior perception-distortion balance, producing both realistic and faithful restorations.

Title: Diffusion Reinforcement Learning via Centered Reward Distillation

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14128
Pdf URL: https://arxiv.org/pdf/2603.14128
Copy Paste: [[2603.14128]] Diffusion Reinforcement Learning via Centered Reward Distillation(https://arxiv.org/abs/2603.14128)
Keywords: diffusion, generative
Abstract: Diffusion and flow models achieve State-Of-The-Art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.

Title: Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images

Authors: Rupa Kurinchi-Vendhan, Pratyusha Sharma, Antonio Torralba, Sara Beery
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14151
Pdf URL: https://arxiv.org/pdf/2603.14151
Copy Paste: [[2603.14151]] Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images(https://arxiv.org/abs/2603.14151)
Keywords: diffusion
Abstract: Scientific and environmental imagery often suffers from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard "black-box" restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.

Title: SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation

Authors: Anbang Wang, Yuzhuo Ao, Shangzhe Wu, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14152
Pdf URL: https://arxiv.org/pdf/2603.14152
Copy Paste: [[2603.14152]] SK-Adapter: Skeleton-Based Structural Control for Native 3D Generation(https://arxiv.org/abs/2603.14152)
Keywords: foundation model, generative
Abstract: Native 3D generative models have achieved remarkable fidelity and speed, yet they suffer from a critical limitation: the inability to prescribe precise structural articulations, as precise structural control within the native 3D space remains underexplored. This paper proposes SK-Adapter, a simple yet highly efficient and effective framework that unlocks precise skeletal manipulation for native 3D generation. Moving beyond text or image prompts, which can be ambiguous for precise structure, we treat the 3D skeleton as a first-class control signal. SK-Adapter is a lightweight structural adapter network that encodes joint coordinates and topology into learnable tokens, which are injected into the frozen 3D generation backbone via cross-attention. This design allows the model not only to "attend" effectively to specific 3D structural constraints but also to preserve its original generative priors. To bridge the data gap, we contribute the Objaverse-TMS dataset, a large-scale dataset of 24k text-mesh-skeleton pairs. Extensive experiments confirm that our method achieves robust structural control while preserving the geometry and texture quality of the foundation model, significantly outperforming existing baselines. Furthermore, we extend this capability to local 3D editing, enabling region-specific editing of existing assets with skeletal guidance, which is unattainable by previous methods. Project Page: this https URL

Title: TACTIC for Navigating the Unknown: Tabular Anomaly deteCTion via In-Context inference

Authors: Patryk Marszałek, Tomasz Kuśmierczyk, Marek Śmieja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14171
Pdf URL: https://arxiv.org/pdf/2603.14171
Copy Paste: [[2603.14171]] TACTIC for Navigating the Unknown: Tabular Anomaly deteCTion via In-Context inference(https://arxiv.org/abs/2603.14171)
Keywords: foundation model, anomaly, in-context
Abstract: Anomaly detection for tabular data has been a long-standing unsupervised learning problem that remains a major challenge for current deep learning models. Recently, in-context learning has emerged as a new paradigm that has shifted efforts from task-specific optimization to large-scale pretraining aimed at creating foundation models that generalize across diverse datasets. Although in-context models, such as TabPFN, perform well in supervised problems, their learned classification-based priors may not readily extend to anomaly detection. In this paper, we study in-context models for anomaly detection and show that the unsupervised extensions to TabPFN exhibit unstable behavior, particularly in noisy or contaminated contexts, in addition to the high computational cost. We address these challenges and introduce TACTIC, an in-context anomaly detection approach based on pretraining with anomaly-centric synthetic priors, which provides fast and data-dependent reasoning about anomalies while avoiding dataset-specific tuning. In contrast to typical score-based approaches, which produce uncalibrated anomaly scores that require post-processing (e.g. threshold selection or ranking heuristics), the proposed model is trained as a discriminative predictor, enabling unambiguous anomaly decisions in a single forward pass. Through experiments on real-world datasets, we examine the performance of TACTIC in clean and noisy contexts with varying anomaly rates and different anomaly types, as well as the impact of prior choices on detection quality. Our experiments clearly show that specialized anomaly-centric in-context models such as TACTIC are highly competitive compared to other task-specific methods.

Title: Artificial intelligence-enabled single-lead ECG for non-invasive hyperkalemia detection: development, multicenter validation, and proof-of-concept deployment

Authors: Gongzheng Tang, Qinghao Zhao, Guangkun Nie, Yujie Xiao, Shijia Geng, Donglin Xie, Shun Huang, Deyun Zhang, Xingchen Yao, Jinwei Wang, Kangyin Chen, Luxia Zhang, Shenda Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14177
Pdf URL: https://arxiv.org/pdf/2603.14177
Copy Paste: [[2603.14177]] Artificial intelligence-enabled single-lead ECG for non-invasive hyperkalemia detection: development, multicenter validation, and proof-of-concept deployment(https://arxiv.org/abs/2603.14177)
Keywords: foundation model
Abstract: Hyperkalemia is a life-threatening electrolyte disorder that is common in patients with chronic kidney disease and heart failure, yet frequent monitoring remains difficult outside hospital settings. We developed and validated Pocket-K, a single-lead AI-ECG system initialized from the ECGFounder foundation model for non-invasive hyperkalemia screening and handheld deployment. In this multicentre observational study using routinely collected clinical ECG and laboratory data, 34,439 patients contributed 62,290 ECG–potassium pairs. Lead I data were used to fine-tune the model. Data from Peking University People's Hospital were divided into development and temporal validation sets, and data from The Second Hospital of Tianjin Medical University served as an independent external validation set. Hyperkalemia was defined as venous serum potassium > 5.5 mmol/L. Pocket-K achieved AUROCs of 0.936 in internal testing, 0.858 in temporal validation, and 0.808 in external validation. For KDIGO-defined moderate-to-severe hyperkalemia (serum potassium >= 6.0 mmol/L), AUROCs increased to 0.940 and 0.861 in the temporal and external sets, respectively. External negative predictive value exceeded 99.3%. Model-predicted high risk below the hyperkalemia threshold was more common in patients with chronic kidney disease and heart failure. A handheld prototype enabled near-real-time inference, supporting future prospective evaluation in native handheld and wearable settings.

Title: Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

Authors: Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14186
Pdf URL: https://arxiv.org/pdf/2603.14186
Copy Paste: [[2603.14186]] Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models(https://arxiv.org/abs/2603.14186)
Keywords: diffusion, generative
Abstract: State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals and worsening perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.

Title: Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis

Authors: Zhiwei Wang, Yuxing Li, Meilu Zhu, Defeng He, Edmund Y. Lam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14188
Pdf URL: https://arxiv.org/pdf/2603.14188
Copy Paste: [[2603.14188]] Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma Diagnosis(https://arxiv.org/abs/2603.14188)
Keywords: diffusion
Abstract: Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at this https URL.

Title: DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution

Authors: Axi Niu, Kang Zhang, Qingsen Yan, Hao Jin, Jinqiu Sun, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14207
Pdf URL: https://arxiv.org/pdf/2603.14207
Copy Paste: [[2603.14207]] DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution(https://arxiv.org/abs/2603.14207)
Keywords: diffusion
Abstract: Scene Text Image Super-Resolution (STISR) aims to restore high-resolution details in low-resolution text images, which is crucial for both human readability and machine recognition. Existing methods, however, often depend on external Optical Character Recognition (OCR) models for textual priors or rely on complex multi-component architectures that are difficult to train and reproduce. In this paper, we introduce DualTSR, a unified end-to-end framework that addresses both issues. DualTSR employs a single multimodal transformer backbone trained with a dual diffusion objective. It simultaneously models the continuous distribution of high-resolution images via Conditional Flow Matching and the discrete distribution of textual content via discrete diffusion. This shared design enables visual and textual information to interact at every layer, allowing the model to infer text priors internally instead of relying on an external OCR module. Compared with prior multi-branch diffusion systems, DualTSR offers a simpler end-to-end formulation with fewer hand-crafted components. Experiments on synthetic Chinese benchmarks and a curated real-world evaluation protocol show that DualTSR achieves strong perceptual quality and text fidelity.

Title: ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control

Authors: Shishi Xiao, Tongyu Zhou, David Laidlaw, Gromit Yeuk-Yin Chan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14209
Pdf URL: https://arxiv.org/pdf/2603.14209
Copy Paste: [[2603.14209]] ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control(https://arxiv.org/abs/2603.14209)
Keywords: diffusion, generative
Abstract: A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: this https URL.

Title: FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection

Authors: Jie Li, Yingying Feng, Chi Xie, Jie Hu, Lei Tan, Jiayi Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14220
Pdf URL: https://arxiv.org/pdf/2603.14220
Copy Paste: [[2603.14220]] FIND: A Simple yet Effective Baseline for Diffusion-Generated Image Detection(https://arxiv.org/abs/2603.14220)
Keywords: diffusion
Abstract: The remarkable realism of images generated by diffusion models poses critical detection challenges. Current methods utilize reconstruction error as a discriminative feature, exploiting the observation that real images exhibit higher reconstruction errors when processed through diffusion models. However, these approaches require costly reconstruction computations and depend on specific diffusion models, making their performance highly model-dependent. We identify a fundamental difference: real images are more difficult to fit with Gaussian distributions compared to synthetic ones. In this paper, we propose Forgery Identification via Noise Disturbance (FIND), a novel method that requires only a simple binary classifier. It eliminates reconstruction by directly targeting the core distributional difference between real and synthetic images. Our key operation is to add Gaussian noise to real images during training and label these noisy versions as synthetic. This step allows the classifier to focus on the statistical patterns that distinguish real from synthetic images. We theoretically prove that the noise-augmented real images resemble diffusion-generated images in their ease of Gaussian fitting. Furthermore, simply by adding noise, they still retain visual similarity to the original images, highlighting the most discriminative distribution-related features. The proposed FIND improves performance by 11.7% on the GenImage benchmark while running 126x faster than existing methods. By removing the need for auxiliary diffusion models and reconstruction, it offers a practical, efficient, and generalizable way to detect diffusion-generated content.
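
The key training operation described in this abstract, adding Gaussian noise to real images and labelling the noisy copies as synthetic, can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `make_noise_augmented_batch`, the noise level `sigma`, the [0, 1] pixel range with clipping, and the 0 = real / 1 = synthetic label encoding are all assumptions. The abstract does not say whether genuinely generated images are also part of the training set.

```python
import numpy as np

def make_noise_augmented_batch(real_images, sigma=0.1, rng=None):
    """Build a binary-classification batch from real images only (illustrative sketch).

    real_images : array of shape (N, H, W, C) with values assumed to lie in [0, 1]
    sigma       : standard deviation of the added Gaussian noise (hypothetical value)
    """
    rng = np.random.default_rng() if rng is None else rng
    real = np.asarray(real_images, dtype=np.float32)
    # Noisy copies of the real images stand in for the "synthetic" class.
    noise = rng.normal(0.0, sigma, size=real.shape).astype(np.float32)
    noisy = np.clip(real + noise, 0.0, 1.0)  # clipping is an assumption
    images = np.concatenate([real, noisy], axis=0)
    labels = np.concatenate([
        np.zeros(len(real), dtype=np.int64),  # 0 = real
        np.ones(len(real), dtype=np.int64),   # 1 = labelled as synthetic
    ])
    return images, labels
```

The resulting `images` and `labels` could be fed to any ordinary binary classifier. No diffusion model or reconstruction step is involved in building the batch.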

Title: Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Authors: Ruoxi Cheng, Yizhong Ding, Hongyi Zhang, Yiyan Huang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14222
Pdf URL: https://arxiv.org/pdf/2603.14222
Copy Paste: [[2603.14222]] Membership Inference for Contrastive Pre-training Models with Text-only PII Queries(https://arxiv.org/abs/2603.14222)
Keywords: anomaly
Abstract: Contrastive pretraining models such as CLIP and CLAP underpin many vision-language and audio-language systems, yet their reliance on web-scale data raises growing concerns about memorizing Personally Identifiable Information (PII). Auditing such models via membership inference is challenging in practice: shadow-model MIAs are computationally prohibitive for large multimodal backbones, and existing multimodal attacks typically require querying the target with paired biometric inputs, thereby directly exposing sensitive biometric information to the target model. We propose Unimodal Membership Inference Detector (UMID), a text-only auditing framework that performs text-guided cross-modal latent inversion and extracts two complementary signals, similarity (alignment to the queried text) and variability (consistency across randomized inversions). UMID compares these statistics to a lightweight non-member reference constructed from synthetic gibberish and makes decisions via an ensemble of unsupervised anomaly detectors. Comprehensive experiments across diverse CLIP and CLAP architectures demonstrate that UMID significantly improves the effectiveness and efficiency over prior MIAs, delivering strong detection performance with sub-second auditing cost while complying with realistic privacy constraints.

Title: FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains

Authors: Vaibhav Rathore, Divyam Gupta, Moloud Abdar, Subhasis Chaudhuri, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14240
Pdf URL: https://arxiv.org/pdf/2603.14240
Copy Paste: [[2603.14240]] FOCUS: Bridging Fine-Grained Recognition and Open-World Discovery across Domains(https://arxiv.org/abs/2603.14240)
Keywords: diffusion
Abstract: We introduce the first unified framework for Fine-Grained Domain-Generalized Generalized Category Discovery (FG-DG-GCD), bringing open-world recognition closer to real-world deployment under domain shift. Unlike conventional GCD, which assumes labeled and unlabeled data come from the same distribution, DG-GCD learns only from labeled source data and must both recognize known classes and discover novel ones in unseen, unlabeled target domains. This problem is especially challenging in fine-grained settings, where subtle inter-class differences and large intra-class variation make domain generalization significantly harder. To support systematic evaluation, we establish the first FG-DG-GCD benchmarks by creating identity-preserving painting and sketch domains for CUB-200-2011, Stanford Cars, and FGVC-Aircraft using controlled diffusion-adapter stylization. On top of this, we propose FoCUS, a single-stage framework that combines Domain-Consistent Parts Discovery (DCPD) for geometry-stable part reasoning with Uncertainty-Aware Feature Augmentation (UFA) for confidence-calibrated feature regularization through uncertainty-guided perturbations. Extensive experiments show that FoCUS outperforms strong GCD, FG-GCD, and DG-GCD baselines by 3.28%, 9.68%, and 2.07%, respectively, in clustering accuracy on the proposed benchmarks. It also remains competitive on coarse-grained DG-GCD tasks while achieving nearly 3x higher computational efficiency than the current state of the art. Code and datasets will be released upon acceptance.

Title: CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control

Authors: Zhiyi Kuang, Chengan He, Egor Zakharov, Yuxuan Xue, Shunsuke Saito, Olivier Maury, Timur Bagautdinov, Youyi Zheng, Giljoo Nam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14241
Pdf URL: https://arxiv.org/pdf/2603.14241
Copy Paste: [[2603.14241]] CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control(https://arxiv.org/abs/2603.14241)
Keywords: diffusion, generative
Abstract: We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.

Title: GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies

Authors: He Zhang, Ying Sun, Hui Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14245
Pdf URL: https://arxiv.org/pdf/2603.14245
Copy Paste: [[2603.14245]] GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies(https://arxiv.org/abs/2603.14245)
Keywords: generative
Abstract: Flow-matching policies hold great promise for reinforcement learning (RL) by capturing complex, multi-modal action distributions. However, their practical application is often hindered by prohibitive inference latency and ineffective online exploration. Although recent works have employed one-step distillation for fast inference, the structure of the initial noise distribution remains an overlooked factor with significant untapped potential. This overlooked factor and the challenge of controlling policy stochasticity are two critical areas for advancing distilled flow-matching policies. To overcome these limitations, we propose GoldenStart (GSFlow), a policy distillation method with Q-guided priors and explicit entropy control. Instead of initializing generation from uninformed noise, we introduce a Q-guided prior modeled by a conditional VAE. This state-conditioned prior repositions the starting points of the one-step generation process into high-Q regions, effectively providing a "golden start" that shortcuts the policy to promising actions. Furthermore, for effective online exploration, we enable our distilled actor to output a stochastic distribution instead of a deterministic point. This is governed by entropy regularization, allowing the policy to shift from pure exploitation to principled exploration. Our integrated framework demonstrates that by designing the generative starting point and explicitly controlling policy entropy, it is possible to achieve efficient and exploratory policies, bridging generative models and practical actor-critic methods. We conduct extensive experiments on offline and online continuous control benchmarks, where our method significantly outperforms prior state-of-the-art approaches. Code will be available at this https URL.

Title: DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

Authors: Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen
Subjects: cs.CV, cs.AI, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.14267
Pdf URL: https://arxiv.org/pdf/2603.14267
Copy Paste: [[2603.14267]] DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization(https://arxiv.org/abs/2603.14267)
Keywords: generative
Abstract: Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.

Title: Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs

Authors: Karma Phuntsho, Abdullah, Kyungmi Lee, Ickjai Lee, Euijoon Ahn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14271
Pdf URL: https://arxiv.org/pdf/2603.14271
Copy Paste: [[2603.14271]] Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs(https://arxiv.org/abs/2603.14271)
Keywords: foundation model
Abstract: Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.

Title: Seeking Physics in Diffusion Noise

Authors: Chujun Tang, Lei Zhong, Fangqiang Ding
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2603.14294
Pdf URL: https://arxiv.org/pdf/2603.14294
Copy Paste: [[2603.14294]] Seeking Physics in Diffusion Noise(https://arxiv.org/abs/2603.14294)
Keywords: diffusion
Abstract: Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.

Title: Early Failure Detection and Intervention in Video Diffusion Models

Authors: Kwon Byung-Ki, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14320
Pdf URL: https://arxiv.org/pdf/2603.14320
Copy Paste: [[2603.14320]] Early Failure Detection and Intervention in Video Diffusion Models(https://arxiv.org/abs/2603.14320)
Keywords: diffusion
Abstract: Text-to-video (T2V) diffusion models have rapidly advanced, yet generations still occasionally fail in practice, such as low text-video alignment or low perceptual quality. Since diffusion sampling is non-deterministic, it is difficult to know during inference whether a generation will succeed or fail, incurring high computational cost due to trial-and-error regeneration. To address this, we propose an early failure detection and diagnostic intervention pipeline for latent T2V diffusion models. For detection, we design a Real-time Inspection (RI) module that converts latents into intermediate video previews, enabling the use of established text-video alignment scorers for inspection in the RGB space. The RI module completes the conversion and inspection process in just 39.2ms. This is highly efficient considering that CogVideoX-5B requires 4.3s per denoising step when generating a 480p, 49-frame video on an NVIDIA A100 GPU. Subsequently, we trigger a hierarchical and early-exit intervention pipeline only when failure is predicted. Experiments on CogVideoX-5B and Wan2.1-1.3B demonstrate consistency gains on VBench with up to 2.64 times less time overhead compared to post-hoc regeneration. Our method also generalizes to a higher-capacity setting, remaining effective on Wan2.1-14B with 720p resolution and 81-frame generation. Furthermore, our pipeline is plug-and-play and orthogonal to existing techniques, showing seamless compatibility with prompt refinement and sampling guidance methods. We also provide evidence that failure signals emerge early in the denoising process and are detectable within intermediate video previews using standard vision-language evaluators.

Title: AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Authors: Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14331
Pdf URL: https://arxiv.org/pdf/2603.14331
Copy Paste: [[2603.14331]] AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising(https://arxiv.org/abs/2603.14331)
Keywords: diffusion
Abstract: Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. Our page is available at: this https URL

Title: Representation Alignment for Just Image Transformers is not Easier than You Think

Authors: Jaeyo Shin, Jiwook Kim, Hyunjung Shim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14366
Pdf URL: https://arxiv.org/pdf/2603.14366
Copy Paste: [[2603.14366]] Representation Alignment for Just Image Transformers is not Easier than You Think(https://arxiv.org/abs/2603.14366)
Keywords: diffusion
Abstract: Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment with a Masked Transformer Adapter that combines a shallow transformer adapter with partial token masking. PixelREPA improves both training convergence and final quality. PixelREPA reduces FID from 3.66 to 3.17 for JiT-B$/16$ and improves Inception Score (IS) from 275.1 to 284.6 on ImageNet $256 \times 256$, while achieving $> 2\times$ faster convergence. Finally, PixelREPA-H$/16$ achieves FID$=1.81$ and IS$=317.2$. Our code is available at this https URL.

Title: The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics

Authors: Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, Zhengzhong Tu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14375
Pdf URL: https://arxiv.org/pdf/2603.14375
Copy Paste: [[2603.14375]] The Pulse of Motion: Measuring Physical Frame Rate from Visual Dynamics(https://arxiv.org/abs/2603.14375)
Keywords: generative
Abstract: While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is this https URL.

Title: ES-Merging: Biological MLLM Merging via Embedding Space Signals

Authors: Wonbin Lee, Dongki Kim, Sung Ju Hwang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14405
Pdf URL: https://arxiv.org/pdf/2603.14405
Copy Paste: [[2603.14405]] ES-Merging: Biological MLLM Merging via Embedding Space Signals(https://arxiv.org/abs/2603.14405)
Keywords: foundation model
Abstract: Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose a representation-aware merging framework that estimates merging coefficients from embedding space signals. We first design a probe input that consists of different modality tokens and forward it through each specialized MLLM to obtain layer-wise embedding responses that reflect modality-specific representation changes. We then estimate complementary merging coefficients at two granularities from the embedding space: layer-wise coefficients from coarse-grained signals and element-wise coefficients from fine-grained signals, which are jointly combined for robust coefficient estimation. Experiments on interactive effect prediction benchmarks show that our method outperforms existing merging methods and even surpasses task-specific fine-tuned models, establishing that embedding space signals provide a principled and effective foundation for cross-modal MLLM merging.

Title: Graph-Based Deep Learning for Intelligent Detection of Energy Losses, Theft, and Operational Inefficiencies in Oil & Gas Production Networks

Authors: AbdulQoyum A. Olowookere, Adewale U. Oguntola, Ebenezer. Leke Odekanle
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14406
Pdf URL: https://arxiv.org/pdf/2603.14406
Copy Paste: [[2603.14406]] Graph-Based Deep Learning for Intelligent Detection of Energy Losses, Theft, and Operational Inefficiencies in Oil & Gas Production Networks(https://arxiv.org/abs/2603.14406)
Keywords: anomaly
Abstract: Early detection of energy losses, theft, and operational inefficiencies remains a critical challenge in oil and gas production systems due to complex interdependencies among wells and facilities, evolving operating conditions, and limited labeled anomaly data. Traditional machine learning approaches often treat production units independently and struggle under temporal distribution shifts. This study proposes a spatiotemporal graph-based deep learning framework for anomaly detection in oil and gas production networks. The production system is modeled as a hierarchical graph of wells, facilities, and fields, with additional peer connections among wells sharing common infrastructure. Weakly supervised anomaly labels are derived from physically informed heuristics based on production, pressure, and flow behavior. Temporal dynamics are captured through sequence modeling, while relational dependencies are learned using a Temporal Graph Attention Network. Under time-based evaluation, the proposed model achieves an ROC-AUC of about 0.98 and anomaly recall above 0.93, demonstrating improved robustness and practical potential for proactive monitoring in real-world energy operations.

Title: Towards One-for-All Anomaly Detection for Tabular Data

Authors: Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14407
Pdf URL: https://arxiv.org/pdf/2603.14407
Copy Paste: [[2603.14407]] Towards One-for-All Anomaly Detection for Tabular Data(https://arxiv.org/abs/2603.14407)
Keywords: anomaly
Abstract: Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many real-world applications. However, existing methods follow a "one model for one dataset (OFO)" paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on-the-fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting.

Title: PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis

Authors: Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14409
Pdf URL: https://arxiv.org/pdf/2603.14409
Copy Paste: [[2603.14409]] PGcGAN: Pathological Gait-Conditioned GAN for Human Gait Synthesis(https://arxiv.org/abs/2603.14409)
Keywords: generative
Abstract: Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectory data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.

Title: On the (Generative) Linear Sketching Problem

Authors: Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14474
Pdf URL: https://arxiv.org/pdf/2603.14474
Copy Paste: [[2603.14474]] On the (Generative) Linear Sketching Problem(https://arxiv.org/abs/2603.14474)
Keywords: generative
Abstract: Sketch techniques have been extensively studied in recent years and are especially well-suited to data streaming scenarios, where the sketch summary is updated quickly and compactly. However, it is challenging to recover the current state from these summaries in a way that is accurate, fast, and real. In this paper, we seek a solution that reconciles this tension, aiming for near-perfect recovery with lightweight computational procedures. Focusing on linear sketching problems of the form $\boldsymbol{\Phi}f \rightarrow f$, our study proceeds in three stages. First, we dissect existing techniques and show the root cause of the sketching dilemma: an orthogonal information loss. Second, we examine how generative priors can be leveraged to bridge the information gap. Third, we propose FLORE, a novel generative sketching framework that embraces these analyses to achieve the best of all worlds. More importantly, FLORE can be trained without access to ground-truth data. Comprehensive evaluations demonstrate FLORE's ability to provide high-quality recovery, and support summary with low computing overhead, outperforming previous methods by up to 1000 times in error reduction and 100 times in processing speed compared to learning-based solutions.

Title: V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Authors: Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14482
Pdf URL: https://arxiv.org/pdf/2603.14482
Copy Paste: [[2603.14482]] V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning(https://arxiv.org/abs/2603.14482)
Keywords: self-supervised
Abstract: We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

Title: WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning

Authors: Stefan Englmeier, Katharina Winter, Fabian B. Flohr
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.14497
Pdf URL: https://arxiv.org/pdf/2603.14497
Copy Paste: [[2603.14497]] WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning(https://arxiv.org/abs/2603.14497)
Keywords: foundation model
Abstract: Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.

Title: Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models

Authors: Diego Royo, Brandon Zhao, Adolfo Muñoz, Diego Gutierrez, Katherine L. Bouman
Subjects: cs.CV, astro-ph.CO
Abstract URL: https://arxiv.org/abs/2603.14503
Pdf URL: https://arxiv.org/pdf/2603.14503
Copy Paste: [[2603.14503]] Mapping Dark-Matter Clusters via Physics-Guided Diffusion Models(https://arxiv.org/abs/2603.14503)
Keywords: diffusion
Abstract: Galaxy clusters are powerful probes of astrophysics and cosmology through gravitational lensing: the clusters' mass, dominated by 85% dark matter, distorts background light. Yet, mass reconstruction lacks the scalability and large-scale benchmarks to process the hundreds of thousands of clusters expected from forthcoming wide-field surveys. We introduce a fully automated method to reconstruct cluster surface mass density from photometry and gravitational lensing observables. Central to our approach is DarkClusters-15k, our new dataset of 15,000 simulated clusters with paired mass and photometry maps, the largest benchmark to date, spanning multiple redshifts and simulation frameworks. We train a plug-and-play diffusion prior on DarkClusters-15k that learns the statistical relationship between mass and light, and draw posterior samples constrained by weak- and strong-lensing observables; this yields principled reconstructions driven by explicit physics, alongside well-calibrated uncertainties. Our approach requires no expert tuning, runs in minutes rather than hours, achieves higher accuracy, and matches expertly-tuned reconstructions of the MACS 1206 cluster. We release our method and DarkClusters-15k to support development and benchmarking for upcoming wide-field cosmological surveys.

Title: Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models

Authors: Niklas Schweiger, Daniel Cremers, Karnik Ram
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.14504
Pdf URL: https://arxiv.org/pdf/2603.14504
Copy Paste: [[2603.14504]] Trust-Region Noise Search for Black-Box Alignment of Diffusion and Flow Models(https://arxiv.org/abs/2603.14504)
Keywords: diffusion, generative
Abstract: Optimizing the noise samples of diffusion and flow models is an increasingly popular approach to align these models to target rewards at inference time. However, we observe that these approaches are usually restricted to differentiable or cheap reward models, the formulation of the underlying pretrained generative model, or are memory/compute inefficient. We instead propose a simple trust-region based search algorithm (TRS) which treats the pre-trained generative and reward models as a black-box and only optimizes the source noise. Our approach achieves a good balance between global exploration and local exploitation, and is versatile and easily adaptable to various generative settings and reward models with minimal hyperparameter tuning. We evaluate TRS across text-to-image, molecule and protein design tasks, and obtain significantly improved output samples over the base generative models and other inference-time alignment approaches which optimize the source noise sample, or even the entire reverse-time sampling noise trajectories in the case of diffusion models. Our source code is publicly available.

Title: Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs

Authors: Yiren Zheng, Shibo Li, Jiaming Liu, Haofan Wang, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14505
Pdf URL: https://arxiv.org/pdf/2603.14505
Copy Paste: [[2603.14505]] Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs(https://arxiv.org/abs/2603.14505)
Keywords: generative, in-context
Abstract: Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.

Title: LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

Authors: Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, Ioannis Patras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14526
Pdf URL: https://arxiv.org/pdf/2603.14526
Copy Paste: [[2603.14526]] LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion(https://arxiv.org/abs/2603.14526)
Keywords: diffusion
Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.

Title: Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events

Authors: Shuang Guo, Filbert Febryanto, Lei Sun, Guillermo Gallego
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.14528
Pdf URL: https://arxiv.org/pdf/2603.14528
Copy Paste: [[2603.14528]] Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events(https://arxiv.org/abs/2603.14528)
Keywords: foundation model
Abstract: In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.

Title: Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders

Authors: Jiaming Chu, Tao Wang, Lei Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14536
Pdf URL: https://arxiv.org/pdf/2603.14536
Copy Paste: [[2603.14536]] Distilling Latent Manifolds: Resolution Extrapolation by Variational Autoencoders(https://arxiv.org/abs/2603.14536)
Keywords: generative
Abstract: Variational Autoencoder (VAE) encoders play a critical role in modern generative models, yet their computational cost often motivates the use of knowledge distillation or quantization to obtain compact alternatives. Existing studies typically assume that a model works better on samples close to its training data distribution than on samples from an unseen distribution. In this work, we report a counter-intuitive phenomenon in VAE encoder distillation: a compact encoder distilled only at low resolutions exhibits poor reconstruction performance at its native resolution, but achieves dramatically improved results when evaluated at higher, unseen input resolutions. Despite never being trained beyond $256^2$ resolution, the distilled encoder generalizes effectively to $512^2$ resolution inputs, partially inheriting the teacher model's resolution [...]. We further analyze latent distributions across resolutions and find that higher-resolution inputs produce latent representations more closely aligned with the teacher's manifold. Through extensive experiments on ImageNet-256, we show that simple resolution remapping (upsampling inputs before encoding and downsampling reconstructions for evaluation) leads to substantial gains across PSNR, MSE, SSIM, LPIPS, and rFID metrics. These findings suggest that VAE encoder distillation learns resolution-consistent latent manifolds rather than resolution-specific pixel mappings. This also means that high training costs in memory, time, and high-resolution data are not necessary for distilling a VAE capable of high-resolution image reconstruction. Even on low-resolution datasets, the distilled model can still learn the teacher model's detailed knowledge of high-resolution image reconstruction.

Title: Learning to Order: Task Sequencing as In-Context Optimization

Authors: Jan Kobiolka, Christian Frey, Arlind Kadra, Gresa Shala, Josif Grabocka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14550
Pdf URL: https://arxiv.org/pdf/2603.14550
Copy Paste: [[2603.14550]] Learning to Order: Task Sequencing as In-Context Optimization(https://arxiv.org/abs/2603.14550)
Keywords: in-context
Abstract: Task sequencing (TS) is one of the core open problems in Deep Learning, arising in a plethora of real-world domains, from robotic assembly lines to autonomous driving. Unfortunately, prior work has not convincingly demonstrated that meta-learned TS methods generalize to new TS problems given only a few initial demonstrations. In this paper, we demonstrate that deep neural networks can meta-learn over an infinite prior of synthetically generated TS problems and achieve few-shot generalization. We meta-learn a transformer-based architecture over datasets of sequencing trajectories generated from a prior distribution that samples sequencing problems as paths in directed graphs. In a large-scale experiment, we provide ample empirical evidence that our meta-learned models discover optimal task sequences significantly quicker than non-meta-learned baselines.

Title: A Multi-Scale Graph Learning Framework with Temporal Consistency Constraints for Financial Fraud Detection in Transaction Networks under Non-Stationary Conditions

Authors: Yiming Lei, Qiannan Shen, Junhao Song
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2603.14592
Pdf URL: https://arxiv.org/pdf/2603.14592
Copy Paste: [[2603.14592]] A Multi-Scale Graph Learning Framework with Temporal Consistency Constraints for Financial Fraud Detection in Transaction Networks under Non-Stationary Conditions(https://arxiv.org/abs/2603.14592)
Keywords: diffusion, self-supervised, anomaly
Abstract: Financial fraud detection in transaction networks involves modeling sparse anomalies, dynamic patterns, and severe class imbalance in the presence of temporal drift in the data. In real-world transaction systems, a suspicious transaction is rarely isolated: rather, legitimate and suspicious transactions are often connected through accounts, intermediaries or through temporal transaction sequences. Attribute-based or randomly partitioned learning pipelines are therefore insufficient to detect relationally structured fraud. STC-MixHop is a graph-based framework that combines spatial multi-resolution propagation with lightweight temporal consistency modeling for anomaly and fraud detection in dynamic transaction networks. It integrates three components: a MixHop-inspired multi-scale neighborhood diffusion encoder for learning structural patterns; a spatial-temporal attention module coupling current and preceding graph snapshots to stabilize representations; and a temporally informed self-supervised pretraining strategy exploiting unlabeled transaction interactions to improve representation quality. We evaluate the framework primarily on the PaySim dataset under strict chronological splits, supplementing the analysis with Porto Seguro and FEMA data to probe cross-domain component behavior. Results show that STC-MixHop is competitive among graph methods and achieves strong screening-oriented recall under highly imbalanced conditions. The experiments also reveal an important boundary condition: when node attributes are highly informative, tabular baselines remain difficult to outperform. Graph structure contributes most clearly where hidden relational dependencies are operationally important. These findings support a stability-focused view of graph learning for financial fraud detection.

Title: $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

Authors: Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14602
Pdf URL: https://arxiv.org/pdf/2603.14602
Copy Paste: [[2603.14602]] $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought(https://arxiv.org/abs/2603.14602)
Keywords: in-context
Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
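Illustrative note: the Jaccard score mentioned above is the size of the intersection of two sets divided by the size of their union. The sketch below computes it for a hypothetical set of recalled policy identifiers against the relevant ones. The function name, the identifiers, and the empty-set convention are assumptions, and the paper's actual PolicyRecall reward and Hallucination Penalty may be defined differently.

```python
def jaccard(recalled, relevant):
    """Jaccard similarity between two collections of policy identifiers."""
    a, b = set(recalled), set(relevant)
    if not a and not b:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)


# Hypothetical policy identifiers, for illustration only.
score = jaccard(["p1", "p2", "p3"], ["p2", "p3", "p4"])
```

Here two of the four distinct identifiers are shared, so `score` is 0.5. Recalling extra, irrelevant policies enlarges the union and lowers the score, which is why a set-overlap measure is a natural fit for a recall-style reward.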

Title: Make it SING: Analyzing Semantic Invariants in Classifiers

Authors: Harel Yadid, Meir Yossef Levi, Roy Betser, Guy Gilboa
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2603.14610
Pdf URL: https://arxiv.org/pdf/2603.14610
Copy Paste: [[2603.14610]] Make it SING: Analyzing Semantic Invariants in Classifiers(https://arxiv.org/abs/2603.14610)
Keywords: self-supervised
Abstract: All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DinoViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.

Title: A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans

Authors: Aadit Nilay, Bhavesh Thapar, Anant Agrawal, Mohammad Nayeem Teli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14621
Pdf URL: https://arxiv.org/pdf/2603.14621
Copy Paste: [[2603.14621]] A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans(https://arxiv.org/abs/2603.14621)
Keywords: self-supervised
Abstract: The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.

Title: Continual Few-shot Adaptation for Synthetic Fingerprint Detection

Authors: Joseph Geo Benjamin, Anil K. Jain, Karthik Nandakumar
Subjects: cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2603.14632
Pdf URL: https://arxiv.org/pdf/2603.14632
Copy Paste: [[2603.14632]] Continual Few-shot Adaptation for Synthetic Fingerprint Detection(https://arxiv.org/abs/2603.14632)
Keywords: generative
Abstract: The quality and realism of synthetically generated fingerprint images have increased significantly over the past decade fueled by advancements in generative artificial intelligence (GenAI). This has exacerbated the vulnerability of fingerprint recognition systems to data injection attacks, where synthetic fingerprints are maliciously inserted during enrollment or authentication. Hence, there is an urgent need for methods to detect if a fingerprint image is real or synthetic. While it is straightforward to train deep neural network (DNN) models to classify images as real or synthetic, often such DNN models overfit the training data and fail to generalize well when applied to synthetic fingerprints generated using unseen GenAI models. In this work, we formulate synthetic fingerprint detection as a continual few-shot adaptation problem, where the objective is to rapidly evolve a base detector to identify new types of synthetic data. To enable continual few-shot adaptation, we employ a combination of binary cross-entropy and supervised contrastive (applied to the feature representation) losses and replay a few samples from previously known styles during fine-tuning to mitigate catastrophic forgetting. Experiments based on several DNN backbones (as feature extractors) and a variety of real and synthetic fingerprint datasets indicate that the proposed approach achieves a good trade-off between fast adaptation for detecting unseen synthetic styles and forgetting of known styles.

Title: Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

Authors: Mang Ning, Mingxiao Li, Le Zhang, Lanmiao Liu, Matthew B. Blaschko, Albert Ali Salah, Itir Onal Ertugrul
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14645
Pdf URL: https://arxiv.org/pdf/2603.14645
Copy Paste: [[2603.14645]] Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion(https://arxiv.org/abs/2603.14645)
Keywords: diffusion
Abstract: In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods as special cases (e.g., VA-VAE, EQ-VAE). Experiments suggest that Spectrum Matching yields superior diffusion generation on CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at this https URL.

Title: Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models

Authors: Hendrik Chiche, Ludovic Corcos, Logan Rouge
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14667
Pdf URL: https://arxiv.org/pdf/2603.14667
Copy Paste: [[2603.14667]] Comparative Analysis of 3D Convolutional and 2.5D Slice-Conditioned U-Net Architectures for MRI Super-Resolution via Elucidated Diffusion Models(https://arxiv.org/abs/2603.14667)
Keywords: diffusion
Abstract: Magnetic resonance imaging (MRI) super-resolution (SR) methods that computationally enhance low-resolution acquisitions to approximate high-resolution quality offer a compelling alternative to expensive high-field scanners. In this work we investigate an elucidated diffusion model (EDM) framework for brain MRI SR and compare two U-Net backbone architectures: (i) a full 3D convolutional U-Net that processes volumetric patches with 3D convolutions and multi-head self-attention, and (ii) a 2.5D slice-conditioned U-Net that super-resolves each slice independently while conditioning on an adjacent slice for inter-slice context. Both models employ continuous-sigma noise conditioning following Karras et al. and are trained on the NKI cohort of the FOMO60K dataset. On a held-out test set of 5 subjects (6 volumes, 993 slices), the 3D model achieves 37.75 dB PSNR, 0.997 SSIM, and 0.020 LPIPS, improving on the off-the-shelf pretrained EDSR baseline (35.57 dB / 0.024 LPIPS) and the 2.5D variant (35.82 dB) across all three metrics under the same test data and degradation pipeline.

Title: MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model

Authors: Jinguang Tong, Jinbo Wu, Kaisiyuan Wang, Zhelun Shen, Xuan Huang, Mochu Xiang, Xuesong Li, Yingying Li, Haocheng Feng, Chen Zhao, Hang Zhou, Wei He, Chuong Nguyen, Jingdong Wang, Hongdong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14686
Pdf URL: https://arxiv.org/pdf/2603.14686
Copy Paste: [[2603.14686]] MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation Model(https://arxiv.org/abs/2603.14686)
Keywords: foundation model
Abstract: Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.

Title: AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild

Authors: Yiting Wang, Tim Brödermann, Hamed Haghighi, Haonan Zhao, Christos Sakaridis, Kurt Debattista, Valentina Donzella
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14701
Pdf URL: https://arxiv.org/pdf/2603.14701
Copy Paste: [[2603.14701]] AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild(https://arxiv.org/abs/2603.14701)
Keywords: foundation model
Abstract: Robust depth completion is fundamental to real-world 3D scene understanding, yet existing RGB-LiDAR fusion methods degrade significantly under adverse weather, where both camera images and LiDAR measurements suffer from weather-induced corruption. In this paper, we introduce AURORA-KITTI, the first large-scale multi-modal, multi-weather benchmark for robust depth completion in the wild. We further formulate Depth Completion and Denoising (DCD) as a unified task that jointly reconstructs a dense depth map from corrupted sparse inputs while suppressing weather-induced noise. AURORA-KITTI contains over 82K weather-consistent RGBL pairs with metric depth ground truth, spanning diverse weather types, three severity levels, day and night scenes, paired clean references, lens occlusion conditions, and textual descriptions. Moreover, we introduce DDCD, an efficient distillation-based baseline that leverages depth foundation models to inject clean structural priors into in-the-wild DCD training. DDCD achieves state-of-the-art performance on AURORA-KITTI and the real-world DENSE dataset while maintaining efficiency. Notably, our results further show that weather-aware, physically consistent data contributes more to robustness than architectural modifications alone. Data and code will be released upon publication.

Title: Fractal Autoregressive Depth Estimation with Continuous Token Diffusion

Authors: Jinchang Zhang, Xinrou Kang, Guoyu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14702
Pdf URL: https://arxiv.org/pdf/2603.14702
Copy Paste: [[2603.14702]] Fractal Autoregressive Depth Estimation with Continuous Token Diffusion(https://arxiv.org/abs/2603.14702)
Keywords: diffusion
Abstract: Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.

Title: Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning

Authors: Ping Chen, Xiang Liu, Xingpeng Zhang, Fei Shen, Xun Gong, Zhaoxiang Liu, Zezhou Chen, Huan Hu, Kai Wang, Shiguo Lian
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2603.14704
Pdf URL: https://arxiv.org/pdf/2603.14704
Copy Paste: [[2603.14704]] Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning(https://arxiv.org/abs/2603.14704)
Keywords: diffusion, generative
Abstract: Diffusion models operate in a reflexive System 1 mode, constrained by a fixed, content-agnostic sampling schedule. This rigidity arises from the curse of state dimensionality, where the combinatorial explosion of possible states in the high-dimensional noise manifold renders explicit trajectory planning intractable and leads to systematic computational misallocation. To address this, we introduce Chain-of-Trajectories (CoTj), a train-free framework enabling System 2 deliberative planning. Central to CoTj is Diffusion DNA, a low-dimensional signature that quantifies per-stage denoising difficulty and serves as a proxy for the high-dimensional state space, allowing us to reformulate sampling as graph planning on a directed acyclic graph. Through a Predict-Plan-Execute paradigm, CoTj dynamically allocates computational effort to the most challenging generative phases. Experiments across multiple generative models demonstrate that CoTj discovers context-aware trajectories, improving output quality and stability while reducing redundant computation. This work establishes a new foundation for resource-aware, planning-based diffusion modeling. The code is available at this https URL.

Title: Cross-RAG: Zero-Shot Retrieval-Augmented Time Series Forecasting via Cross-Attention

Authors: Seunghan Lee, Jaehoon Lee, Jun Seo, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14709
Pdf URL: https://arxiv.org/pdf/2603.14709
Copy Paste: [[2603.14709]] Cross-RAG: Zero-Shot Retrieval-Augmented Time Series Forecasting via Cross-Attention(https://arxiv.org/abs/2603.14709)
Keywords: foundation model
Abstract: Recent advances in time series foundation models (TSFMs) demonstrate strong expressive capacity through large-scale pretraining across diverse time series domains. Zero-shot time series forecasting with TSFMs, however, exhibits limited generalization to unseen datasets, which retrieval-augmented forecasting addresses by leveraging an external knowledge base. Existing approaches rely on a fixed number of retrieved samples that may introduce irrelevant information. To this end, we propose Cross-RAG, a zero-shot retrieval-augmented forecasting framework that selectively attends to query-relevant retrieved samples. Cross-RAG models input-level relevance between the query and retrieved samples via query-retrieval cross-attention, while jointly incorporating information from the query and retrieved samples. Extensive experiments demonstrate that Cross-RAG consistently improves zero-shot forecasting performance across various TSFMs and RAG methods, and additional analyses confirm its effectiveness across diverse retrieval scenarios. Code is available at this https URL.

Title: Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention

Authors: Jeffrey D. Varner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.14717
Pdf URL: https://arxiv.org/pdf/2603.14717
Copy Paste: [[2603.14717]] Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention(https://arxiv.org/abs/2603.14717)
Keywords: generative
Abstract: Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed-form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per-position amino acid frequencies.

Title: Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach

Authors: Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14727
Pdf URL: https://arxiv.org/pdf/2603.14727
Copy Paste: [[2603.14727]] Automated Diabetic Screening via Anterior Segment Ocular Imaging: A Deep Learning and Explainable AI Approach(https://arxiv.org/abs/2603.14727)
Keywords: self-supervised
Abstract: Diabetic retinopathy screening traditionally relies on fundus photography, requiring specialized equipment and expertise often unavailable in primary care and resource-limited settings. We developed and validated a deep learning (DL) system for automated diabetic classification using anterior segment ocular imaging, a readily accessible alternative utilizing standard photography equipment. The system leverages visible biomarkers in the iris, sclera, and conjunctiva that correlate with systemic diabetic status. We systematically evaluated five contemporary architectures (EfficientNet-V2-S with self-supervised learning (SSL), Vision Transformer, Swin Transformer, ConvNeXt-Base, and ResNet-50) on 2,640 clinically annotated anterior segment images spanning Normal, Controlled Diabetic, and Uncontrolled Diabetic categories. A tailored preprocessing pipeline combining specular reflection mitigation and contrast-limited adaptive histogram equalization (CLAHE) was implemented to enhance subtle vascular and textural patterns critical for classification. SSL using SimCLR on domain-specific ocular images substantially improved model [...]. EfficientNet-V2-S with SSL achieved optimal performance with an F1-score of 98.21%, precision of 97.90%, and recall of 98.55%, a substantial improvement over ImageNet-only initialization (94.63% F1). Notably, the model attained near-perfect precision (100%) for Normal classification, critical for minimizing unnecessary clinical referrals.

Title: DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning

Authors: Zhiyu Wang, Mohammad Goudarzi, Mingming Gong, Rajkumar Buyya
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2603.14729
Pdf URL: https://arxiv.org/pdf/2603.14729
Copy Paste: [[2603.14729]] DeFRiS: Silo-Cooperative IoT Applications Scheduling via Decentralized Federated Reinforcement Learning(https://arxiv.org/abs/2603.14729)
Keywords: anomaly
Abstract: Next-generation IoT applications increasingly span across autonomous administrative entities, necessitating silo-cooperative scheduling to leverage diverse computational resources while preserving data privacy. However, realizing efficient cooperation faces significant challenges arising from infrastructure heterogeneity, Non-IID workload shifts, and the inherent risks of adversarial environments. Existing approaches, relying predominantly on centralized coordination or independent learning, fail to address the incompatibility of state-action spaces across heterogeneous silos and lack robustness against malicious attacks. This paper proposes DeFRiS, a Decentralized Federated Reinforcement Learning framework for robust and scalable Silo-cooperative IoT application scheduling. DeFRiS integrates three synergistic innovations: (i) an action-space-agnostic policy utilizing candidate resource scoring to enable seamless knowledge transfer across heterogeneous silos; (ii) a silo-optimized local learning mechanism combining Generalized Advantage Estimation (GAE) with clipped policy updates to resolve sparse delayed reward challenges; and (iii) a Dual-Track Non-IID robust decentralized aggregation protocol leveraging gradient fingerprints for similarity-aware knowledge transfer and anomaly detection, and gradient tracking for optimization momentum. Extensive experiments on a distributed testbed with 20 heterogeneous silos and realistic IoT workloads demonstrate that DeFRiS significantly outperforms state-of-the-art baselines, reducing average response time by 6.4% and energy consumption by 7.2%, while lowering tail latency risk (CVaR$_{0.95}$) by 10.4% and achieving near-zero deadline violations. Furthermore, DeFRiS achieves over 3 times better performance retention as the system scales and over 8 times better stability in adversarial environments compared to the best-performing baseline.

Title: PHAC: Promptable Human Amodal Completion

Authors: Seung Young Noh, Ju Yong Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14741
Pdf URL: https://arxiv.org/pdf/2603.14741
Copy Paste: [[2603.14741]] PHAC: Promptable Human Amodal Completion(https://arxiv.org/abs/2603.14741)
Keywords: diffusion, generative
Abstract: Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.

Title: POLCA: Stochastic Generative Optimization with LLM

Authors: Xuanfei Ren, Allen Nie, Tengyang Xie, Ching-An Cheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14769
Pdf URL: https://arxiv.org/pdf/2603.14769
Copy Paste: [[2603.14769]] POLCA: Stochastic Generative Optimization with LLM(https://arxiv.org/abs/2603.14769)
Keywords: generative
Abstract: Optimizing complex systems, ranging from LLM prompts to multi-turn agents, traditionally requires labor-intensive manual iteration. We formalize this challenge as a stochastic generative optimization problem where a generative language model acts as the optimizer, guided by numerical rewards and text feedback to discover the best system. We introduce Prioritized Optimization with Local Contextual Aggregation (POLCA), a scalable framework designed to handle stochasticity in optimization -- such as noisy feedback, sampling minibatches, and stochastic system behaviors -- while effectively managing the unconstrained expansion of solution space. POLCA maintains a priority queue to manage the exploration-exploitation tradeoff, systematically tracking candidate solutions and their evaluation histories. To enhance efficiency, we integrate an $\varepsilon$-Net mechanism to maintain parameter diversity and an LLM Summarizer to perform meta-learning across historical trials. We theoretically prove that POLCA converges to near-optimal candidate solutions under stochasticity. We evaluate our framework on diverse benchmarks, including $\tau$-bench, HotpotQA (agent optimization), VeriBench (code translation) and KernelBench (CUDA kernel generation). Experimental results demonstrate that POLCA achieves robust, sample and time-efficient performance, consistently outperforming state-of-the-art algorithms in both deterministic and stochastic problems. The codebase for this work is publicly available at this https URL.

Title: AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

Authors: Longhui Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14770
Pdf URL: https://arxiv.org/pdf/2603.14770
Copy Paste: [[2603.14770]] AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas(https://arxiv.org/abs/2603.14770)
Keywords: diffusion
Abstract: Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains increasing as the number of identities grows. AnyPhoto also supports prompt-driven stylization with accurate placement, showing great potential application value.

Title: Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

Authors: Ernie Chu, Vishal M. Patel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14794
Pdf URL: https://arxiv.org/pdf/2603.14794
Copy Paste: [[2603.14794]] Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling(https://arxiv.org/abs/2603.14794)
Keywords: diffusion
Abstract: Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.

Title: RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models

Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14819
Pdf URL: https://arxiv.org/pdf/2603.14819
Copy Paste: [[2603.14819]] RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models(https://arxiv.org/abs/2603.14819)
Keywords: diffusion
Abstract: Transformer-based diffusion and vision-language models have achieved remarkable success; yet, efficiently removing undesirable or sensitive information without retraining remains a central challenge for model safety and compliance. We introduce Ratio-Aware Zero/One-step Optimized Retentive unlearning (RAZOR), a lightweight, model-agnostic unlearning framework that generalizes forgetting updates to coordinated multi-layer and multi-head edits within transformer backbones. RAZOR identifies the most important layers and attention heads by measuring how much they contribute to forgetting the target data while preserving useful knowledge. Then, it updates these parts of the model using a carefully regularized rule to avoid harming overall performance. The set of edited components grows gradually, ensuring precise unlearning without over-editing or damaging unrelated capabilities. We evaluate RAZOR on CLIP, Stable Diffusion, and vision-language models (VLMs) using widely adopted unlearning benchmarks covering identity, style, and object erasure tasks. Our results show that RAZOR achieves highly accurate and stable forgetting, even under quantization. This approach offers stronger retention and better efficiency than prior methods. Notably, it also operates significantly faster than conventional techniques. These results demonstrate that RAZOR is a practical and scalable solution for safe, adaptive unlearning in transformer-based vision models.

Title: Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting

Authors: Ziqing Ma, Kai Ying, Xinyue Gu, Tian Zhou, Tianyu Zhu, Haifan Zhang, Peisong Niu, Wang Zheng, Cong Bai, Liang Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14845
Pdf URL: https://arxiv.org/pdf/2603.14845
Copy Paste: [[2603.14845]] Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting(https://arxiv.org/abs/2603.14845)
Keywords: foundation model
Abstract: Accurate day-ahead solar irradiance forecasting is essential for integrating solar energy into the power grid. However, it remains challenging due to the pronounced diurnal cycle and inherently complex cloud dynamics. Current methods either lack fine-scale resolution (e.g., numerical weather prediction, weather foundation models) or degrade at longer lead times (e.g., satellite extrapolation). We propose Baguan-solar, a two-stage multimodal framework that fuses forecasts from Baguan, a global weather foundation model, with high-resolution geostationary satellite imagery to produce 24-hour irradiance forecasts at kilometer scale. Its decoupled two-stage design first forecasts day-night continuous intermediates (e.g., cloud cover) and then infers irradiance, while its modality fusion jointly preserves fine-scale cloud structures from satellite and large-scale constraints from Baguan forecasts. Evaluated over East Asia using CLDAS as ground truth, Baguan-solar outperforms strong baselines (including ECMWF IFS, vanilla Baguan, and SolarSeer), reducing RMSE by 16.08% and better resolving cloud-induced transients. An operational deployment of Baguan-solar has supported solar power forecasting in an eastern province in China since July 2025. Our code is accessible at this https URL.

Title: From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration

Authors: Ziwei Wei, Yao Shen, Wanheng Lu, Ghim Wei Ho, Kaiyang Zeng
Subjects: cs.CV, cond-mat.mes-hall
Abstract URL: https://arxiv.org/abs/2603.14850
Pdf URL: https://arxiv.org/pdf/2603.14850
Copy Paste: [[2603.14850]] From Artefact to Insight: Efficient Low-Rank Adaptation of BrushNet for Scanning Probe Microscopy Image Restoration(https://arxiv.org/abs/2603.14850)
Keywords: diffusion, generative
Abstract: Scanning Probe Microscopy (SPM) offers nanoscale resolution but is frequently marred by structured artefacts such as line-scan dropout, gain-induced noise, tip convolution, and phase hops. While most available methods treat SPM artefact removal as isolated denoising or interpolation tasks, the generative inpainting perspective remains largely unexplored. In this work, we introduce a diffusion-based inpainting framework tailored to scientific grayscale imagery. By fine-tuning less than 0.2 percent of BrushNet weights with rank-constrained low-rank adaptation (LoRA), we adapt a pretrained diffusion model using only 7390 (artefact, clean) pairs distilled from 739 experimental scans. On our forthcoming public SPM InpBench benchmark, the LoRA-enhanced model lifts the Peak Signal-to-Noise Ratio (PSNR) by 6.61 dB and halves the Learned Perceptual Image Patch Similarity (LPIPS) relative to zero-shot inference, while matching or slightly surpassing the accuracy of full retraining and remaining trainable on a single GPU instead of four high-memory cards. The approach generalizes across various SPM image channels including height, amplitude and phase, faithfully restores subtle structural details, and suppresses hallucination artefacts inherited from natural image priors. This lightweight framework enables efficient, scalable recovery of irreplaceable SPM images and paves the way for broader diffusion model adoption in nanoscopic imaging analysis.

Title: Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

Authors: Bingxue Zhang, Yang Gao, Feida Zhu, Yanyan Shen, Yang Shi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14860
Pdf URL: https://arxiv.org/pdf/2603.14860
Copy Paste: [[2603.14860]] Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats(https://arxiv.org/abs/2603.14860)
Keywords: diffusion, generative
Abstract: Generative AI deployment poses unprecedented challenges to content safety and privacy. However, existing defense mechanisms are often tailored to specific architectures (e.g., Diffusion Models or GANs), creating fragile "defense silos" that fail against heterogeneous generative threats. This paper identifies a fundamental optimization barrier in naive pixel-space ensemble strategies: due to divergent objective functions, pixel-level gradients from heterogeneous generators become statistically orthogonal, causing destructive interference. To overcome this, we observe that despite disparate low-level mechanisms, high-level feature representations of generated content exhibit alignment across architectures. Based on this, we propose the Architecture-Agnostic Targeted Feature Synergy (ATFS) framework. By introducing a target guidance image, ATFS reformulates multi-model defense as a unified feature space alignment task, enabling intrinsic gradient alignment without complex rectification. Extensive experiments show ATFS achieves SOTA protection in heterogeneous scenarios (e.g., Diffusion+GAN). It converges rapidly, reaching over 90% performance within 40 iterations, and maintains strong attack potency even under tight perturbation budgets. The framework seamlessly extends to unseen architectures (e.g., VQ-VAE) by switching the feature extractor, and demonstrates robust resistance to JPEG compression and scaling. Being computationally efficient and lightweight, ATFS offers a viable pathway to dismantle defense silos and enable universal generative security. Code and models are open-sourced for reproducibility.

Title: IgPose: A Generative Data-Augmented Pipeline for Robust Immunoglobulin-Antigen Binding Prediction

Authors: Tien-Cuong Bui, Injae Chung, Wonjun Lee, Junsu Ko, Juyong Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14870
Pdf URL: https://arxiv.org/pdf/2603.14870
Copy Paste: [[2603.14870]] IgPose: A Generative Data-Augmented Pipeline for Robust Immunoglobulin-Antigen Binding Prediction(https://arxiv.org/abs/2603.14870)
Keywords: generative
Abstract: Predicting immunoglobulin-antigen (Ig-Ag) binding remains a significant challenge due to the paucity of experimentally-resolved complexes and the limited accuracy of de novo Ig structure prediction. We introduce IgPose, a generalizable framework for Ig-Ag pose identification and scoring, built on a generative data-augmentation pipeline. To mitigate data scarcity, we constructed the Structural Immunoglobulin Decoy Database (SIDD), a comprehensive repository of high-fidelity synthetic decoys. IgPose integrates equivariant graph neural networks, ESM-2 embeddings, and gated recurrent units to synergistically capture both geometric and evolutionary features. We implemented interface-focused k-hop sampling with biologically guided pooling to enhance generalization across diverse interfaces. The framework comprises two sub-networks--IgPoseClassifier for binding pose discrimination and IgPoseScore for DockQ score estimation--and achieves robust performance on curated internal test sets and the CASP-16 benchmark compared to physics and deep learning baselines. IgPose serves as a versatile computational tool for high-throughput antibody discovery pipelines by providing accurate pose filtering and ranking. IgPose is available on GitHub (this https URL).

Title: Seismic full-waveform inversion based on a physics-driven generative adversarial network

Authors: Xinyi Zhang, Caiyun Liu, Jie Xiong, Qingfeng Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14879
Pdf URL: https://arxiv.org/pdf/2603.14879
Copy Paste: [[2603.14879]] Seismic full-waveform inversion based on a physics-driven generative adversarial network(https://arxiv.org/abs/2603.14879)
Keywords: generative
Abstract: Objectives: Full-waveform inversion (FWI) is a high-resolution geophysical imaging technique that reconstructs subsurface velocity models by iteratively minimizing the misfit between predicted and observed seismic data. However, under complex geological conditions, conventional FWI suffers from strong dependence on the initial model and tends to produce unstable results when the data are sparse or contaminated by noise. Methods: To address these limitations, this paper proposes a physics-driven generative adversarial network-based full-waveform inversion method. The proposed approach integrates the data-driven capability of deep neural networks with the physical constraints imposed by the seismic wave equation, and employs adversarial training through a discriminator to enhance the stability and robustness of the inversion results. Results: Experimental results on two representative benchmark geological models demonstrate that the proposed method can effectively recover complex velocity structures and achieves superior performance in terms of structural similarity (SSIM) and signal-to-noise ratio (SNR). Conclusions: This method provides a promising solution for alleviating the initial-model dependence in full-waveform inversion and shows strong potential for practical applications.

Title: SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

Authors: Huanjing Yue, Shangbin Xie, Cong Cao, Qian Wu, Lei Zhang, Lei Zhao, Jingyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14885
Pdf URL: https://arxiv.org/pdf/2603.14885
Copy Paste: [[2603.14885]] SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras(https://arxiv.org/abs/2603.14885)
Keywords: diffusion
Abstract: RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to adapt to different camera-specific ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection. Our code and model are available at this https URL.

Title: Workflow-Aware Structured Layer Decomposition for Illustration Production

Authors: Tianyu Zhang, Dongchi Li, Keiichi Sawada, Haoran Xie
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2603.14925
Pdf URL: https://arxiv.org/pdf/2603.14925
Copy Paste: [[2603.14925]] Workflow-Aware Structured Layer Decomposition for Illustration Production(https://arxiv.org/abs/2603.14925)
Keywords: generative
Abstract: Recent generative image editing methods adopt layered representations to mitigate the entangled nature of raster images and improve controllability, typically relying on object-based segmentation. However, such strategies may fail to capture the structural and stylized properties of human-created images, such as anime illustrations. To solve this issue, we propose a workflow-aware structured layer decomposition framework tailored to the illustration production of anime artwork. Inspired by the creation pipeline of anime production, our method decomposes the illustration into semantically meaningful production layers, including line art, flat color, shadow, and highlight. To decouple all these layers, we introduce lightweight layer semantic embeddings to provide specific task guidance for each layer. Furthermore, a set of layer-wise losses is incorporated to supervise the training process of individual layers. To overcome the lack of ground-truth layered data, we construct a high-quality illustration dataset that simulates the standard anime production workflow. Experiments demonstrate that our method achieves accurate and visually coherent layer decompositions. We believe that the resulting layered representation further enables downstream tasks such as recoloring and texture embedding, supporting content creation and illustration editing. Code is available at: this https URL

Title: Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework

Authors: Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14936
Pdf URL: https://arxiv.org/pdf/2603.14936
Copy Paste: [[2603.14936]] Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework(https://arxiv.org/abs/2603.14936)
Keywords: diffusion, generative
Abstract: Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load and interpretable preference inference while remaining training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.

Title: LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs

Authors: Ying Zhang, Hang Yu, Haipeng Zhang, Peng Di
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.14937
Pdf URL: https://arxiv.org/pdf/2603.14937
Copy Paste: [[2603.14937]] LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs(https://arxiv.org/abs/2603.14937)
Keywords: generative
Abstract: Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node's raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.

Title: FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

Authors: Yaoru Li, Federico Landi, Marco Godi, Xin Jin, Ruiju Fu, Yufei Ma, Muyang Sun, Heyu Si, Qi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14938
Pdf URL: https://arxiv.org/pdf/2603.14938
Copy Paste: [[2603.14938]] FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving(https://arxiv.org/abs/2603.14938)
Keywords: diffusion, generative
Abstract: Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.

Title: CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

Authors: Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand, Bingnan Li, Haiyang Xu, Zhuowen Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14957
Pdf URL: https://arxiv.org/pdf/2603.14957
Copy Paste: [[2603.14957]] CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models(https://arxiv.org/abs/2603.14957)
Keywords: foundation model
Abstract: We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.

Title: GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis

Authors: Minjun Kang, Inkyu Shin, Taeyeop Lee, Myungchul Kim, In So Kweon, Kuk-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.14965
Pdf URL: https://arxiv.org/pdf/2603.14965
Copy Paste: [[2603.14965]] GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis(https://arxiv.org/abs/2603.14965)
Keywords: diffusion
Abstract: Novel view synthesis requires strong 3D geometric consistency and the ability to generate visually coherent images across diverse viewpoints. While recent camera-controlled video diffusion models show promising results, they often suffer from geometric distortions and limited camera controllability. To overcome these challenges, we introduce GeoNVS, a geometry-grounded novel-view synthesizer that enhances both geometric fidelity and camera controllability through explicit 3D geometric guidance. Our key innovation is the Gaussian Splat Feature Adapter (GS-Adapter), which lifts input-view diffusion features into 3D Gaussian representations, renders geometry-constrained novel-view features, and adaptively fuses them with diffusion features to correct geometrically inconsistent representations. Unlike prior methods that inject geometry at the input level, GS-Adapter operates in feature space, avoiding view-dependent color noise that degrades structural consistency. Its plug-and-play design enables zero-shot compatibility with diverse feed-forward geometry models without additional training, and can be adapted to other video diffusion backbones. Experiments across 9 scenes and 18 settings demonstrate state-of-the-art performance, achieving 11.3% and 14.9% improvements over SEVA and CameraCtrl, with up to 2x reduction in translation error and 7x in Chamfer Distance.

Title: Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning

Authors: Nasrin Rahimi, Mısra Yavuz, Burak Can Biner, Yunus Bilge Kurt, Ahmet Rasim Emirdağı, Süleyman Aslan, Görkay Aydemir, M. Akın Yılmaz, A. Murat Tekalp
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15003
Pdf URL: https://arxiv.org/pdf/2603.15003
Copy Paste: [[2603.15003]] Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning(https://arxiv.org/abs/2603.15003)
Keywords: foundation model
Abstract: Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation, without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed solely for static instruction-based edits, can be adapted for Video Frame Interpolation (VFI) using only 64-256 training samples via Low-Rank Adaptation (LoRA). Our core contribution is revealing that the model's inherent understanding of "how objects transform" in static scenes contains latent temporal reasoning that can be activated through few-shot fine-tuning. While the baseline model completely fails at producing coherent intermediate frames, our parameter-efficient adaptation successfully unlocks its interpolation capability. Rather than competing with task-specific VFI methods trained from scratch on massive datasets, our work establishes that foundation image editing models possess untapped potential for temporal tasks, offering a data-efficient pathway for video synthesis in resource-constrained scenarios. This bridges the gap between image manipulation and video understanding, suggesting that spatial and temporal reasoning may be more intertwined in foundation models than previously recognized.

Title: TrajFlow: Nation-wide Pseudo GPS Trajectory Generation with Flow Matching Models

Authors: Peiran Li, Jiawei Wang, Haoran Zhang, Xiaodan Shi, Noboru Koshizuka, Chihiro Shimizu, Renhe Jiang
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.15009
Pdf URL: https://arxiv.org/pdf/2603.15009
Copy Paste: [[2603.15009]] TrajFlow: Nation-wide Pseudo GPS Trajectory Generation with Flow Matching Models(https://arxiv.org/abs/2603.15009)
Keywords: diffusion, generative
Abstract: The importance of mobile phone GPS trajectory data is widely recognized across many fields, yet the use of real data is often hindered by privacy concerns, limited accessibility, and high acquisition costs. As a result, generating pseudo-GPS trajectory data has become an active area of research. Recent diffusion-based approaches have achieved strong fidelity but remain limited in spatial scale (small urban areas), transportation-mode diversity, and efficiency (requiring numerous sampling steps). To address these challenges, we introduce TrajFlow, which to the best of our knowledge is the first flow-matching-based generative model for GPS trajectory generation. TrajFlow leverages the flow-matching paradigm to improve robustness and efficiency across multiple geospatial scales, and incorporates a trajectory harmonization and reconstruction strategy to jointly address scalability, diversity, and efficiency. Using a nationwide mobile phone GPS dataset with millions of trajectories across Japan, we show that TrajFlow or its variants consistently outperform diffusion-based and deep generative baselines at urban, metropolitan, and nationwide levels. As the first nationwide, multi-scale GPS trajectory generation model, TrajFlow demonstrates strong potential to support inter-region urban planning, traffic management, and disaster response, thereby advancing the resilience and intelligence of future mobility systems.

Title: Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15026
Pdf URL: https://arxiv.org/pdf/2603.15026
Copy Paste: [[2603.15026]] Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods(https://arxiv.org/abs/2603.15026)
Keywords: generative
Abstract: Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at this https URL.

Title: Interpretable Predictability-Based AI Text Detection: A Replication Study

Authors: Adam Skurla, Dominik Macko, Jakub Simko
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15034
Pdf URL: https://arxiv.org/pdf/2603.15034
Copy Paste: [[2603.15034]] Interpretable Predictability-Based AI Text Detection: A Replication Study(https://arxiv.org/abs/2603.15034)
Keywords: generative
Abstract: This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves results that are comparable to or better than those of language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.

Title: Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15061
Pdf URL: https://arxiv.org/pdf/2603.15061
Copy Paste: [[2603.15061]] Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization(https://arxiv.org/abs/2603.15061)
Keywords: generative
Abstract: As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.

Title: ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Authors: Cheng Luo, Bizhu Wu, Bing Li, Jianfeng Ren, Ruibin Bai, Rong Qu, Linlin Shen, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.HC, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.15083
Pdf URL: https://arxiv.org/pdf/2603.15083
Copy Paste: [[2603.15083]] ReactMotion: Generating Reactive Listener Motions from Speaker Utterance(https://arxiv.org/abs/2603.15083)
Keywords: generative
Abstract: In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluate reactive appropriateness, which conventional motion metrics focusing on input-motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

Title: Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC

Authors: Alice Natalina Caragliano, Giulia Farina, Fatih Aksu, Camillo Maria Caruso, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15100
Pdf URL: https://arxiv.org/pdf/2603.15100
Copy Paste: [[2603.15100]] Learning from Limited and Incomplete Data: A Multimodal Framework for Predicting Pathological Response in NSCLC(https://arxiv.org/abs/2603.15100)
Keywords: foundation model
Abstract: Major pathological response (pR) following neoadjuvant therapy is a clinically meaningful endpoint in non-small cell lung cancer, strongly associated with improved survival. However, accurate preoperative prediction of pR remains challenging, particularly in real-world clinical settings characterized by limited data availability and incomplete clinical profiles. In this study, we propose a multimodal deep learning framework designed to address these constraints by integrating foundation model-based CT feature extraction with a missing-aware architecture for clinical variables. This approach enables robust learning from small cohorts while explicitly modeling missing clinical information, without relying on conventional imputation strategies. A weighted fusion mechanism is employed to leverage the complementary contributions of imaging and clinical modalities, yielding a multimodal model that consistently outperforms both unimodal imaging and clinical baselines. These findings underscore the added value of integrating heterogeneous data sources and highlight the potential of multimodal, missing-aware systems to support pR prediction under realistic clinical conditions.

Title: VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Authors: Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15118
Pdf URL: https://arxiv.org/pdf/2603.15118
Copy Paste: [[2603.15118]] VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents(https://arxiv.org/abs/2603.15118)
Keywords: foundation model
Abstract: We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.

Title: A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation

Authors: Nevrez Imamoglu, Ali Caglayan, Toru Kouyama
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15119
Pdf URL: https://arxiv.org/pdf/2603.15119
Copy Paste: [[2603.15119]] A Tutorial on ALOS2 SAR Utilization: Dataset Preparation, Self-Supervised Pretraining, and Semantic Segmentation(https://arxiv.org/abs/2603.15119)
Keywords: self-supervised, foundation model
Abstract: Masked auto-encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR-W-MixMAE, which adds SAR-specific intensity-weighted loss to standard MixMAE for pretraining, we also introduce SAR-W-SimMIM, a weighted variant of SimMIM applied to ALOS-2 single-channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self-supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR-W-MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine-tuning models on satellite imagery pose unique challenges, particularly when developing region-specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS-2 single-channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national-scale foundation model. This dataset was used to pretrain a vision transformer-based autoencoder, with the resulting encoder fine-tuned for semantic segmentation using a task-specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to processing and preparing ALOS2 observations into a dataset that can be used for self-supervised pretraining of models and for fine-tuning on downstream tasks such as semantic segmentation.

Title: Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors

Authors: Yunuo Chen, Chuqin Zhou, Jiangchuan Li, Xiaoyue Ling, Bing He, Jincheng Dai, Li Song, Guo Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15129
Pdf URL: https://arxiv.org/pdf/2603.15129
Copy Paste: [[2603.15129]] Next-Frame Decoding for Ultra-Low-Bitrate Image Compression with Video Diffusion Priors(https://arxiv.org/abs/2603.15129)
Keywords: diffusion, generative
Abstract: We present a novel paradigm for ultra-low-bitrate image compression (ULB-IC) that exploits the "temporal" evolution in generative image compression. Specifically, we define an explicit intermediate state during decoding: a compact anchor frame, which preserves the scene geometry and semantic layout while discarding high-frequency details. We then reinterpret generative decoding as a virtual temporal transition from this anchor to the final reconstructed image. To model this progression, we leverage a pretrained video diffusion model (VDM) as temporal priors: the anchor frame serves as the initial frame and the original image as the target frame, transforming the decoding process into a next-frame prediction task. In contrast to image diffusion-based ULB-IC models, our decoding proceeds from a visible, semantically faithful anchor, which improves both fidelity and realism for perceptual image compression. Extensive experiments demonstrate that our method achieves superior objective and subjective performance. On the CLIC2020 test set, our method achieves over 50% bitrate savings across LPIPS, DISTS, FID, and KID compared to DiffC, while also delivering a significant decoding speedup of up to 5x. Code will be released later.

Title: WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Authors: Hainuo Wang, Mingjia Li, Xiaojie Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15132
Pdf URL: https://arxiv.org/pdf/2603.15132
Copy Paste: [[2603.15132]] WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation(https://arxiv.org/abs/2603.15132)
Keywords: diffusion
Abstract: While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at this https URL.

Title: Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Authors: Mumuksh Tayal, Manan Tayal, Ravi Prakash
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15136
Pdf URL: https://arxiv.org/pdf/2603.15136
Copy Paste: [[2603.15136]] Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies(https://arxiv.org/abs/2603.15136)
Keywords: diffusion, generative
Abstract: Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends FQL to safe offline RL by combining a Hamilton--Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
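
Note: one common way to calibrate a threshold with finite-sample coverage is split conformal prediction. The sketch below is a generic illustration of that technique, not SafeFQL's actual procedure; the function name `conformal_threshold`, the meaning of the calibration scores, and the miscoverage level `alpha` are all assumptions made for this example.

```python
import math
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold (generic sketch, hypothetical helper).

    Given n held-out nonconformity scores that are exchangeable with a
    future score, the returned value bounds the future score with
    probability at least 1 - alpha.
    """
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    # Finite-sample correction: use the ceil((n + 1) * (1 - alpha))-th
    # smallest score rather than the plain empirical quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return float("inf")
    return float(scores[rank - 1])
```

A learned safety threshold could then be tightened by such a margin, with the scores measuring how far the learned safety value was off on held-out data; that mapping is again an assumption here.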

Title: SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, Jason Kuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15150
Pdf URL: https://arxiv.org/pdf/2603.15150
Copy Paste: [[2603.15150]] SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation(https://arxiv.org/abs/2603.15150)
Keywords: generative
Abstract: Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
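
Note: the soft-target idea in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the neighborhood is taken to be the `k` nearest codes, and "proximity" is modeled as a softmax over negative squared Euclidean distance with a `temperature`. The function names and both parameters are hypothetical.

```python
import numpy as np

def soft_neighbor_targets(codebook, target_embedding, k=4, temperature=1.0):
    """Soft categorical target over the k codes nearest to the target.

    Assumption: proximity is a softmax over negative squared distance.
    """
    dists = np.sum((codebook - target_embedding) ** 2, axis=1)
    neighbors = np.argsort(dists)[:k]          # indices of the k nearest codes
    logits = -dists[neighbors] / temperature
    logits = logits - logits.max()             # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    targets = np.zeros(len(codebook))
    targets[neighbors] = weights               # zero mass outside the neighborhood
    return targets

def soft_cross_entropy(logits, targets):
    """Cross entropy between model logits and a soft target distribution."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-np.sum(targets * log_probs))
```

With `k=1` the target collapses to the usual one-hot vector, so the standard cross-entropy objective is a special case of this sketch.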

Title: PiGRAND: Physics-informed Graph Neural Diffusion for Intelligent Additive Manufacturing

Authors: Benjamin Uhrich, Tim Häntschel, Erhard Rahm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.15194
Pdf URL: https://arxiv.org/pdf/2603.15194
Copy Paste: [[2603.15194]] PiGRAND: Physics-informed Graph Neural Diffusion for Intelligent Additive Manufacturing(https://arxiv.org/abs/2603.15194)
Keywords: diffusion
Abstract: A comprehensive understanding of heat transport is essential for optimizing various mechanical and engineering applications, including 3D printing. Recent advances in machine learning, combined with physics-based models, have enabled a powerful fusion of numerical methods and data-driven algorithms. This progress is driven by the limited availability of sensor data in various engineering and scientific domains, where the cost of data collection and the inaccessibility of certain measurements are high. To this end, we present PiGRAND, a Physics-informed graph neural diffusion framework. In order to reduce the computational complexity of graph learning, an efficient graph construction procedure was developed. Our approach is inspired by the explicit Euler and implicit Crank-Nicolson methods for modeling continuous heat transport, leveraging sub-learning models to ensure accurate diffusion across graph nodes. To enhance computational performance, our approach is combined with efficient transfer learning. We evaluate PiGRAND on thermal images from 3D printing, demonstrating significant improvements in prediction accuracy and computational performance compared to traditional graph neural diffusion (GRAND) and physics-informed neural networks (PINNs). These enhancements are attributed to the incorporation of physical principles derived from the theoretical study of partial differential equations (PDEs) into the learning model. The PiGRAND code is open-sourced on GitHub: this https URL

Title: Towards Foundation Models for Consensus Rank Aggregation

Authors: Yijun Jin, Simon Klüttermann, Chiara Balestra, Emmanuel Müller
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2603.15218
Pdf URL: https://arxiv.org/pdf/2603.15218
Copy Paste: [[2603.15218]] Towards Foundation Models for Consensus Rank Aggregation(https://arxiv.org/abs/2603.15218)
Keywords: foundation model
Abstract: Aggregating a consensus ranking from multiple input rankings is a fundamental problem with applications in recommendation systems, search engines, job recruitment, and elections. Despite decades of research in consensus ranking aggregation, minimizing the Kemeny distance remains computationally intractable. Specifically, determining an optimal aggregation of rankings with respect to the Kemeny distance is an NP-hard problem, limiting its practical application to relatively small-scale instances. We propose the Kemeny Transformer, a novel Transformer-based algorithm trained via reinforcement learning to efficiently approximate the Kemeny optimal ranking. Experimental results demonstrate that our model outperforms classical majority-heuristic and Markov-chain approaches, achieving substantially faster inference than integer linear programming solvers. Our approach thus offers a practical, scalable alternative for real-world ranking-aggregation tasks.
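
Note: for readers unfamiliar with the objective, the Kemeny consensus is the ranking that minimizes the total Kendall tau distance (number of pairwise disagreements) to the input rankings. The brute-force sketch below illustrates the definition only; it enumerates all permutations, so it is practical just for a handful of items, which is the intractability the abstract refers to. It is not the Kemeny Transformer, and the function names are chosen for this example.

```python
from itertools import combinations, permutations

def kendall_tau_distance(a, b):
    """Count the item pairs that rankings a and b order differently."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def kemeny_consensus(rankings):
    """Exhaustive search for a Kemeny-optimal ranking (factorial time)."""
    items = rankings[0]

    def total_distance(candidate):
        return sum(kendall_tau_distance(candidate, r) for r in rankings)

    best = min(permutations(items), key=total_distance)
    return list(best), total_distance(best)

rankings = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
consensus, cost = kemeny_consensus(rankings)
print(consensus, cost)
```

Ties between equally good rankings are broken here by permutation order; a learned approximation replaces the exhaustive `min` with a model that outputs a ranking directly.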

Title: Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Authors: Yao Gu, Xiaohao Xu, Yingna Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15237
Pdf URL: https://arxiv.org/pdf/2603.15237
Copy Paste: [[2603.15237]] Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection(https://arxiv.org/abs/2603.15237)
Keywords: anomaly
Abstract: Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

Title: In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks

Authors: Francesco Sovrano, Lidia Losavio, Giulia Vilone, Marc Langheinrich
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15250
Pdf URL: https://arxiv.org/pdf/2603.15250
Copy Paste: [[2603.15250]] In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks(https://arxiv.org/abs/2603.15250)
Keywords: in-context
Abstract: Symbolic regression aims to replace black-box predictors with concise analytical expressions that can be inspected and validated in scientific machine learning. Kolmogorov-Arnold Networks (KANs) are well suited to this goal because each connection between adjacent units (an "edge") is parametrised by a learnable univariate function that can, in principle, be replaced by a symbolic operator. In practice, however, symbolic extraction is a bottleneck: the standard KAN-to-symbol approach fits operators to each learned edge function in isolation, making the discrete choice sensitive to initialisation and non-convex parameter fitting, and ignoring how local substitutions interact through the full network. We study in-context symbolic regression for operator extraction in KANs, and present two complementary instantiations. Greedy in-context Symbolic Regression (GSR) performs greedy, in-context selection by choosing edge replacements according to end-to-end loss improvement after brief fine-tuning. Gated Matching Pursuit (GMP) amortises this in-context selection by training a differentiable gated operator layer that places an operator library behind sparse gates on each edge; after convergence, gates are discretised (optionally followed by a short in-context greedy refinement pass). We quantify robustness via one-factor-at-a-time (OFAT) hyper-parameter sweeps and assess both predictive error and qualitative consistency of recovered formulas. Across several experiments, greedy in-context symbolic regression achieves up to 99.8% reduction in median OFAT test MSE.

Title: IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning

Authors: Konstantinos Almpanakis, Anna Kreshuk
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15263
Pdf URL: https://arxiv.org/pdf/2603.15263
Copy Paste: [[2603.15263]] IConE: Batch Independent Collapse Prevention for Self-Supervised Representation Learning(https://arxiv.org/abs/2603.15263)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has revolutionized representation learning, with Joint-Embedding Architectures (JEAs) emerging as an effective approach for capturing semantic features. Existing JEAs rely on implicit or explicit batch interaction -- via negative sampling or statistical regularization -- to prevent representation collapse. This reliance becomes problematic in regimes where batch sizes must be small, such as high-dimensional scientific data, where memory constraints and class imbalance make large, well-balanced batches infeasible. We introduce IConE (Instance-Contrasted Embeddings), a framework that decouples collapse prevention from the training batch size. Rather than enforcing diversity through batch statistics, IConE maintains a global set of learnable auxiliary instance embeddings regularized by an explicit diversity objective. This transfers the anti-collapse mechanism from the transient batch to a dataset-level embedding space, allowing stable training even when batch statistics are unreliable, down to batch size 1. Across diverse 2D and 3D biomedical modalities, IConE outperforms strong contrastive and non-contrastive baselines throughout the small-batch regime (from B=1 to B=64) and demonstrates marked robustness to severe class imbalance. Geometric analysis shows that IConE preserves high intrinsic dimensionality in the learned representations, preventing the collapse observed in existing JEAs as batch sizes shrink.

Title: Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels

Authors: Victor Wåhlstrand, Jennifer Alvén, Ida Häggström
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15267
Pdf URL: https://arxiv.org/pdf/2603.15267
Copy Paste: [[2603.15267]] Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels(https://arxiv.org/abs/2603.15267)
Keywords: diffusion
Abstract: We present a framework to take advantage of existing labels at inference, called exemplars, in order to improve the performance of object detection in medical images. The method, exemplar diffusion, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits are openly available online: this https URL

Title: Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Authors: Kim Ouan, Noémie Moreau, Katarzyna Bozek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15269
Pdf URL: https://arxiv.org/pdf/2603.15269
Copy Paste: [[2603.15269]] Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps(https://arxiv.org/abs/2603.15269)
Keywords: self-supervised
Abstract: The tortuosity of corneal nerve fibers is used as an indicator of various diseases. Current state-of-the-art methods for grading tortuosity rely heavily on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

Title: Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Authors: Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15271
Pdf URL: https://arxiv.org/pdf/2603.15271
Copy Paste: [[2603.15271]] Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models(https://arxiv.org/abs/2603.15271)
Keywords: diffusion, generative
Abstract: Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at this https URL.

Title: Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling

Authors: Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.15279
Pdf URL: https://arxiv.org/pdf/2603.15279
Copy Paste: [[2603.15279]] Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling(https://arxiv.org/abs/2603.15279)
Keywords: diffusion, generative
Abstract: Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.
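
The minibatch OT coupling that this abstract builds on can be sketched as follows. This is a minimal illustration and not the LOOM-CFM method itself: it shows only the per-batch reassignment of noise samples to data samples, assumes a squared Euclidean cost, and uses a brute-force search over permutations that is practical only for tiny batches (an assignment solver would normally be used instead). The function names are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_ot_pairing(data, noise):
    """Pair each data point with one noise point so the total cost is minimal.

    For two equal-size batches with uniform weights, the optimal transport
    plan is a permutation, so an exhaustive search over permutations is exact.
    Returns (perm, cost) where data[i] is paired with noise[perm[i]].
    """
    if len(data) != len(noise):
        raise ValueError("data and noise batches must have the same size")
    n = len(data)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(data[i], noise[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


if __name__ == "__main__":
    data = [(0.0, 0.0), (10.0, 10.0)]
    noise = [(9.0, 9.0), (1.0, 1.0)]
    print(minibatch_ot_pairing(data, noise))
```

Because the search is confined to one batch, the pairing can only be as good as the samples that happen to be drawn together, which is the limitation the abstract attributes to the minibatch approach.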

Title: GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection

Authors: Aggelos Psiris, Yannis Panagakis, Maria Vakalopoulou, Georgios Th. Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15300
Pdf URL: https://arxiv.org/pdf/2603.15300
Copy Paste: [[2603.15300]] GATE-AD: Graph Attention Network Encoding For Few-Shot Industrial Visual Anomaly Detection(https://arxiv.org/abs/2603.15300)
Keywords: anomaly
Abstract: Few-Shot Industrial Visual Anomaly Detection (FS-IVAD) comprises a critical task in modern manufacturing settings, where automated product inspection systems need to identify rare defects using only a handful of normal/defect-free training samples. In this context, the current study introduces a novel reconstruction-based approach termed GATE-AD. In particular, the proposed framework relies on the employment of a masked, representation-aligned Graph Attention Network (GAT) encoding scheme to learn robust appearance patterns of normal samples. By leveraging dense, patch-level, visual feature tokens as graph nodes, the model employs stacked self-attentional layers to adaptively encode complex, irregular, non-Euclidean, local relations. The graph is enhanced with a representation alignment component grounded on a learnable, latent space, where high reconstruction residual areas (i.e., defects) are assessed using a Scaled Cosine Error (SCE) objective function. Extensive comparative evaluation on the MVTec AD, VisA, and MPDD industrial defect detection benchmarks demonstrates that GATE-AD achieves state-of-the-art performance across the $1$- to $8$-shot settings, combining the highest detection accuracy (increase up to $1.8\%$ in image AUROC in the 8-shot case in MPDD) with the lowest per-image inference latency (at least $25.05\%$ faster), compared to the best-performing literature methods. In order to facilitate reproducibility and further research, the source code of GATE-AD is available at this https URL.
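
As a rough illustration of a scaled cosine error, the sketch below uses the form popularized by masked graph autoencoders, (1 - cos(x, z))^gamma per token. The abstract does not state the exact formulation or exponent used in GATE-AD, so the default `gamma=2.0` and the function name are assumptions.

```python
import math


def scaled_cosine_error(x, z, gamma=2.0, eps=1e-12):
    """Return (1 - cosine_similarity(x, z)) ** gamma for two vectors.

    gamma >= 1 down-weights tokens that are already well reconstructed.
    The exponent value is an assumption, not taken from the abstract.
    """
    dot = sum(a * b for a, b in zip(x, z))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_z = math.sqrt(sum(b * b for b in z))
    cos = dot / max(norm_x * norm_z, eps)
    # Clamp against rounding so the base of the power stays in [0, 2].
    cos = max(-1.0, min(1.0, cos))
    return (1.0 - cos) ** gamma


if __name__ == "__main__":
    print(scaled_cosine_error([1.0, 0.0], [1.0, 0.0]))   # aligned vectors
    print(scaled_cosine_error([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
    print(scaled_cosine_error([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

Under this form, a patch whose reconstruction points in a different direction from its original feature receives a large error, which is consistent with using reconstruction residuals to flag defects.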

Title: Generative Video Compression with One-Dimensional Latent Representation

Authors: Zihan Zheng, Zhaoyang Jia, Naifu Xue, Jiahao Li, Bin Li, Zongyu Guo, Xiaoyi Zhang, Zhenghao Chen, Houqiang Li, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15302
Pdf URL: https://arxiv.org/pdf/2603.15302
Copy Paste: [[2603.15302]] Generative Video Compression with One-Dimensional Latent Representation(https://arxiv.org/abs/2603.15302)
Keywords: generative
Abstract: Recent advancements in generative video codecs (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extremely compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, achieving bitrate reductions of 60.4\% under LPIPS and 68.8\% under DISTS on the HEVC Class B dataset, surpassing the previous video compression this http URL: this https URL

Title: DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Authors: Xueyu Zhou, Yangrong Hu, Jian Huang
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2603.15340
Pdf URL: https://arxiv.org/pdf/2603.15340
Copy Paste: [[2603.15340]] DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models(https://arxiv.org/abs/2603.15340)
Keywords: diffusion
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.

Title: Unsupervised Cross-Protocol Anomaly Analysis in Mobile Core Networks via Multi-Embedding Models Consensus

Authors: Aayush Garg, Orlando Amaral Cejas
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.15344
Pdf URL: https://arxiv.org/pdf/2603.15344
Copy Paste: [[2603.15344]] Unsupervised Cross-Protocol Anomaly Analysis in Mobile Core Networks via Multi-Embedding Models Consensus(https://arxiv.org/abs/2603.15344)
Keywords: anomaly
Abstract: Mobile core networks rely on several signalling protocols in parallel, such as SS7, Diameter, and GTP, so many security-relevant problems become visible only when their interactions are analyzed jointly. At the same time, labeled examples of real attacks and cross-protocol misconfigurations are scarce, which complicates supervised detection. We therefore study unsupervised cross-protocol anomaly analysis on fused representations that combine SS7, Diameter, and GTP signalling. For each subscriber, we aggregate messages into per-minute fused records, serialize each record as text, embed it with several models, and apply unsupervised anomaly detection. We then assign each record a consensus score equal to the number of embedding models that flag it as anomalous. For evaluation, we generate cross-protocol-plausible synthetic anomalies by swapping one field group at a time between pairs of records, preserving per-message validity while making the fused view contradictory. On 219,294 fused records, 44.15% are flagged by at least one model, but only 0.97% reach full agreement across all six. Higher consensus is strongly associated with synthetic records, where for k=1-4 the odds that a flagged record is synthetic are hundreds of times greater than for original records, and for k>=5 all flagged records are synthetic, with extremely small p-values. Cosine distances between synthetic and original records also increase with consensus, suggesting clearer separation in embedding space. These results support the use of multi-embedding consensus to prioritize a much smaller set of candidate cross-protocol inconsistencies for further inspection.
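
The consensus score described above (the number of embedding models that flag a record as anomalous) can be sketched in a few lines. The sketch assumes each model's detector has already produced one boolean flag per record; the model names, data layout, and function names are hypothetical.

```python
from typing import Dict, List


def consensus_scores(flags_by_model: Dict[str, List[bool]]) -> List[int]:
    """For each record, count how many models flagged it as anomalous."""
    per_model = list(flags_by_model.values())
    if not per_model:
        return []
    n_records = len(per_model[0])
    if any(len(flags) != n_records for flags in per_model):
        raise ValueError("every model must score the same set of records")
    return [sum(1 for flags in per_model if flags[i]) for i in range(n_records)]


def share_at_least(scores: List[int], k: int) -> float:
    """Fraction of records whose consensus score is at least k."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= k) / len(scores)


if __name__ == "__main__":
    flags = {
        "model_a": [True, False, True],
        "model_b": [True, False, False],
        "model_c": [False, False, True],
    }
    scores = consensus_scores(flags)
    print(scores, share_at_least(scores, 1), share_at_least(scores, 3))
```

Raising the threshold `k` shrinks the candidate set, which mirrors the gap reported in the abstract between records flagged by at least one model and records flagged by all six.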

Title: Conditional Rectified Flow-based End-to-End Rapid Seismic Inversion Method

Authors: Haofei Xu, Wei Cheng, Sizhe Li, Jie Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15354
Pdf URL: https://arxiv.org/pdf/2603.15354
Copy Paste: [[2603.15354]] Conditional Rectified Flow-based End-to-End Rapid Seismic Inversion Method(https://arxiv.org/abs/2603.15354)
Keywords: diffusion, generative
Abstract: Seismic inversion is a core problem in geophysical exploration, where traditional methods suffer from high computational costs and are susceptible to initial model dependence. In recent years, deep generative model-based seismic inversion methods have achieved remarkable progress, but existing generative models struggle to balance sampling efficiency and inversion accuracy. This paper proposes an end-to-end fast seismic inversion method based on Conditional Rectified Flow[1], which designs a dedicated seismic encoder to extract multi-scale seismic features and adopts a layer-by-layer injection control strategy to achieve fine-grained conditional control. Experimental results demonstrate that the proposed method achieves excellent inversion accuracy on the OpenFWI[2] benchmark dataset. Compared with Diffusion[3,4] methods, it achieves sampling acceleration; compared with InversionNet[5,6,7] methods, it achieves higher accuracy in generation. Our zero-shot generalization experiments on Marmousi[8,9] real data further verify the practical value of the method. Experimental results show that the proposed method achieves excellent inversion accuracy on the OpenFWI benchmark dataset; compared with Diffusion methods, it achieves sampling acceleration while maintaining higher accuracy than InversionNet methods; experiments based on the Marmousi standard model further verify that this method can generate high-quality initial velocity models in a zero-shot manner, effectively alleviating the initial model dependency problem in traditional Full Waveform Inversion (FWI), and possesses industrial practical value.

Title: A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression

Authors: Yuming Han, Jooho Kim, Anish Shakya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15365
Pdf URL: https://arxiv.org/pdf/2603.15365
Copy Paste: [[2603.15365]] A PPO-Based Bitrate Allocation Conditional Diffusion Model for Remote Sensing Image Compression(https://arxiv.org/abs/2603.15365)
Keywords: diffusion
Abstract: Existing remote sensing image compression methods still struggle to balance high compression efficiency with the preservation of fine details and task-relevant information. Meanwhile, high-resolution drone imagery offers valuable structural details for urban monitoring and disaster assessment, but large-area datasets can easily reach hundreds of gigabytes, creating significant challenges for storage and long-term management. In this paper, we propose a PPO-based bitrate allocation Conditional Diffusion Compression (PCDC) framework. PCDC integrates a conditional diffusion decoder with a PPO-based block-wise bitrate allocation strategy to achieve high compression ratios while maintaining strong perceptual performance. We also release a high-resolution drone image dataset with richer structural details, captured at a consistent low altitude over residential neighborhoods in coastal urban areas. Experimental results show compression ratios of 19.3x on DIV2K and 21.2x on the drone image dataset. Moreover, downstream object detection experiments demonstrate that the reconstructed images preserve task-relevant information with negligible performance loss.

Title: Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation

Authors: Xiaoxian Zhang, Minghai Shi, Lei Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15374
Pdf URL: https://arxiv.org/pdf/2603.15374
Copy Paste: [[2603.15374]] Spectral Rectification for Parameter-Efficient Adaptation of Foundation Models in Colonoscopy Depth Estimation(https://arxiv.org/abs/2603.15374)
Keywords: foundation model
Abstract: Accurate monocular depth estimation is critical in colonoscopy for lesion localization and navigation. Foundation models trained on natural images fail to generalize directly to colonoscopy. We identify the core issue not as a semantic gap, but as a statistical shift in the frequency domain: colonoscopy images lack the strong high-frequency edge and texture gradients that these models rely on for geometric reasoning. To address this, we propose SpecDepth, a parameter-efficient adaptation framework that preserves the robust geometric representations of the pre-trained models while adapting to the colonoscopy domain. Its key innovation is an adaptive spectral rectification module, which uses a learnable wavelet decomposition to explicitly model and amplify the attenuated high-frequency components in feature maps. Different from conventional fine-tuning that risks distorting high-level semantic features, this targeted, low-level adjustment realigns the input signal with the original inductive bias of the foundational model. On the public C3VD and SimCol3D datasets, SpecDepth achieved state-of-the-art performance with an absolute relative error of 0.022 and 0.027, respectively. Our work demonstrates that directly addressing spectral mismatches is a highly effective strategy for adapting vision foundation models to specialized medical imaging tasks. The code will be released publicly after the manuscript is accepted for publication.

Title: AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Authors: Noe Claudel, Weisi Guo, Yang Xing
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15396
Pdf URL: https://arxiv.org/pdf/2603.15396
Copy Paste: [[2603.15396]] AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations(https://arxiv.org/abs/2603.15396)
Keywords: diffusion
Abstract: Facial identification systems are increasingly deployed in surveillance, yet their vulnerability to adversarial evasion and impersonation attacks poses a critical risk. This paper introduces a novel framework for generating adversarial patches capable of both evasion and impersonation attacks against deep re-identification models across non-overlapping cameras. Unlike prior approaches that require iterative patch optimisation for each target, our method employs a conditional encoder-decoder network to synthesize adversarial patches in a single forward pass, guided by multi-scale features from source and target images. The patches are optimised with a dual adversarial objective comprising pull and push terms. To enhance imperceptibility and aid physical deployment, we further integrate naturalistic patch generation using pre-trained latent diffusion models. Experiments on standard pedestrian (Market-1501, DukeMTMC-reID) and facial recognition (CelebA-HQ, PubFig) benchmark datasets demonstrate the effectiveness of the proposed method. Our adversarial evasion attacks reduce mean Average Precision from 90% to 0.4% in white-box settings and from 72% to 0.4% in black-box settings, showing strong cross-model generalization. In targeted impersonation attacks, our framework achieves a success rate of 27% on CelebA-HQ, competing with other patch-based methods. We go further and use clustering of activation maps to interpret which features are most used by adversarial attacks, and propose a pathway for future countermeasures. The results highlight the practicality of adversarial patch attacks on retrieval-based systems and underline the urgent need for robust defense strategies.

Title: AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Authors: Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma, Yujian Zheng, Jing Wang, Zheng Chong, Xujie Zhang, Xianhang Cheng, Xiaodan Liang, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15415
Pdf URL: https://arxiv.org/pdf/2603.15415
Copy Paste: [[2603.15415]] AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation(https://arxiv.org/abs/2603.15415)
Keywords: diffusion
Abstract: Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...

Title: Physics-informed fine-tuning of foundation models for partial differential equations

Authors: Vlad Medvedev, Leon Armbruster, Christopher Straub, Georg Kruse, Andreas Rosskopf
Subjects: cs.LG, cs.AI, math.AP, math.NA
Abstract URL: https://arxiv.org/abs/2603.15431
Pdf URL: https://arxiv.org/pdf/2603.15431
Copy Paste: [[2603.15431]] Physics-informed fine-tuning of foundation models for partial differential equations(https://arxiv.org/abs/2603.15431)
Keywords: foundation model
Abstract: Foundation models for partial differential equations (PDEs) have emerged as powerful surrogates pre-trained on diverse physical systems, but adapting them to new downstream tasks remains challenging due to limited task-specific data and distribution shifts. While fine-tuning has proven transformative in natural language processing, best practices for adapting PDE foundation models remain underexplored. Although physics-informed training has successfully trained accurate solvers across a wide range of PDE problems, its potential for fine-tuning data-based foundation models has not been systematically studied. In this work, we introduce a physics-informed fine-tuning framework that adapts pre-trained PDE foundation models by incorporating physical constraints (PDE residuals and boundary conditions) directly into the fine-tuning objective. This enables effective adaptation in data-scarce regimes while promoting physical consistency. We evaluate our method on a downstream task composed of an unseen PDE class and compare it with data-driven finetuning counterparts. Our results demonstrate that physics-informed fine-tuning achieves competitive accuracy without requiring PDE solutions for training. Furthermore, a hybrid fine-tuning strategy yields superior generalization to out-of-distribution scenarios when only minimal training data is available. These findings establish physics-informed fine-tuning as a scalable and data-efficient paradigm, providing a physically interpretable pathway for adapting foundation models in scientific machine learning.
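
A fine-tuning objective built from a PDE residual and boundary conditions can be sketched as below. This is only a toy instance under stated assumptions: the PDE is the 1D Poisson problem u''(x) = f(x) on a uniform grid, the residual uses central finite differences, and `bc_weight` and the function name are hypothetical. The abstract does not specify the PDE class, discretization, or weighting used in the paper.

```python
def physics_informed_loss(u, f, h, left_bc, right_bc, bc_weight=1.0):
    """Mean squared PDE residual plus a weighted boundary penalty.

    u: predicted solution values on a uniform grid with spacing h
    f: right-hand side values on the same grid
    No reference solution is needed; only the equation and boundary values.
    """
    if len(u) != len(f) or len(u) < 3:
        raise ValueError("u and f must have equal length of at least 3")
    residuals = [
        (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h) - f[i]
        for i in range(1, len(u) - 1)
    ]
    pde_term = sum(r * r for r in residuals) / len(residuals)
    bc_term = (u[0] - left_bc) ** 2 + (u[-1] - right_bc) ** 2
    return pde_term + bc_weight * bc_term


if __name__ == "__main__":
    h = 0.25
    xs = [i * h for i in range(5)]
    exact = [x * x for x in xs]           # u(x) = x^2 solves u'' = 2
    rhs = [2.0 for _ in xs]
    print(physics_informed_loss(exact, rhs, h, 0.0, 1.0))
    perturbed = list(exact)
    perturbed[2] += 0.1                   # violate the PDE at one grid point
    print(physics_informed_loss(perturbed, rhs, h, 0.0, 1.0))
```

Since the loss depends only on the governing equation and the boundary values, it can be evaluated without solver-generated training targets, which is the property the abstract highlights for data-scarce adaptation.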

Title: MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts

Authors: Zheng Zhang, Qinchuan Zhang, Yuteng Ye, Zhi Chen, Penglei Ji, Mengfei Li, Wenxiao Zhang, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15436
Pdf URL: https://arxiv.org/pdf/2603.15436
Copy Paste: [[2603.15436]] MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts(https://arxiv.org/abs/2603.15436)
Keywords: diffusion, generative
Abstract: Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from multiview inconsistency and missing textures on unseen parts, while UV inpainting methods do not generalize well due to insufficient UV data and cannot make good use of 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation with the inpainting ability of UV refinement to obtain high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method achieves better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.

Title: ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Authors: Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15478
Pdf URL: https://arxiv.org/pdf/2603.15478
Copy Paste: [[2603.15478]] ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer(https://arxiv.org/abs/2603.15478)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at this https URL.

Title: RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance

Authors: Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15484
Pdf URL: https://arxiv.org/pdf/2603.15484
Copy Paste: [[2603.15484]] RSGen: Enhancing Layout-Driven Remote Sensing Image Generation with Diverse Edge Guidance(https://arxiv.org/abs/2603.15484)
Keywords: diffusion
Abstract: Diffusion models have significantly mitigated the impact of annotated data scarcity in remote sensing (RS). Although recent approaches have successfully harnessed these models to enable diverse and controllable Layout-to-Image (L2I) synthesis, they still suffer from limited fine-grained control and fail to strictly adhere to bounding box constraints. To address these limitations, we propose RSGen, a plug-and-play framework that leverages diverse edge guidance to enhance layout-driven RS image generation. Specifically, RSGen employs a progressive enhancement strategy: 1) it first enriches the diversity of edge maps composited from retrieved training instances via Image-to-Image generation; and 2) subsequently utilizes these diverse edge maps as conditioning for existing L2I models to enforce pixel-level control within bounding boxes, ensuring the generated instances strictly adhere to the layout. Extensive experiments across three baseline models demonstrate that RSGen significantly boosts the capabilities of existing L2I models. For instance, with CC-Diff on the DOTA dataset for oriented object detection, we achieve remarkable gains of +9.8/+12.0 in YOLOScore mAP50/mAP50-95 and +1.6 in mAP on the downstream detection task. Our code will be publicly available: this https URL

Title: Kimodo: Scaling Controllable Human Motion Generation

Authors: Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2603.15546
Pdf URL: https://arxiv.org/pdf/2603.15546
Copy Paste: [[2603.15546]] Kimodo: Scaling Controllable Human Motion Generation(https://arxiv.org/abs/2603.15546)
Keywords: diffusion, generative
Abstract: High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how scaling dataset size and model size affects performance.

Title: Self-Distillation of Hidden Layers for Self-Supervised Representation Learning

Authors: Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15553
Pdf URL: https://arxiv.org/pdf/2603.15553
Copy Paste: [[2603.15553]] Self-Distillation of Hidden Layers for Self-Supervised Representation Learning(https://arxiv.org/abs/2603.15553)
Keywords: self-supervised, generative
Abstract: The landscape of self-supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low-level data, and predictive approaches (e.g., I-JEPA) that predict high-level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high-redundancy modalities like imagery, and their training objective does not prioritize learning high-level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non-stationary targets of final-layer self-distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I-JEPA) on classification of ImageNet-1K and iNaturalist-21, and semantic segmentation of ADE20K and Cityscapes.

Title: Learning Latent Proxies for Controllable Single-Image Relighting

Authors: Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun Li, Xiaogang Xu, Harry Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15555
Pdf URL: https://arxiv.org/pdf/2603.15555
Copy Paste: [[2603.15555]] Learning Latent Proxies for Controllable Single-Image Relighting(https://arxiv.org/abs/2603.15555)
Keywords: diffusion
Abstract: Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary and redundant for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies sensitive illumination regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.

Title: Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15557
Pdf URL: https://arxiv.org/pdf/2603.15557
Copy Paste: [[2603.15557]] Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models(https://arxiv.org/abs/2603.15557)
Keywords: anomaly
Abstract: Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection therefore becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.

Title: Grounding World Simulation Models in a Real-World Metropolis

Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15583
Pdf URL: https://arxiv.org/pdf/2603.15583
Copy Paste: [[2603.15583]] Grounding World Simulation Models in a Real-World Metropolis(https://arxiv.org/abs/2603.15583)
Keywords: generative
Abstract: What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.

Title: Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Authors: Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15614
Pdf URL: https://arxiv.org/pdf/2603.15614
Copy Paste: [[2603.15614]] Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion(https://arxiv.org/abs/2603.15614)
Keywords: diffusion
Abstract: Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scene and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.

Title: Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.15618
Pdf URL: https://arxiv.org/pdf/2603.15618
Copy Paste: [[2603.15618]] Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models(https://arxiv.org/abs/2603.15618)
Keywords: foundation model
Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.