2026-03-31

Title: GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models

Authors: Lipeng Wan, Junjie Ma, Jianhui Gu, Zeyang Liu, Xuyang Lu, Xuguang Lan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26675
Pdf URL: https://arxiv.org/pdf/2603.26675
Copy Paste: [[2603.26675]] GeoBlock: Inferring Block Granularity from Dependency Geometry in Diffusion Language Models(https://arxiv.org/abs/2603.26675)
Keywords: diffusion
Abstract: Block diffusion enables efficient parallel refinement in diffusion language models, but its decoding behavior depends critically on block size. Existing block-sizing strategies rely on fixed rules or heuristic signals and do not account for the dependency geometry that determines which tokens can be safely refined together. This motivates a geometry view of diffusion decoding: \emph{regions with strong causal ordering require sequential updates, whereas semantically cohesive regions admit parallel refinement.} We introduce GeoBlock, a geometry-aware block inference framework that determines block granularity directly from attention-derived dependency geometry. Instead of relying on predefined schedules or local confidence heuristics, GeoBlock analyzes cross-token dependency patterns to identify geometrically stable refinement regions and dynamically determines appropriate block boundaries during decoding. By adapting block granularity to the dependency geometry, GeoBlock preserves the parallel efficiency of block diffusion while enforcing dependency-consistent refinement that exhibits autoregressive reliability. GeoBlock requires no additional training and integrates seamlessly into existing block diffusion architectures. Extensive experiments across multiple benchmarks show that GeoBlock reliably identifies geometry-consistent block boundaries and improves the accuracy of block diffusion with only a small additional computational budget.

Title: A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data

Authors: Aram Ansary Ogholbake, Hannah Choi, Spencer Brandenburg, Alyssa Antuna, Zahraa Al-Sharshahi, Makayla Cox, Haseeb Ahmed, Jacqueline Frank, Nathan Millson, Luke Bauerle, Jessica Lee, David Dornbos III, Qiang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26726
Pdf URL: https://arxiv.org/pdf/2603.26726
Copy Paste: [[2603.26726]] A Multimodal Deep Learning Framework for Edema Classification Using HCT and Clinical Data(https://arxiv.org/abs/2603.26726)
Keywords: self-supervised
Abstract: We propose AttentionMixer, a unified deep learning framework for multimodal detection of brain edema that combines structural head CT (HCT) with routine clinical metadata. While HCT provides rich spatial information, clinical variables such as age, laboratory values, and scan timing capture complementary context that might be ignored or naively concatenated. AttentionMixer is designed to fuse these heterogeneous sources in a principled and efficient manner. HCT volumes are first encoded using a self-supervised Vision Transformer Autoencoder (ViT-AE++), without requiring large labeled datasets. Clinical metadata are mapped into the same feature space and used as keys and values in a cross-attention module, where HCT-derived feature vector serves as queries. This cross-attention fusion allows the network to dynamically modulate imaging features based on patient-specific context and provides an interpretable mechanism for multimodal integration. A lightweight MLP-Mixer then refines the fused representation before final classification, enabling global dependency modeling with substantially reduced parameter overhead. Missing or incomplete metadata are handled via a learnable embedding, promoting robustness to real-world clinical data quality. We evaluate AttentionMixer on a curated brain HCT cohort with expert edema annotations using five-fold cross-validation. Compared with strong HCT-only, metadata-only, and prior multimodal baselines, AttentionMixer achieves superior performance (accuracy 87.32%, precision 92.10%, F1-score 85.37%, AUC 94.14%). Ablation studies confirm the benefit of both cross-attention and MLP-Mixer refinement, and permutation-based metadata importance analysis highlights clinically meaningful variables driving predictions. These results demonstrate that structured, interpretable multimodal fusion can substantially improve edema detection in clinical practice.

Title: Language-Conditioned World Modeling for Visual Navigation

Authors: Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.26741
Pdf URL: https://arxiv.org/pdf/2603.26741
Copy Paste: [[2603.26741]] Language-Conditioned World Modeling for Visual Navigation(https://arxiv.org/abs/2603.26741)
Keywords: diffusion
Abstract: We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an actor-critic agent trained in the latent space of the world model. The second family, LCVN-Uni, adopts an autoregressive multimodal architecture that predicts both actions and future observations. Experiments show that these families offer different advantages: the former provides more temporally coherent rollouts, whereas the latter generalizes better to unseen environments. Taken together, these observations point to the value of jointly studying language grounding, imagination, and policy learning in a unified task setting, and LCVN provides a concrete basis for further investigation of language-conditioned world models. The code is available at this https URL.

Title: Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection

Authors: Yang Liu, Boan Chen, Yuanyuan Meng, Jing Liu, Zhengliang Guo, Wei Zhou, Peng Sun, Hong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26745
Pdf URL: https://arxiv.org/pdf/2603.26745
Copy Paste: [[2603.26745]] Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection(https://arxiv.org/abs/2603.26745)
Keywords: anomaly
Abstract: As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow) that decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs vector quantized variational auto-encoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC respectively.

Title: From Diffusion To Flow: Efficient Motion Generation In MotionGPT3

Authors: Jaymin Ban, JiHong Jeon, SangYeop Jeong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26747
Pdf URL: https://arxiv.org/pdf/2603.26747
Copy Paste: [[2603.26747]] From Diffusion To Flow: Efficient Motion Generation In MotionGPT3(https://arxiv.org/abs/2603.26747)
Keywords: diffusion, generative
Abstract: Recent text-driven motion generation methods span both discrete token-based approaches and continuous-latent formulations. MotionGPT3 exemplifies the latter paradigm, combining a learned continuous motion latent space with a diffusion-based prior for text-conditioned synthesis. While rectified flow objectives have recently demonstrated favorable convergence and inference-time properties relative to diffusion in image and audio generation, it remains unclear whether these advantages transfer cleanly to the motion generation setting. In this work, we conduct a controlled empirical study comparing diffusion and rectified flow objectives within the MotionGPT3 framework. By holding the model architecture, training protocol, and evaluation setup fixed, we isolate the effect of the generative objective on training dynamics, final performance, and inference efficiency. Experiments on the HumanML3D dataset show that rectified flow converges in fewer training epochs, reaches strong test performance earlier, and matches or exceeds diffusion-based motion quality under identical conditions. Moreover, flow-based priors exhibit stable behavior across a wide range of inference step counts and achieve competitive quality with fewer sampling steps, yielding improved efficiency--quality trade-offs. Overall, our results suggest that several known benefits of rectified flow objectives do extend to continuous-latent text-to-motion generation, highlighting the importance of the training objective choice in motion priors.

Title: Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models

Authors: Qionghao Huang, Can Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26751
Pdf URL: https://arxiv.org/pdf/2603.26751
Copy Paste: [[2603.26751]] Survey on Remote Sensing Scene Classification: From Traditional Methods to Large Generative AI Models(https://arxiv.org/abs/2603.26751)
Keywords: self-supervised, foundation model, generative
Abstract: Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.

Title: Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data

Authors: David Brundage
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26754
Pdf URL: https://arxiv.org/pdf/2603.26754
Copy Paste: [[2603.26754]] Generating Synthetic Wildlife Health Data from Camera Trap Imagery: A Pipeline for Alopecia and Body Condition Training Data(https://arxiv.org/abs/2603.26754)
Keywords: generative
Abstract: No publicly available, ML ready datasets exist for wildlife health conditions in camera trap imagery, creating a fundamental barrier to automated health screening. We present a pipeline for generating synthetic training images depicting alopecia and body condition deterioration in wildlife from real camera trap photographs. Our pipeline constructs a curated base image set from iWildCam using MegaDetector derived bounding boxes and center frame weighted stratified sampling across 8 North American species. A generative phenotype editing system produces controlled severity variants depicting hair loss consistent with mange and emaciation. An adaptive scene drift quality control system uses a sham prefilter and decoupled mask then score approach with complementary day or night metrics to reject images where the generative model altered the original scene. We frame the pipeline explicitly as a screening data source. From 201 base images across 4 species, we generate 553 QC passing synthetic variants with an overall pass rate of 83 percent. A sim to real transfer experiment training exclusively on synthetic data and testing on real camera trap images of suspected health conditions achieves 0.85 AUROC, demonstrating that the synthetic data captures visual features sufficient for screening.

Title: Physics-Aware Diffusion for LiDAR Point Cloud Densification

Authors: Zeping Zhang, Robert Laganière
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26759
Pdf URL: https://arxiv.org/pdf/2603.26759
Copy Paste: [[2603.26759]] Physics-Aware Diffusion for LiDAR Point Cloud Densification(https://arxiv.org/abs/2603.26759)
Keywords: diffusion
Abstract: LiDAR perception is severely limited by the distance-dependent sparsity of distant objects. While diffusion models can recover dense geometry, they suffer from prohibitive latency and physical hallucinations manifesting as ghost points. We propose Scanline-Consistent Range-Aware Diffusion, a framework that treats densification as probabilistic refinement rather than generation. By leveraging Partial Diffusion (SDEdit) on a coarse prior, we achieve high-fidelity results in just 156ms. Our novel Ray-Consistency loss and Negative Ray Augmentation enforce sensor physics to suppress artifacts. Our method achieves state-of-the-art results on KITTI-360 and nuScenes, directly boosting off-the-shelf 3D detectors without retraining. Code will be made available.

Title: A training-free framework for high-fidelity appearance transfer via diffusion transformers

Authors: Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26767
Pdf URL: https://arxiv.org/pdf/2603.26767
Copy Paste: [[2603.26767]] A training-free framework for high-fidelity appearance transfer via diffusion transformers(https://arxiv.org/abs/2603.26767)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. Our core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.

Title: Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models

Authors: Chen Zheng, Yuxuan Lai, Haoyang Lu, Wentao Ma, Jitao Yang, Jian Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.26768
Pdf URL: https://arxiv.org/pdf/2603.26768
Copy Paste: [[2603.26768]] Aesthetic Assessment of Chinese Handwritings Based on Vision Language Models(https://arxiv.org/abs/2603.26768)
Keywords: in-context
Abstract: The handwriting of Chinese characters is a fundamental aspect of learning the Chinese language. Previous automated assessment methods often framed scoring as a regression problem. However, this score-only feedback lacks actionable guidance, which limits its effectiveness in helping learners improve their handwriting skills. In this paper, we leverage vision-language models (VLMs) to analyze the quality of handwritten Chinese characters and generate multi-level feedback. Specifically, we investigate two feedback generation tasks: simple grade feedback (Task 1) and enriched, descriptive feedback (Task 2). We explore both low-rank adaptation (LoRA)-based fine-tuning strategies and in-context learning methods to integrate aesthetic assessment knowledge into VLMs. Experimental results show that our approach achieves state-of-the-art performances across multiple evaluation tracks in the CCL 2025 workshop on evaluation of handwritten Chinese character quality.

Title: LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models

Authors: Shaik Aman
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26771
Pdf URL: https://arxiv.org/pdf/2603.26771
Copy Paste: [[2603.26771]] LogicDiff: Logic-Guided Denoising Improves Reasoning in Masked Diffusion Language Models(https://arxiv.org/abs/2603.26771)
Keywords: diffusion
Abstract: Masked diffusion language models (MDLMs) generate text by iteratively unmasking tokens from a fully masked sequence, offering parallel generation and bidirectional context. However, their standard confidence-based unmasking strategy systematically defers high-entropy logical connective tokens, the critical branching points in reasoning chains, leading to severely degraded reasoning performance. We introduce LogicDiff, an inference-time method that replaces confidence-based unmasking with logic-role-guided unmasking. A lightweight classification head (4.2M parameters, 0.05% of the base model) predicts the logical role of each masked position (premise, connective, derived step, conclusion, or filler) from the base model's hidden states with 98.4% accuracy. A dependency-ordered scheduler then unmasks tokens in logical dependency order: premises first, then connectives, then derived steps, then conclusions. Without modifying a single parameter of the base model and without any reinforcement learning or task-specific training, LogicDiff improves LLaDA-8B-Instruct accuracy from 22.0% to 60.7% on GSM8K (+38.7 percentage points) and from 23.6% to 29.2% on MATH-500 (+5.6 pp), with less than 6% speed overhead. Our results demonstrate that a substantial portion of the reasoning deficit in MDLMs is attributable to suboptimal token unmasking order, not to limitations of the model's learned representations.

Title: Learning to Select Visual In-Context Demonstrations

Authors: Eugene Lee, Yu-Chi Lin, Jiajie Diao
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.26775
Pdf URL: https://arxiv.org/pdf/2603.26775
Copy Paste: [[2603.26775]] Learning to Select Visual In-Context Demonstrations(https://arxiv.org/abs/2603.26775)
Keywords: in-context
Abstract: Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.

Title: TED: Training-Free Experience Distillation for Multimodal Reasoning

Authors: Shuozhi Yuan, Jinqing Wang, Zihao Liu, Miaomiao Yuan, Haoran Peng, Jin Zhao, Bingwen Wang, Haoyi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26778
Pdf URL: https://arxiv.org/pdf/2603.26778
Copy Paste: [[2603.26778]] TED: Training-Free Experience Distillation for Multimodal Reasoning(https://arxiv.org/abs/2603.26778)
Keywords: in-context
Abstract: Knowledge distillation is typically realized by transferring a teacher model's knowledge into a student's parameters through supervised or reinforcement-based optimization. While effective, such approaches require repeated parameter updates and large-scale training data, limiting their applicability in resource-constrained environments. In this work, we propose TED, a training-free, context-based distillation framework that shifts the update target of distillation from model parameters to an in-context experience injected into the student's prompt. For each input, the student generates multiple reasoning trajectories, while a teacher independently produces its own solution. The teacher then compares the student trajectories with its reasoning and the ground-truth answer, extracting generalized experiences that capture effective reasoning patterns. These experiences are continuously refined and updated over time. A key challenge of context-based distillation is unbounded experience growth and noise accumulation. TED addresses this with an experience compression mechanism that tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences. Experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles show that TED consistently improves performance. On MathVision, TED raises the performance of Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under this low-data, no-update setting, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x, demonstrating that meaningful knowledge transfer can be achieved through contextual experience.

Title: Can We Change the Stroke Size for Easier Diffusion?

Authors: Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26783
Pdf URL: https://arxiv.org/pdf/2603.26783
Copy Paste: [[2603.26783]] Can We Change the Stroke Size for Easier Diffusion?(https://arxiv.org/abs/2603.26783)
Keywords: diffusion
Abstract: Diffusion models can be challenged in the low signal-to-noise regime, where they have to make pixel-level predictions despite the presence of high noise. The geometric intuition is akin to using the finest stroke for oil painting throughout, which may be ineffective. We therefore study stroke-size control as a controlled intervention that changes the effective roughness of the supervised target, predictions and perturbations across timesteps, in an attempt to ease the low signal-to-noise challenge. We analyze the advantages and trade-offs of the intervention both theoretically and empirically. Code will be released.

Title: Elucidating the Design Space of Flow Matching for Cellular Microscopy

Authors: Charles Jones, Emmanuel Noutahi, Jason Hartford, Cian Eastwood
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26790
Pdf URL: https://arxiv.org/pdf/2603.26790
Copy Paste: [[2603.26790]] Elucidating the Design Space of Flow Matching for Cellular Microscopy(https://arxiv.org/abs/2603.26790)
Keywords: foundation model, generative
Abstract: Flow-matching generative models are increasingly used to simulate cell responses to biological perturbations. However, the design space for building such models is large and underexplored. We systematically analyse the design space of flow matching models for cell-microscopy images, finding that many popular techniques are unnecessary and can even hurt performance. We develop a simple, stable, and scalable recipe which we use to train our foundation model. We scale our model to two orders of magnitude larger than prior methods, achieving a two-fold FID and ten-fold KID improvement over prior methods. We then fine-tune our model with pre-trained molecular embeddings to achieve state-of-the-art performance simulating responses to unseen molecules. Code is available at this https URL

Title: Gaussian Joint Embeddings For Self-Supervised Representation Learning

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.26799
Pdf URL: https://arxiv.org/pdf/2603.26799
Copy Paste: [[2603.26799]] Gaussian Joint Embeddings For Self-Supervised Representation Learning(https://arxiv.org/abs/2603.26799)
Keywords: self-supervised, generative
Abstract: Self-supervised representation learning often relies on deterministic predictive architectures to align context and target views in latent space. While effective in many settings, such methods are limited in genuinely multi-modal inverse problems, where squared-loss prediction collapses towards conditional averages, and they frequently depend on architectural asymmetries to prevent representation collapse. In this work, we propose a probabilistic alternative based on generative joint modeling. We introduce Gaussian Joint Embeddings (GJE) and its multi-modal extension, Gaussian Mixture Joint Embeddings (GMJE), which model the joint density of context and target representations and replace black-box prediction with closed-form conditional inference under an explicit probabilistic model. This yields principled uncertainty estimates and a covariance-aware objective for controlling latent geometry. We further identify a failure mode of naive empirical batch optimization, which we term the Mahalanobis Trace Trap, and develop several remedies spanning parametric, adaptive, and non-parametric settings, including prototype-based GMJE, conditional Mixture Density Networks (GMJE-MDN), topology-adaptive Growing Neural Gas (GMJE-GNG), and a Sequential Monte Carlo (SMC) memory bank. In addition, we show that standard contrastive learning can be interpreted as a degenerate non-parametric limiting case of the GMJE framework. Experiments on synthetic multi-modal alignment tasks and vision benchmarks show that GMJE recovers complex conditional structure, learns competitive discriminative representations, and defines latent densities that are better suited to unconditional sampling than deterministic or unimodal baselines.

Title: Epileptic Seizure Prediction Using Patient-Adaptive Transformer Networks

Authors: Mohamed Mahdi, Asma Baghdadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26821
Pdf URL: https://arxiv.org/pdf/2603.26821
Copy Paste: [[2603.26821]] Epileptic Seizure Prediction Using Patient-Adaptive Transformer Networks(https://arxiv.org/abs/2603.26821)
Keywords: self-supervised
Abstract: Epileptic seizure prediction from electroencephalographic (EEG) recordings remains challenging due to strong inter-patient variability and the complex temporal structure of neural signals. This paper presents a patient-adaptive transformer framework for short-horizon seizure forecasting. The proposed approach employs a two-stage training strategy: self-supervised pretraining is first used to learn general EEG temporal representations through autoregressive sequence modeling, followed by patient-specific fine-tuning for binary prediction of seizure onset within a 30-second horizon. To enable transformer-based sequence learning, multichannel EEG signals are processed using noise-aware preprocessing and discretized into tokenized temporal sequences. Experiments conducted on subjects from the TUH EEG dataset demonstrate that the proposed method achieves validation accuracies above 90% and F1 scores exceeding 0.80 across evaluated patients, supporting the effectiveness of combining self-supervised representation learning with patient-specific adaptation for individualized seizure prediction.

Title: Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations

Authors: Mayank Jha
Subjects: cs.LG, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2603.26823
Pdf URL: https://arxiv.org/pdf/2603.26823
Copy Paste: [[2603.26823]] Throughput Optimization as a Strategic Lever in Large-Scale AI Systems: Evidence from Dataloader and Memory Profiling Innovations(https://arxiv.org/abs/2603.26823)
Keywords: foundation model
Abstract: The development of large-scale foundation models, particularly Large Language Models (LLMs), is constrained by significant computational and memory bottlenecks. These challenges elevate throughput optimization from a mere engineering task to a critical strategic lever, directly influencing training time, operational cost, and the feasible scale of next-generation models. This paper synthesizes evidence from recent academic and industry innovations to analyze key advancements in training efficiency. We examine architectural solutions to dataloader bottlenecks, such as the OVERLORD framework, which has demonstrated a 4.5% improvement in end-to-end training throughput. We investigate memory optimization techniques designed to overcome the GPU memory wall, including CPU offloading strategies like DeepSpeed's ZeRO-Offload, which enable the training of models far exceeding single-accelerator capacity. Furthermore, we explore the growing importance of compiler-centric optimizations, exemplified by Triton-distributed, which enables the joint optimization of computation, memory, and communication for substantial performance gains. The analysis is contextualized by advanced profiling tools and hardware characterization studies that identify and mitigate previously overlooked overheads like Dynamic Voltage and Frequency Scaling (DVFS). Findings indicate that a holistic, system-level approach, integrating innovations across data pipelines, memory management, network fabrics, and compiler technologies, is essential for accelerating AI development, managing costs, and pushing the boundaries of model scale.

Title: Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics

Authors: Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.26827
Pdf URL: https://arxiv.org/pdf/2603.26827
Copy Paste: [[2603.26827]] Central-to-Local Adaptive Generative Diffusion Framework for Improving Gene Expression Prediction in Data-Limited Spatial Transcriptomics(https://arxiv.org/abs/2603.26827)
Keywords: diffusion, generative
Abstract: Spatial Transcriptomics (ST) provides spatially resolved gene expression profiles within intact tissue architecture, enabling molecular analysis in histological context. However, the high cost, limited throughput, and restricted data sharing of ST experiments result in severe data scarcity, constraining the development of robust computational models. To address this limitation, we present a Central-to-Local adaptive generative diffusion framework for ST (C2L-ST) that integrates large-scale morphological priors with limited molecular guidance. A global central model is first pretrained on extensive histopathology datasets to learn transferable morphological representations, and institution-specific local models are then adapted through lightweight gene-conditioned modulation using a small number of paired image-gene spots. This strategy enables the synthesis of realistic and molecularly consistent histology patches under data-limited conditions. The generated images exhibit high visual and structural fidelity, reproduce cellular composition, and show strong embedding overlap with real data across multiple organs, reflecting both realism and diversity. When incorporated into downstream training, synthetic image-gene pairs improve gene expression prediction accuracy and spatial coherence, achieving performance comparable to real data while requiring only a fraction of sampled spots. C2L-ST provides a scalable and data-efficient framework for molecular-level data augmentation, offering a domain-adaptive and generalizable approach for integrating histology and transcriptomics in spatial biology and related fields.

Title: Envisioning global urban development with satellite imagery and generative AI

Authors: Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash, Shenhao Wang, Jinhua Zhao, Alex "Sandy'' Pentland
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26831
Pdf URL: https://arxiv.org/pdf/2603.26831
Copy Paste: [[2603.26831]] Envisioning global urban development with satellite imagery and generative AI(https://arxiv.org/abs/2603.26831)
Keywords: generative
Abstract: Urban development has been a defining force in human history, shaping cities for centuries. However, past studies mostly analyze such development as predictive tasks, failing to reflect its generative nature. Therefore, this study designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating prompts and geospatial controls, our framework can generate high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practices by learning from the surrounding environment. Beyond visual synthesis, we find that it encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. The latent representations can also enhance downstream prediction tasks such as carbon emission prediction. Further, human expert evaluation confirms that our generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerated urban planning and supports scenario-based planning processes for worldwide cities.

Title: VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection

Authors: PengYu Chen, Shang Wan, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.26842
Pdf URL: https://arxiv.org/pdf/2603.26842
Copy Paste: [[2603.26842]] VAN-AD: Visual Masked Autoencoder with Normalizing Flow For Time Series Anomaly Detection(https://arxiv.org/abs/2603.26842)
Keywords: foundation model, anomaly
Abstract: Time series anomaly detection (TSAD) is essential for maintaining the reliability and security of IoT-enabled service systems. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. To address this limitation, foundation models have emerged as a promising direction. However, existing approaches either repurpose large language models (LLMs) or construct largescale time series datasets to develop general anomaly detection foundation models, and still face challenges caused by severe cross-modal gaps or in-domain heterogeneity. In this paper, we investigate the applicability of large-scale vision models to TSAD. Specifically, we adapt a visual Masked Autoencoder (MAE) pretrained on ImageNet to the TSAD task. However, directly transferring MAE to TSAD introduces two key challenges: overgeneralization and limited local perception. To address these challenges, we propose VAN-AD, a novel MAE-based framework for TSAD. To alleviate the over-generalization issue, we design an Adaptive Distribution Mapping Module (ADMM), which maps the reconstruction results before and after MAE into a unified statistical space to amplify discrepancies caused by abnormal patterns. To overcome the limitation of local perception, we further develop a Normalizing Flow Module (NFM), which combines MAE with normalizing flow to estimate the probability density of the current window under the global distribution. Extensive experiments on nine real-world datasets demonstrate that VAN-AD consistently outperforms existing state-of-the-art methods across multiple evaluation this http URL make our code and datasets available at this https URL.

Title: Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Authors: Dongsheng Yang, Yinfeng Yu, Liejun Wang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2603.26859
Pdf URL: https://arxiv.org/pdf/2603.26859
Copy Paste: [[2603.26859]] Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation(https://arxiv.org/abs/2603.26859)
Keywords: generative
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to effectively capture key semantic cues and accurately align them with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that synergistically integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and utilizes Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are effectively integrated via the Goal-Aware Augmentor and Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at this https URL.

Title: LACON: Training Text-to-Image Model from Uncurated Data

Authors: Zhiyang Liang, Ziyu Wan, Hongyu Liu, Dong Chen, Qiu Shen, Hao Zhu, Dongdong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26866
Pdf URL: https://arxiv.org/pdf/2603.26866
Copy Paste: [[2603.26866]] LACON: Training Text-to-Image Model from Uncurated Data(https://arxiv.org/abs/2603.26866)
Keywords: generative
Abstract: The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data based on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that exploits the underlying uncurated data distribution. Instead of filtering, LACON re-purposes quality signals, such as aesthetic scores and watermark probabilities as explicit, quantitative condition labels. The generative model is then trained to learn the full spectrum of data quality, from bad to good. By learning the explicit boundary between high- and low-quality content, LACON achieves superior generation quality compared to baselines trained only on filtered data using the same compute budget, proving the significant value of uncurated data.

Title: Property-Guided Molecular Generation and Optimization via Latent Flows

Authors: Alexander Arjun Lobo, Urvi Awasthi, Leonid Zhukov
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2603.26889
Pdf URL: https://arxiv.org/pdf/2603.26889
Copy Paste: [[2603.26889]] Property-Guided Molecular Generation and Optimization via Latent Flows(https://arxiv.org/abs/2603.26889)
Keywords: generative
Abstract: Molecular discovery is increasingly framed as an inverse design problem: identifying molecular structures that satisfy desired property profiles under feasibility constraints. While recent generative models provide continuous latent representations of chemical space, targeted optimization within these representations often leads to degraded validity, loss of structural fidelity, or unstable behavior. We introduce MoltenFlow, a modular framework that combines property-organized latent representations with flow-matching generative priors and gradient-based guidance. This formulation supports both conditioned generation and local optimization within a single latent-space framework. We show that guided latent flows enable efficient multi-objective molecular optimization under fixed oracle budgets with controllable trade-offs, while a learned flow prior improves unconditional generation quality.

Title: Strategic Candidacy in Generative AI Arenas

Authors: Chris Hays, Rachel Li, Bailey Flanigan, Manish Raghavan
Subjects: cs.LG, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2603.26891
Pdf URL: https://arxiv.org/pdf/2603.26891
Copy Paste: [[2603.26891]] Strategic Candidacy in Generative AI Arenas(https://arxiv.org/abs/2603.26891)
Keywords: generative
Abstract: AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.

Title: Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark

Authors: Laura Pedrouzo-Rodriguez, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Roberto Daza, Aythami Morales, Julian Fierrez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.26934
Pdf URL: https://arxiv.org/pdf/2603.26934
Copy Paste: [[2603.26934]] Leveraging Avatar Fingerprinting: A Multi-Generator Photorealistic Talking-Head Public Database and Benchmark(https://arxiv.org/abs/2603.26934)
Keywords: foundation model
Abstract: Recent advances in photorealistic avatar generation have enabled highly realistic talking-head avatars, raising security concerns regarding identity impersonation in AI-mediated communication. To advance in this challenging problem, the task of avatar fingerprinting aims to determine whether two avatar videos are driven by the same human operator or not. However, current public databases in the literature are scarce and based solely on old-fashioned talking-head avatar generators, not representing realistic scenarios for the current task of avatar fingerprinting. To overcome this situation, the present article introduces AVAPrintDB, a new publicly available multi-generator talking-head avatar database for avatar fingerprinting. AVAPrintDB is constructed from two audiovisual corpora and three state-of-the-art avatar generators (GAGAvatar, LivePortrait, HunyuanPortrait), representing different synthesis paradigms, and includes both self- and cross-reenactments to simulate legitimate usage and impersonation scenarios. Building on this database, we also define a standardized and reproducible benchmark for avatar fingerprinting, considering public state-of-the-art avatar fingerprinting systems and exploring novel methods based on Foundation Models (DINOv2 and CLIP). Also, we conduct a comprehensive analysis under generator and dataset shift. Our results show that, while identity-related motion cues persist across synthetic avatars, current avatar fingerprinting systems remain highly sensitive to changes in the synthesis pipeline and source domain. The AVAPrintDB, benchmark protocols, and avatar fingerprinting systems are publicly available to facilitate reproducible research.

Title: RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models

Authors: Rahul Soni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.27008
Pdf URL: https://arxiv.org/pdf/2603.27008
Copy Paste: [[2603.27008]] RASPRef: Retrieval-Augmented Self-Supervised Prompt Refinement for Large Reasoning Models(https://arxiv.org/abs/2603.27008)
Keywords: self-supervised
Abstract: Recent reasoning-focused language models such as DeepSeek R1 and OpenAI o1 have demonstrated strong performance on structured reasoning benchmarks including GSM8K, MATH, and multi-hop question answering tasks. However, their performance remains highly sensitive to prompt formulation, and designing effective prompts is typically a manual and iterative process that does not scale well across tasks or domains. To address this limitation, we introduce Retrieval-Augmented Self-Supervised Prompt Refinement (RASPRef), a framework that improves prompts without requiring human annotations or task-specific supervision. The approach retrieves relevant examples and previously generated reasoning trajectories, and leverages signals such as multi-sample consistency, verifier feedback, and model-generated critiques to iteratively refine the prompt. Unlike prior approaches that focus primarily on improving model outputs, RASPRef directly treats the prompt as the optimization target and improves it through an iterative retrieval-guided refinement process. Experiments on GSM8K-style mathematical reasoning tasks show that retrieval-guided prompting improves performance compared with a static prompting baseline. We further discuss how retrieval quality, trajectory selection, and self-supervised feedback signals may influence the effectiveness of prompt refinement. These findings suggest that prompt design remains a critical factor for reasoning-oriented language models, and that self-improving prompts offer a practical and scalable strategy for improving reasoning performance.

Title: Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics

Authors: Linus Härenstam-Nielsen, Dmitrii Pozdeev, Thomas Dagès, Nikita Araslanov, Daniel Cremers
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27016
Pdf URL: https://arxiv.org/pdf/2603.27016
Copy Paste: [[2603.27016]] Generative Shape Reconstruction with Geometry-Guided Langevin Dynamics(https://arxiv.org/abs/2603.27016)
Keywords: diffusion, generative
Abstract: Reconstructing complete 3D shapes from incomplete or noisy observations is a fundamentally ill-posed problem that requires balancing measurement consistency with shape plausibility. Existing methods for shape reconstruction can achieve strong geometric fidelity in ideal conditions but fail under realistic conditions with incomplete measurements or noise. At the same time, recent generative models for 3D shapes can synthesize highly realistic and detailed shapes but fail to be consistent with observed measurements. In this work, we introduce GG-Langevin: Geometry-Guided Langevin dynamics, a probabilistic approach that unifies these complementary perspectives. By traversing the trajectories of Langevin dynamics induced by a diffusion model, while preserving measurement consistency at every step, we generatively reconstruct shapes that fit both the measurements and the data-informed prior. We demonstrate through extensive experiments that GG-Langevin achieves higher geometric accuracy and greater robustness to missing data than existing methods for surface reconstruction.

Title: Unified Number-Free Text-to-Motion Generation Via Flow Matching

Authors: Guanhe Huang, Oya Celiktutan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27040
Pdf URL: https://arxiv.org/pdf/2603.27040
Copy Paste: [[2603.27040]] Unified Number-Free Text-to-Motion Generation Via Flow Matching(https://arxiv.org/abs/2603.27040)
Keywords: generative
Abstract: Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF' s effectiveness as a generalist model for multi-person motion generation from text. Project page: this https URL.

Title: Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching

Authors: Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27044
Pdf URL: https://arxiv.org/pdf/2603.27044
Copy Paste: [[2603.27044]] Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching(https://arxiv.org/abs/2603.27044)
Keywords: generative
Abstract: Deep Reinforcement Learning (DRL) is widely recognized as sample-inefficient, a limitation attributable in part to the high dimensionality and substantial functional redundancy inherent to the policy parameter space. A recent framework, which we refer to as Action-based Policy Compression (APC), mitigates this issue by compressing the parameter space $\Theta$ into a low-dimensional latent manifold $\mathcal Z$ using a learned generative mapping $g:\mathcal Z \to \Theta$. However, its performance is severely constrained by relying on immediate action-matching as a reconstruction loss, a myopic proxy for behavioral similarity that suffers from compounding errors across sequential decisions. To overcome this bottleneck, we introduce Occupancy-based Policy Compression (OPC), which enhances APC by shifting behavior representation from immediate action-matching to long-horizon state-space coverage. Specifically, we propose two principal improvements: (1) we curate the dataset generation with an information-theoretic uniqueness metric that delivers a diverse population of policies; and (2) we propose a fully differentiable compression objective that directly minimizes the divergence between the true and reconstructed mixture occupancy distributions. These modifications force the generative model to organize the latent space around true functional similarity, promoting a latent representation that generalizes over a broad spectrum of behaviors while retaining most of the original parameter space's expressivity. Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.

Title: MOOZY: A Patient-First Foundation Model for Computational Pathology

Authors: Yousef Kotp, Vincent Quoc-Huy Trinh, Christopher Pal, Mahdi S. Hosseini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27048
Pdf URL: https://arxiv.org/pdf/2603.27048
Copy Paste: [[2603.27048]] MOOZY: A Patient-First Foundation Model for Computational Pathology(https://arxiv.org/abs/2603.27048)
Keywords: foundation model
Abstract: Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.

Title: Liquid Networks with Mixture Density Heads for Efficient Imitation Learning

Authors: Nikolaus Correll
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2603.27058
Pdf URL: https://arxiv.org/pdf/2603.27058
Copy Paste: [[2603.27058]] Liquid Networks with Mixture Density Heads for Efficient Imitation Learning(https://arxiv.org/abs/2603.27058)
Keywords: diffusion
Abstract: We compare liquid neural networks with mixture density heads against diffusion policies on Push-T, RoboMimic Can, and PointMaze under a shared-backbone comparison protocol that isolates policy-head effects under matched inputs, training budgets, and evaluation settings. Across tasks, liquid policies use roughly half the parameters (4.3M vs. 8.6M), achieve 2.4x lower offline prediction error, and run 1.8 faster at inference. In sample-efficiency experiments spanning 1% to 46.42% of training data, liquid models remain consistently more robust, with especially large gains in low-data and medium-data regimes. Closed-loop results on Push-T and PointMaze are directionally consistent with offline rankings but noisier, indicating that strong offline density modeling helps deployment while not fully determining closed-loop success. Overall, liquid recurrent multimodal policies provide a compact and practical alternative to iterative denoising for imitation learning.

Title: ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Authors: Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I. Ross, Daniel Karl I. Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zexue He, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Peter Staar, Luis Lastras, Aude Oliva, Rogerio Feris
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.27064
Pdf URL: https://arxiv.org/pdf/2603.27064
Copy Paste: [[2603.27064]] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding(https://arxiv.org/abs/2603.27064)
Keywords: foundation model
Abstract: Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. Moreover, a rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. The dataset is publicly available at this https URL

Title: LightCtrl: Training-free Controllable Video Relighting

Authors: Yizuo Peng, Xuelin Chen, Kai Zhang, Xiaodong Cun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27083
Pdf URL: https://arxiv.org/pdf/2603.27083
Copy Paste: [[2603.27083]] LightCtrl: Training-free Controllable Video Relighting(https://arxiv.org/abs/2603.27083)
Keywords: diffusion
Abstract: Recent diffusion models have achieved remarkable success in image relighting, and this success has quickly been extended to video relighting. However, existing methods offer limited explicit control over illumination in the relighted output. We present LightCtrl, the first controllable video relighting method that enables explicit control of video illumination through a user-supplied light trajectory in a training-free manner. Our approach combines pre-trained diffusion models: an image relighting model processes each frame individually, followed by a video diffusion prior to enhance temporal consistency. To achieve explicit control over dynamically varying lighting, we introduce two key components. First, a Light Map Injection module samples light trajectory-specific noise and injects it into the latent representation of the source video, improving illumination coherence with the conditional light trajectory. Second, a Geometry-Aware Relighting module dynamically combines RGB and normal map latents in the frequency domain to suppress the influence of the original lighting, further enhancing adherence to the input light trajectory. Experiments show that LightCtrl produces high-quality videos with diverse illumination changes that closely follow the specified light trajectory, demonstrating improved controllability over baseline methods. Code is available at: this https URL.

Title: SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views

Authors: Zijian He, enjie Liu, Yihao Wang, Weizhi Zhong, Huan Yuan, Kun Gai, Guangrun Wang, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27084
Pdf URL: https://arxiv.org/pdf/2603.27084
Copy Paste: [[2603.27084]] SceneExpander: Expanding 3D Scenes with Free-Form Inserted Views(https://arxiv.org/abs/2603.27084)
Keywords: generative
Abstract: World building with 3D scene representations is increasingly important for content creation, simulation, and interactive experiences, yet real workflows are inherently iterative: creators must repeatedly extend an existing scene under user control. Motivated by this research gap, we study 3D scene expansion in a user-centric workflow: starting from a real scene captured by multi-view images, we extend its coverage by inserting an additional view synthesized by a generative model. Unlike simple object editing or style transfer in a fixed scene, the inserted view is often 3D-misaligned with the original reconstruction, introducing geometry shifts, hallucinated content, or view-dependent artifacts that break global multi-view consistency. To address the challenge, we propose SceneExpander, which applies test-time adaptation to a parametric feed-forward 3D reconstruction model with two complementary distillation signals: anchor distillation stabilizes the original scene by distilling geometric cues from the captured views, while inserted-view self-distillation preserves observation-supported predictions yet adapts latent geometry and appearance to accommodate the misaligned inserted view. Experiments on ETH scenes and online data demonstrate improved expansion behavior and reconstruction quality under misalignment.

Title: EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow

Authors: Dogyun Park, Yanyu Li, Sergey Tulyakov, Anil Kag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27086
Pdf URL: https://arxiv.org/pdf/2603.27086
Copy Paste: [[2603.27086]] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow(https://arxiv.org/abs/2603.27086)
Keywords: diffusion
Abstract: Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the expensive quadratic complexity of attention per step, and the iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework, that tackles these bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block which is efficient, expressive, and remains highly stable under aggressive random token-dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe. We propose Path-Drop Guided training to replace the expensive guidance target with a computationally cheap, weak path. Furthermore, we augment this with a Mean-Velocity Additivity regularizer to ensure high fidelity at extremely low step counts. Together, our EFlow enables a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput over standard solution-flow, and 45.3x lower inference latency than standard iterative models with competitive performance on Kinetics and large-scale text-to-video datasets.

Title: PRUE: A Practical Recipe for Field Boundary Segmentation at Scale

Authors: Gedeon Muhawenayo, Caleb Robinson, Subash Khanal, Zhanpei Fang, Isaac Corley, Alexander Wollam, Tianyi Gao, Leonard Strnad, Ryan Avery, Lyndon Estes, Ana M. Tárano, Nathan Jacobs, Hannah Kerner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27101
Pdf URL: https://arxiv.org/pdf/2603.27101
Copy Paste: [[2603.27101]] PRUE: A Practical Recipe for Field Boundary Segmentation at Scale(https://arxiv.org/abs/2603.27101)
Keywords: foundation model
Abstract: Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76\% IoU and 47\% object-F1 on FTW, an increase of 6\% and 9\% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model-derived field boundary datasets for five countries.

Title: Semantic Interaction Information mediates compositional generalization in latent space

Authors: John Schwarcz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27134
Pdf URL: https://arxiv.org/pdf/2603.27134
Copy Paste: [[2603.27134]] Semantic Interaction Information mediates compositional generalization in latent space(https://arxiv.org/abs/2603.27134)
Keywords: self-supervised
Abstract: Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) where observations are generated jointly by multiple latent variables, yet feedback is provided for only a single goal variable. This setting allows us to define Semantic Interaction Information (SII): a metric measuring the contribution of latent variable interactions to task performance. Using SII, we analyze Recurrent Neural Networks (RNNs) provided with these interactions, finding that SII explains the accuracy gap between Echo State and Fully Trained networks. Our analysis also uncovers a theoretically predicted failure mode where confidence decouples from accuracy, suggesting that utilizing interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions must be learned by an embedding model. Learning how latent variables interact requires accurate inference, yet accurate inference depends on knowing those interactions. The Cognitive Gridworld reveals this circular dependence as a core challenge for continual meta-learning. We approach this dilemma via Representation Classification Chains (RCCs), a JEPA-style architecture that disentangles these processes: variable inference and variable embeddings are learned by separate modules through Reinforcement Learning and self-supervised learning, respectively. Lastly, we demonstrate that RCCs facilitate compositional generalization to novel combinations of relevant variables. Together, these results establish a grounded setting for evaluating goal-directed generalist agents.

Title: Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data

Authors: Shijie Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27135
Pdf URL: https://arxiv.org/pdf/2603.27135
Copy Paste: [[2603.27135]] Spectral-Aware Text-to-Time Series Generation with Billion-Scale Multimodal Meteorological Data(https://arxiv.org/abs/2603.27135)
Keywords: diffusion
Abstract: Text-to-time-series generation is particularly important in meteorology, where natural language offers intuitive control over complex, multi-scale atmospheric dynamics. Existing approaches are constrained by the lack of large-scale, physically grounded multimodal datasets and by architectures that overlook the spectral-temporal structure of weather signals. We address these challenges with a unified framework for text-guided meteorological time-series generation. First, we introduce MeteoCap-3B, a billion-scale weather dataset paired with expert-level captions constructed via a Multi-agent Collaborative Captioning (MACC) pipeline, yielding information-dense and physically consistent annotations. Building on this dataset, we propose MTransformer, a diffusion-based model that enables precise semantic control by mapping textual descriptions into multi-band spectral priors through a Spectral Prompt Generator, which guides generation via frequency-aware attention. Extensive experiments on real-world benchmarks demonstrate state-of-the-art generation quality, accurate cross-modal alignment, strong semantic controllability, and substantial gains in downstream forecasting under data-sparse and zero-shot settings. Additional results on general time-series benchmarks indicate that the proposed framework generalizes beyond meteorology.

Title: RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation

Authors: Yiyang Zou, Tianhao Zhao, Peilun Xiao, Hongyu Jin, Longyu Qi, Yuxuan Li, Liyin Liang, Yifeng Qian, Chunbo Lai, Yutian Lin, Zhihui Li, Yu Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27165
Pdf URL: https://arxiv.org/pdf/2603.27165
Copy Paste: [[2603.27165]] RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation(https://arxiv.org/abs/2603.27165)
Keywords: self-supervised, anomaly
Abstract: Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated "anomaly onset" frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose RiskProp, a novel collision-anchored self-supervised risk propagation paradigm for early accident anticipation, which removes the need for anomaly onset annotations and leverages only the reliably annotated collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model's next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.

Title: MEDIC-AD: Towards Medical Vision-Language Model's Clinical Intelligence

Authors: Woohyeon Park, Jaeik Kim, Sunghwan Steve Cho, Pa Hong, Wookyoung Jeong, Yoojin Nam, Namjoon Kim, Ginny Y. Wong, Ka Chun Cheung, Jaeyoung Do
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27176
Pdf URL: https://arxiv.org/pdf/2603.27176
Copy Paste: [[2603.27176]] MEDIC-AD: Towards Medical Vision-Language Model's Clinical Intelligence(https://arxiv.org/abs/2603.27176)
Keywords: anomaly
Abstract: Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens () encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter image difference tokens () explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, MEDIC-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that MEDIC-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows

Title: Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision

Authors: Yizhou Jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27179
Pdf URL: https://arxiv.org/pdf/2603.27179
Copy Paste: [[2603.27179]] Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision(https://arxiv.org/abs/2603.27179)
Keywords: anomaly
Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at this https URL.

Title: MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation

Authors: Xiaofeng Tan, Wanjiang Weng, Hongsong Wang, Fang Zhao, Xin Geng, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27185
Pdf URL: https://arxiv.org/pdf/2603.27185
Copy Paste: [[2603.27185]] MotionRFT: Unified Reinforcement Fine-Tuning for Text-to-Motion Generation(https://arxiv.org/abs/2603.27185)
Keywords: diffusion, generative
Abstract: Text-to-motion generation has advanced with diffusion- and flow-based generative models, yet supervised pretraining remains insufficient to align models with high-level objectives such as semantic consistency, realism, and human preference. Existing post-training methods have key limitations: they (1) target a specific motion representation, such as joints, (2) optimize a particular aspect, such as text-motion alignment, and may compromise other factors; and (3) incur substantial computational overhead, data dependence, and coarse-grained optimization. We present a reinforcement fine-tuning framework that comprises a heterogeneous-representation, multi-dimensional reward model, MotionReward, and an efficient, fine-grained fine-tuning method, EasyTune. To obtain a unified semantics representation, MotionReward maps heterogeneous motions into a shared semantic space anchored by text, enabling multidimensional reward learning; Self-refinement Preference Learning further enhances semantics without additional annotations. For efficient and effective fine-tuning, we identify the recursive gradient dependence across denoising steps as the key bottleneck, and propose EasyTune, which optimizes step-wise rather than over the full trajectory, yielding dense, fine-grained, and memory-efficient updates. Extensive experiments validate the effectiveness of our framework, achieving FID 0.132 at 22.10 GB peak memory for MLD model and saving up to 15.22 GB over DRaFT. It reduces FID by 22.9% on joint-based ACMDM, and achieves a 12.6% R-Precision gain and 23.3% FID improvement on rotation-based HY Motion. Our project page with code is publicly available.

Title: Let Triggers Control: Frequency-Aware Dropout for Effective Token Control

Authors: Junyoung Koh, Hoyeon Moon, Dongha Kim, Seungmin Lee, Sanghyun Park, Min Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27199
Pdf URL: https://arxiv.org/pdf/2603.27199
Copy Paste: [[2603.27199]] Let Triggers Control: Frequency-Aware Dropout for Effective Token Control(https://arxiv.org/abs/2603.27199)
Keywords: diffusion, generative
Abstract: Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models -- commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token -- has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token's semantic distinctiveness. To disentangle this, we propose Frequency-Aware Dropout (FAD) -- a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD~1.5 and SDXL) and natural language--driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.

Title: Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation

Authors: Guohuan Xie, Xin He, Dingying Fan, Le Zhang, Ming-Ming Cheng, Yun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27206
Pdf URL: https://arxiv.org/pdf/2603.27206
Copy Paste: [[2603.27206]] Make It Up: Fake Images, Real Gains in Generalized Few-shot Semantic Segmentation(https://arxiv.org/abs/2603.27206)
Keywords: diffusion
Abstract: Generalized few-shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel-class appearances under scarce annotations. While diffusion models can synthesize novel-class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation-enhanced GFSS framework designed to expand novel-class coverage while improving pseudo-label quality. Syn4Seg first maximizes prompt-space coverage by constructing an embedding-deduplicated prompt bank for each novel class, yielding diverse yet class-consistent synthetic images. It then performs support-guided pseudo-label estimation via a two-stage refinement that i) filters low-consistency regions to obtain high-precision seeds and ii) relabels uncertain pixels with image-adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary-band and unlabeled pixels using a constrained SAM-based update to improve contour fidelity without overwriting high-confidence interiors. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate consistent improvements in both 1-shot and 5-shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.

Title: LightMover: Generative Light Movement with Color and Intensity Controls

Authors: Gengze Zhou, Tianyu Wang, Soo Ye Kim, Zhixin Shu, Xin Yu, Yannick Hold-Geoffroy, Sumit Chaturvedi, Qi Wu, Zhe Lin, Scott Cohen
Subjects: cs.CV, cs.CL, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27209
Pdf URL: https://arxiv.org/pdf/2603.27209
Copy Paste: [[2603.27209]] LightMover: Generative Light Movement with Color and Intensity Controls(https://arxiv.org/abs/2603.27209)
Keywords: diffusion, generative
Abstract: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% while maintaining editing fidelity. To train our framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.

Title: NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather

Authors: Yanying Li, Jinyang Li, Shengfeng He, Yangyang Xu, Junyu Dong, Yong Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27228
Pdf URL: https://arxiv.org/pdf/2603.27228
Copy Paste: [[2603.27228]] NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather(https://arxiv.org/abs/2603.27228)
Keywords: self-supervised
Abstract: We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions. Code is available at this https URL.

Title: TrendGen: An Outfit Recommendation and Display System

Authors: Theodoros Koukopoulos, Dimos Klimenof, Ioannis Xarchakos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27264
Pdf URL: https://arxiv.org/pdf/2603.27264
Copy Paste: [[2603.27264]] TrendGen: An Outfit Recommendation and Display System(https://arxiv.org/abs/2603.27264)
Keywords: generative
Abstract: Recent advances in Computer Vision have significantly improved image understanding and generation, revolutionizing the fashion industry. However, challenges such as inconsistent lighting, non-ideal garment angles, complex backgrounds, and occlusions in raw images hinder their full potential. Overcoming these obstacles is crucial for developing robust fashion AI systems capable of real-world applications. In this paper, we introduce TrendGen, a Fashion AI system designed to enhance online shopping with intelligent outfit recommendations. Deployed on a major e-commerce platform, TrendGen leverages cloth images and product attributes to generate trend-aligned, cohesive outfit suggestions. Additionally, it employs Generative AI to transform raw images into high-quality lay-down views, offering a clear and structured presentation of garments. Our evaluation on production data demonstrates TrendGen's consistent high-quality outfits and lay-down images, marking a significant advancement in AI-driven solutions for fashion retail.

Title: TrackMAE: Video Representation Learning via Track Mask and Predict

Authors: Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27268
Pdf URL: https://arxiv.org/pdf/2603.27268
Copy Paste: [[2603.27268]] TrackMAE: Video Representation Learning via Track Mask and Predict(https://arxiv.org/abs/2603.27268)
Keywords: self-supervised
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos, generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve random tube masking with a motion-aware masking strategy. We enhance video representations learned in both pixel and feature semantic reconstruction spaces by providing a complementary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms state-of-the-art video self-supervised learning baselines, learning more discriminative and generalizable representations. Code available at this https URL

Title: TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR

Authors: Ted Lentsch, Santiago Montiel-Marín, Holger Caesar, Dariu M. Gavrila
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27344
Pdf URL: https://arxiv.org/pdf/2603.27344
Copy Paste: [[2603.27344]] TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR(https://arxiv.org/abs/2603.27344)
Keywords: self-supervised
Abstract: LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. However, existing methods are either handcrafted for specific sensor configurations or rely on costly per-point manual labels, severely limiting their generalization and scalability. To overcome this, we introduce TerraSeg, the first self-supervised, domain-agnostic model for LiDAR ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes data from 12 major public benchmarks. Spanning almost 22 million raw scans across 15 distinct sensor models, OmniLiDAR provides unprecedented diversity for learning a highly generalizable ground model. To supervise training without human annotations, we propose PseudoLabeler, a novel module that generates high-quality ground and non-ground labels through self-supervised per-scan runtime optimization. Extensive evaluations demonstrate that, despite using no manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception while delivering real-time performance. Our code and model weights are publicly available.

Title: Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach

Authors: Maziar Kianimoghadam Jouneghani
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.27356
Pdf URL: https://arxiv.org/pdf/2603.27356
Copy Paste: [[2603.27356]] Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach(https://arxiv.org/abs/2603.27356)
Keywords: in-context
Abstract: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.

Title: HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors

Authors: Ke Li, Tianjia Yang, Kaidi Liang, Xianbiao Hu, Ruwen Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27371
Pdf URL: https://arxiv.org/pdf/2603.27371
Copy Paste: [[2603.27371]] HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors(https://arxiv.org/abs/2603.27371)
Keywords: diffusion
Abstract: Video prediction is a useful function for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and remain insufficient in modeling the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed deep learning system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM outperforms state-of-the-art video prediction methods with efficiency, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration setting. The implementation codes are publicly available at this https URL.

Title: Active In-Context Learning for Tabular Foundation Models

Authors: Wilailuck Treerath, Fabrizio Pittorino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27385
Pdf URL: https://arxiv.org/pdf/2603.27385
Copy Paste: [[2603.27385]] Active In-Context Learning for Tabular Foundation Models(https://arxiv.org/abs/2603.27385)
Keywords: foundation model, in-context
Abstract: Active learning (AL) reduces labeling cost by querying informative samples, but in tabular settings its cold-start gains are often limited because uncertainty estimates are unreliable when models are trained on very few labels. Tabular foundation models such as TabPFN provide calibrated probabilistic predictions via in-context learning (ICL), i.e., without task-specific weight updates, enabling an AL regime in which the labeled context - rather than parameters - is iteratively optimized. We formalize Tabular Active In-Context Learning (Tab-AICL) and instantiate it with four acquisition rules: uncertainty (TabPFN-Margin), diversity (TabPFN-Coreset), an uncertainty-diversity hybrid (TabPFN-Hybrid), and a scalable two-stage method (TabPFN-Proxy-Hybrid) that shortlists candidates using a lightweight linear proxy before TabPFN-based selection. Across 20 classification benchmarks, Tab-AICL improves cold-start sample efficiency over retrained gradient-boosting baselines (CatBoost-Margin and XGBoost-Margin), measured by normalized AULC up to 100 labeled samples.

Title: K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)

Authors: Abdulrahman Albaiz, Fathi Amsaad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27393
Pdf URL: https://arxiv.org/pdf/2603.27393
Copy Paste: [[2603.27393]] K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)(https://arxiv.org/abs/2603.27393)
Keywords: anomaly
Abstract: This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the "Train Once, Share Everywhere" (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.

Title: The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

Authors: Isaac Llorente-Saguer
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.27412
Pdf URL: https://arxiv.org/pdf/2603.27412
Copy Paste: [[2603.27412]] The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams(https://arxiv.org/abs/2603.27412)
Keywords: generative, anomaly
Abstract: We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and \emph{abliterated} (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq$0.937 for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.

Title: Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce

Authors: Nikolas Chatzis, Angeliki Tsinouka, Katerina Papadimitriou, Niki Efthymiou, Marios Glytsos, George Retsinas, Paris Oikonomou, Gerasimos Potamianos, Petros Maragos, Panagiotis Paraskevas Filntisis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27429
Pdf URL: https://arxiv.org/pdf/2603.27429
Copy Paste: [[2603.27429]] Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce(https://arxiv.org/abs/2603.27429)
Keywords: generative
Abstract: Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To bridge such lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.

Title: GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback

Authors: Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, Faez Ahmed
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2603.27448
Pdf URL: https://arxiv.org/pdf/2603.27448
Copy Paste: [[2603.27448]] GIFT: Bootstrapping Image-to-CAD Program Synthesis via Geometric Feedback(https://arxiv.org/abs/2603.27448)
Keywords: generative
Abstract: Generating executable CAD programs from images requires alignment between visual geometry and symbolic program representations, a capability that current methods fail to learn reliably as design complexity increases. Existing fine-tuning approaches rely on either limited supervised datasets or expensive post-training pipelines, resulting in brittle systems that restrict progress in generative CAD design. We argue that the primary bottleneck lies not in model or algorithmic capacity, but in the scarcity of diverse training examples that align visual geometry with program syntax. This limitation is especially acute because the collection of diverse and verified engineering datasets is both expensive and difficult to scale, constraining the development of robust generative CAD models. We introduce Geometric Inference Feedback Tuning (GIFT), a data augmentation framework that leverages geometric feedback to turn test-time compute into a bootstrapped set of high-quality training samples. GIFT combines two mechanisms: Soft-Rejection Sampling (GIFT-REJECT), which retains diverse high-fidelity programs beyond exact ground-truth matches, and Failure-Driven Augmentation (GIFT-FAIL), which converts near-miss predictions into synthetic training examples that improve robustness on challenging geometries. By amortizing inference-time search into the model parameters, GIFT captures the benefits of test-time scaling while reducing inference compute by 80%. It improves mean IoU by 12% over a strong supervised baseline and remains competitive with more complex multimodal systems, without requiring additional human annotation or specialized architectures.

Title: LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model

Authors: Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, Yue Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27449
Pdf URL: https://arxiv.org/pdf/2603.27449
Copy Paste: [[2603.27449]] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model(https://arxiv.org/abs/2603.27449)
Keywords: generative
Abstract: Learning human-object manipulation presents significant challenges due to its fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environment. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment contexts during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME demonstrates not only high action-following accuracy and strong generalization to unseen scenarios, but also realistic physical consequences of hand-object interactions, e.g., liquid flowing from a bottle into a mug after executing a ``pouring'' action. Extensive experiments demonstrate that our video-based framework significantly outperforms state-of-the-art image based and video-based action-conditioned methods and Image/Text-to-Video (I/T2V) generative model in terms of both temporal consistency and motion control. LOME paves the way for photorealistic AR/VR experiences and scalable robotic training, without being limited to simulated environments or relying on explicit 3D/4D modeling.

Title: FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies

Authors: Chenxiao Gao, Edward Chen, Tianyi Chen, Bo Dai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27450
Pdf URL: https://arxiv.org/pdf/2603.27450
Copy Paste: [[2603.27450]] FlowRL: A Taxonomy and Modular Framework for Reinforcement Learning with Diffusion Policies(https://arxiv.org/abs/2603.27450)
Keywords: diffusion, generative
Abstract: Thanks to their remarkable flexibility, diffusion models and flow models have emerged as promising candidates for policy representation. However, efficient reinforcement learning (RL) upon these policies remains a challenge due to the lack of explicit log-probabilities for vanilla policy gradient estimators. While numerous attempts have been proposed to address this, the field lacks a unified perspective to reconcile these seemingly disparate methods, thus hampering ongoing development. In this paper, we bridge this gap by introducing a comprehensive taxonomy for RL algorithms with diffusion/flow policies. To support reproducibility and agile prototyping, we introduce a modular, JAX-based open-source codebase that leverages JIT-compilation for high-throughput training. Finally, we provide systematic and standardized benchmarks across Gym-Locomotion, DeepMind Control Suite, and IsaacLab, offering a rigorous side-by-side comparison of diffusion-based methods and guidance for practitioners to choose proper algorithms based on the application. Our work establishes a clear foundation for understanding and algorithm design, a high-efficiency toolkit for future research in the field, and an algorithmic guideline for practitioners in generative models and robotics. Our code is available at this https URL.

Title: From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

Authors: Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27455
Pdf URL: https://arxiv.org/pdf/2603.27455
Copy Paste: [[2603.27455]] From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis(https://arxiv.org/abs/2603.27455)
Keywords: self-supervised
Abstract: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at this https URL.

Title: Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Authors: Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27460
Pdf URL: https://arxiv.org/pdf/2603.27460
Copy Paste: [[2603.27460]] Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development(https://arxiv.org/abs/2603.27460)
Keywords: foundation model
Abstract: Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

Title: Transferring Physical Priors into Remote Sensing Segmentation via Large Language Models

Authors: Yuxi Lu, Kunqi Li, Zhidong Li, Xiaohan Su, Biao Wu, Chenya Huang, Bin Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27504
Pdf URL: https://arxiv.org/pdf/2603.27504
Copy Paste: [[2603.27504]] Transferring Physical Priors into Remote Sensing Segmentation via Large Language Models(https://arxiv.org/abs/2603.27504)
Keywords: foundation model
Abstract: Semantic segmentation of remote sensing imagery is fundamental to Earth observation. Achieving accurate results requires integrating not only optical images but also physical variables such as the Digital Elevation Model (DEM), Synthetic Aperture Radar (SAR) and Normalized Difference Vegetation Index (NDVI). Recent foundation models (FMs) leverage pre-training to exploit these variables but still depend on spatially aligned data and costly retraining when involving new sensors. To overcome these limitations, we introduce a novel paradigm for integrating domain-specific physical priors into segmentation models. We first construct a Physical-Centric Knowledge Graph (PCKG) by prompting large language models to extract physical priors from 1,763 vocabularies, and use it to build a heterogeneous, spatial-aligned dataset, Phy-Sky-SA. Building on this foundation, we develop PriorSeg, a physics-aware residual refinement model trained with a joint visual-physical strategy that incorporates a novel physics-consistency loss. Experiments on heterogeneous settings demonstrate that PriorSeg improves segmentation accuracy and physical plausibility without retraining the FMs. Ablation studies verify the effectiveness of the Phy-Sky-SA dataset, the PCKG, and the physics-consistency loss.

Title: Understanding Semantic Perturbations on In-Processing Generative Image Watermarks

Authors: Anirudh Nakra, Min Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27513
Pdf URL: https://arxiv.org/pdf/2603.27513
Copy Paste: [[2603.27513]] Understanding Semantic Perturbations on In-Processing Generative Image Watermarks(https://arxiv.org/abs/2603.27513)
Keywords: generative
Abstract: The widespread deployment of high-fidelity generative models has intensified the need for reliable mechanisms for provenance and content authentication. In-processing watermarking, embedding a signature into the generative model's synthesis procedure, has been advocated as a solution and is often reported to be robust to standard post-processing (such as geometric transforms and filtering). Yet robustness to semantic manipulations that alter high-level scene content while maintaining reasonable visual quality is not well studied or understood. We introduce a simple, multi-stage framework for systematically stress-testing in-processing generative watermarks under semantic drift. The framework utilizes off-the-shelf models for object detection, mask generation, and semantically guided inpainting or regeneration to produce controlled, meaning-altering edits with minimal perceptual degradation. Based on extensive experiments on representative schemes, we find that robustness varies significantly with the degree of semantic entanglement: methods by which watermarks remain detectable under a broad suite of conventional perturbations can fail under semantic edits, with watermark detectability in many cases dropping to near zero while image quality remains high. Overall, our results reveal a critical gap in current watermarking evaluations and suggest that watermark designs and benchmarking must explicitly account for robustness against semantic manipulation.

Title: SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision

Authors: Shuai Xiang, Wei Guo, James Burridge, Shouyang Liu, Hao Lu, Tokihiro Fukatsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27519
Pdf URL: https://arxiv.org/pdf/2603.27519
Copy Paste: [[2603.27519]] SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision(https://arxiv.org/abs/2603.27519)
Keywords: diffusion, foundation model
Abstract: Vision Foundation Models (VFM) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce $SPROUT$ ($S$calable $P$lant $R$epresentation model via $O$pen-field $U$nsupervised $T$raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free Pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising and enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at this https URL.

Title: LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Authors: Meituan LongCat Team: Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.27538
Pdf URL: https://arxiv.org/pdf/2603.27538
Copy Paste: [[2603.27538]] LongCat-Next: Lexicalizing Modalities as Discrete Tokens(https://arxiv.org/abs/2603.27538)
Keywords: foundation model
Abstract: The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: this https URL

Title: Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps

Authors: Fulong Ma, Daojie Peng, Jun Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27553
Pdf URL: https://arxiv.org/pdf/2603.27553
Copy Paste: [[2603.27553]] Annotation-Free Detection of Drivable Areas and Curbs Leveraging LiDAR Point Cloud Maps(https://arxiv.org/abs/2603.27553)
Keywords: self-supervised
Abstract: Drivable areas and curbs are critical traffic elements for autonomous driving, forming essential components of the vehicle visual perception system and ensuring driving safety. Deep neural networks (DNNs) have significantly improved perception performance for drivable area and curb detection, but most DNN-based methods rely on large manually labeled datasets, which are costly, time-consuming, and expert-dependent, limiting their real-world application. Thus, we developed an automated training data generation module. Our previous work generated training labels using single-frame LiDAR and RGB data, suffering from occlusion and distant point cloud sparsity. In this paper, we propose a novel map-based automatic data labeler (MADL) module, combining LiDAR mapping/localization with curb detection to automatically generate training data for both tasks. MADL avoids occlusion and point cloud sparsity issues via LiDAR mapping, creating accurate large-scale datasets for DNN training. In addition, we construct a data review agent to filter the data generated by the MADL module, eliminating low-quality samples. Experiments on the KITTI, KITTI-CARLA and 3D-Curb datasets show that MADL achieves impressive performance compared to manual labeling, and outperforms traditional and state-of-the-art self-supervised methods in robustness and accuracy.

Title: PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal

Authors: Dinh-Khoi Vo, Van-Loc Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27555
Pdf URL: https://arxiv.org/pdf/2603.27555
Copy Paste: [[2603.27555]] PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal(https://arxiv.org/abs/2603.27555)
Keywords: diffusion
Abstract: Removing objects from natural images is challenging due to difficulty of synthesizing semantically coherent content while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal. We propose a novel zero-shot object removal framework, namely PANDORA, that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, prompts, or optimization. We propose Pixel-wise Attention Dissolution to remove object by nullifying the most correlated attention keys for masked pixels, effectively eliminating the object from self-attention flow and allowing background context to dominate reconstruction. We further introduce Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal. Together, these components enable precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass. Experiments demonstrate superior visual fidelity and semantic plausibility compared to state-of-the-art methods. The project page is available at this https URL.

Title: STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding

Authors: Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, Yong Man Ro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27593
Pdf URL: https://arxiv.org/pdf/2603.27593
Copy Paste: [[2603.27593]] STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding(https://arxiv.org/abs/2603.27593)
Keywords: diffusion
Abstract: Recent progress in video large language models (Video-LLMs) has enabled strong offline reasoning over long and complex videos. However, real-world deployments increasingly require streaming perception and proactive interaction, where video frames arrive online and the system must decide not only what to respond, but also when to respond. In this work, we revisit proactive activation in streaming video as a structured sequence modeling problem, motivated by the observation that temporal transitions in streaming video naturally form span-structured activation patterns. To capture this span-level structure, we model activation signals jointly over a sliding temporal window and update them iteratively as new frames arrive. We propose STRIDE (Structured Temporal Refinement with Iterative DEnoising), which employs a lightweight masked diffusion module at the activation interface to jointly predict and progressively refine activation signals across the window. Extensive experiments on diverse streaming benchmarks and downstream models demonstrate that STRIDE shows more reliable and temporally coherent proactive responses, significantly improving when-to-speak decision quality in online streaming scenarios.

Title: You Only Erase Once: Erasing Anything without Bringing Unexpected Content

Authors: Yixing Zhu, Qing Zhang, Wenju Xu, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27599
Pdf URL: https://arxiv.org/pdf/2603.27599
Copy Paste: [[2603.27599]] You Only Erase Once: Erasing Anything without Bringing Unexpected Content(https://arxiv.org/abs/2603.27599)
Keywords: diffusion
Abstract: We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods which struggle to erase target objects without generating unexpected content within the masked regions due to lack of sufficient paired training data and explicit constraint on content generation, our method allows to produce high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving the overall context coherence to the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train for a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Code will be available at this https URL.

Title: On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

Authors: Mohammad Tinati, Stephen Tu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27631
Pdf URL: https://arxiv.org/pdf/2603.27631
Copy Paste: [[2603.27631]] On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry(https://arxiv.org/abs/2603.27631)
Keywords: self-supervised
Abstract: Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.

Title: OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation

Authors: Sanghyeon Lee, Minwoo Lee, Euijin Shin, Kangyeol Kim, Seunghwan Choi, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27637
Pdf URL: https://arxiv.org/pdf/2603.27637
Copy Paste: [[2603.27637]] OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation(https://arxiv.org/abs/2603.27637)
Keywords: diffusion, in-context
Abstract: We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.

Title: OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery

Authors: Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27645
Pdf URL: https://arxiv.org/pdf/2603.27645
Copy Paste: [[2603.27645]] OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery(https://arxiv.org/abs/2603.27645)
Keywords: diffusion, foundation model
Abstract: Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at this https URL.

Title: Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling

Authors: Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran, Mehrtash Harandi, Dinh Phung, Trung Le
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.27665
Pdf URL: https://arxiv.org/pdf/2603.27665
Copy Paste: [[2603.27665]] Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling(https://arxiv.org/abs/2603.27665)
Keywords: diffusion, generative
Abstract: Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.

Title: Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

Authors: Yuhe Liu, Zhenxiong Tan, Yujia Hu, Songhua Liu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27666
Pdf URL: https://arxiv.org/pdf/2603.27666
Copy Paste: [[2603.27666]] Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers(https://arxiv.org/abs/2603.27666)
Keywords: diffusion
Abstract: Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively integrates multi-type conditional inputs, such as spatially aligned and non-aligned cues. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability.

Title: CrossHGL: A Text-Free Foundation Model for Cross-Domain Heterogeneous Graph Learning

Authors: Xuanze Chen, Jiajun Zhou, Yadong Li, Shanqing Yu, Qi Xuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27685
Pdf URL: https://arxiv.org/pdf/2603.27685
Copy Paste: [[2603.27685]] CrossHGL: A Text-Free Foundation Model for Cross-Domain Heterogeneous Graph Learning(https://arxiv.org/abs/2603.27685)
Keywords: self-supervised, foundation model
Abstract: Heterogeneous graph representation learning (HGRL) is essential for modeling complex systems with diverse node and edge types. However, most existing methods are limited to closed-world settings with shared schemas and feature spaces, hindering cross-domain generalization. While recent graph foundation models improve transferability, they often target homogeneous graphs, rely on domain-specific schemas, or require rich textual attributes. Consequently, text-free and few-shot cross-domain HGRL remains underexplored. To address this, we propose CrossHGL, a foundation framework that preserves and transfers multi-relational structural semantics without external textual supervision. Specifically, a semantic-preserving transformation strategy homogenizes heterogeneous graphs while encoding interaction semantics into edge features. Based on this, a prompt-aware multi-domain pre-training framework with a Tri-Prompt mechanism captures transferable knowledge across feature, edge, and structure perspectives via self-supervised contrastive learning. For target-domain adaptation, we develop a parameter-efficient fine-tuning strategy that freezes the pre-trained backbone and performs few-shot classification via prompt composition and prototypical learning. Experiments on node-level and graph-level tasks show that CrossHGL consistently outperforms state-of-the-art baselines, yielding average relative improvements of 25.1% and 7.6% in Micro-F1 for node and graph classification, respectively, while remaining competitive in challenging feature-degenerated settings.

Title: LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

Authors: Shentong Mo, Sukmin Yun
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.MM
Abstract URL: https://arxiv.org/abs/2603.27693
Pdf URL: https://arxiv.org/pdf/2603.27693
Copy Paste: [[2603.27693]] LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation(https://arxiv.org/abs/2603.27693)
Keywords: foundation model
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.

Title: Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?

Authors: Samik Some, Vinay P. Namboodiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27697
Pdf URL: https://arxiv.org/pdf/2603.27697
Copy Paste: [[2603.27697]] Can Unsupervised Segmentation Reduce Annotation Costs for Video Semantic Segmentation?(https://arxiv.org/abs/2603.27697)
Keywords: foundation model
Abstract: Present-day deep neural networks for video semantic segmentation require a large number of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state-of-the-art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames as well as coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that if used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.

Title: Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM-based In-Context Learning in Medical Diagnosis

Authors: Wenkai Zhao, Zipei Wang, Mengjie Fang, Di Dong, Jie Tian, Lingwei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27737
Pdf URL: https://arxiv.org/pdf/2603.27737
Copy Paste: [[2603.27737]] Synergizing Discriminative Exemplars and Self-Refined Experience for MLLM-based In-Context Learning in Medical Diagnosis(https://arxiv.org/abs/2603.27737)
Keywords: in-context
Abstract: General Multimodal Large Language Models (MLLMs) often underperform in capturing domain-specific nuances in medical diagnosis, trailing behind fully supervised baselines. Although fine-tuning provides a remedy, the high costs of expert annotation and massive computational overhead limit its scalability. To bridge this gap without updating the weights of the pre-trained backbone of the MLLM, we propose a Clinician Mimetic Workflow. This is a novel In-Context Learning (ICL) framework designed to synergize Discriminative Exemplar Coreset Selection (DECS) and Self-Refined Experience Summarization (SRES). Specifically, DECS simulates a clinician's ability to reference "anchor cases" by selecting discriminative visual coresets from noisy data at the computational level; meanwhile, SRES mimics the cognition and reflection in clinical diagnosis by distilling diverse rollouts into a dynamic textual Experience Bank. Extensive evaluation across all 12 datasets of the MedMNIST 2D benchmark demonstrates that our method outperforms zero-shot general and medical MLLMs. Simultaneously, it achieves performance levels comparable to fully supervised vision models and domain-specific fine-tuned MLLMs, setting a new benchmark for parameter-efficient medical in-context learning. Our code is available at an anonymous repository: this https URL.

Title: AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification

Authors: Emily A Cooper, Hany Farid
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27747
Pdf URL: https://arxiv.org/pdf/2603.27747
Copy Paste: [[2603.27747]] AI-Powered Facial Mask Removal Is Not Suitable For Biometric Identification(https://arxiv.org/abs/2603.27747)
Keywords: generative
Abstract: Recently, crowd-sourced online criminal investigations have used generative-AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a wide-spread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

Title: When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Authors: Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27759
Pdf URL: https://arxiv.org/pdf/2603.27759
Copy Paste: [[2603.27759]] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models(https://arxiv.org/abs/2603.27759)
Keywords: generative
Abstract: Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations-such as wrinkles on flexible surfaces-remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.

Title: What-If Explanations Over Time: Counterfactuals for Time Series Classification

Authors: Udo Schlegel, Thomas Seidl
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.27792
Pdf URL: https://arxiv.org/pdf/2603.27792
Copy Paste: [[2603.27792]] What-If Explanations Over Time: Counterfactuals for Time Series Classification(https://arxiv.org/abs/2603.27792)
Keywords: generative
Abstract: Counterfactual explanations emerge as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model's prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we implemented an open-source implementation library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library's contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.

Title: Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection

Authors: Nusrat Tasnim, Kutub Uddin, Khalid Malik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27800
Pdf URL: https://arxiv.org/pdf/2603.27800
Copy Paste: [[2603.27800]] Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection(https://arxiv.org/abs/2603.27800)
Keywords: diffusion, generative
Abstract: The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present \textbf{Diversity Matters}, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. \textbf{Diversity Matters} highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.

Title: Towards Context-Aware Image Anonymization with Multi-Agent Reasoning

Authors: Robert Aufschläger, Jakob Folz, Gautam Savaliya, Manjitha D Vidanalage, Michael Heigl, Martin Schramm
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.27817
Pdf URL: https://arxiv.org/pdf/2603.27817
Copy Paste: [[2603.27817]] Towards Context-Aware Image Anonymization with Multi-Agent Reasoning(https://arxiv.org/abs/2603.27817)
Keywords: diffusion
Abstract: Street-level imagery contains personally identifiable information (PII), some of which is context-dependent. Existing anonymization methods either over-process images or miss subtle identifiers, while API-based solutions compromise data sovereignty. We present an agentic framework CAIAMAR (\underline{C}ontext-\underline{A}ware \underline{I}mage \underline{A}nonymization with \underline{M}ulti-\underline{A}gent \underline{R}easoning) for context-aware PII segmentation with diffusion-based anonymization, combining pre-defined processing for high-confidence cases with multi-agent reasoning for indirect identifiers. Three specialized agents coordinate via round-robin speaker selection in a Plan-Do-Check-Act (PDCA) cycle, enabling large vision-language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially-filtered coarse-to-fine detection where a scout-and-zoom strategy identifies candidates, open-vocabulary segmentation processes localized crops, and $IoU$-based deduplication ($30\%$ threshold) prevents redundant processing. Modal-specific diffusion guidance with appearance decorrelation substantially reduces re-identification (Re-ID) risks. On CUHK03-NP, our method reduces person Re-ID risk by $73\%$ ($R1$: $16.9\%$ vs. $62.4\%$ baseline). For image quality preservation on CityScapes, we achieve KID: $0.001$, and FID: $9.1$, significantly outperforming existing anonymization. The agentic workflow detects non-direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on-premise with open-source models, the framework generates human-interpretable audit trails supporting EU's GDPR transparency requirements while flagging failed cases for human review.

Title: Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation

Authors: Irene Kim, Sai Tanmay Reddy Chakkera, Alexandros Graikos, Dimitris Samaras, Akshat Dave
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27891
Pdf URL: https://arxiv.org/pdf/2603.27891
Copy Paste: [[2603.27891]] Poppy: Polarization-based Plug-and-Play Guidance for Enhancing Monocular Normal Estimation(https://arxiv.org/abs/2603.27891)
Keywords: diffusion
Abstract: Monocular surface normal estimators trained on large-scale RGB-normal data often perform poorly in the edge cases of reflective, textureless, and dark surfaces. Polarization encodes surface orientation independently of texture and albedo, offering a physics-based complement for these cases. Existing polarization methods, however, require multi-view capture or specialized training data, limiting generalization. We introduce Poppy, a training-free framework that refines normals from any frozen RGB backbone using single-shot polarization measurements at test time. Keeping backbone weights frozen, Poppy optimizes per-pixel offsets to the input RGB and output normal along with a learned reflectance decomposition. A differentiable rendering layer converts the refined normals into polarization predictions and penalizes mismatches with the observed signal. Across seven benchmarks and three backbone architectures (diffusion, flow, and feed-forward), Poppy reduces mean angular error by 23-26% on synthetic data and 6-16% on real data. These results show that guiding learned RGB-based normal estimators with polarization cues at test time refines normals on challenging surfaces without retraining.

Title: FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation

Authors: Liuzhou Zhang, Zeyu Zhang, Biao Wu, Luyao Tang, Zirui Song, Hongyang He, Renda Han, Guangzhen Yao, Huacan Wang, Ronghao Chen, Xiuying Chen, Guan Huang, Zheng Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27915
Pdf URL: https://arxiv.org/pdf/2603.27915
Copy Paste: [[2603.27915]] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation(https://arxiv.org/abs/2603.27915)
Keywords: diffusion, generative
Abstract: Sign language plays a crucial role in bridging communication gaps between the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model based on the a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA integrates trainable sparsity into both training and inference, ensuring consistency and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: this https URL.

Title: Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs

Authors: Ehsan Zeraatkar, Rodion Podorozhny, Jelena Tešić
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27929
Pdf URL: https://arxiv.org/pdf/2603.27929
Copy Paste: [[2603.27929]] Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs(https://arxiv.org/abs/2603.27929)
Keywords: diffusion
Abstract: Reconstructing continuous physical fields from sparse, irregular observations is a central challenge in scientific machine learning, particularly for systems governed by partial differential equations (PDEs). Existing physics-informed methods typically enforce governing equations as soft penalty terms during optimization, often leading to gradient imbalance, instability, and degraded physical consistency under limited data. We introduce the Physics-Guided Transformer (PGT), a neural architecture that embeds physical structure directly into the self-attention mechanism. Specifically, PGT incorporates a heat-kernel-derived additive bias into attention logits, encoding diffusion dynamics and temporal causality within the representation. Query coordinates attend to these physics-conditioned context tokens, and the resulting features are decoded using a FiLM-modulated sinusoidal implicit network that adaptively controls spectral response. We evaluate PGT on the one-dimensional heat equation and two-dimensional incompressible Navier-Stokes systems. In sparse 1D reconstruction with 100 observations, PGT achieves a relative L2 error of 5.9e-3, significantly outperforming both PINNs and sinusoidal representations. In the 2D cylinder wake problem, PGT uniquely achieves both low PDE residual (8.3e-4) and competitive relative error (0.034), outperforming methods that optimize only one objective. These results demonstrate that embedding physics within attention improves stability, generalization, and physical fidelity under data-scarce conditions.

Title: Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute

Authors: Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.27950
Pdf URL: https://arxiv.org/pdf/2603.27950
Copy Paste: [[2603.27950]] Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute(https://arxiv.org/abs/2603.27950)
Keywords: generative
Abstract: Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.

Title: MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation

Authors: Ruiyao Liu, Hui Shen, Ping Zhang, Yunta Hsieh, Yifan Zhang, Jing Xu, Sicheng Chen, Junchen Li, Jiawei Lu, Jianing Ma, Jiaqi Mo, Qi Han, Zhen Zhang, Zhongwei Wan, Jing Xiong, Xin Wang, Ziyuan Liu, Hangrui Cao, Ngai Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.27959
Pdf URL: https://arxiv.org/pdf/2603.27959
Copy Paste: [[2603.27959]] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation(https://arxiv.org/abs/2603.27959)
Keywords: generative
Abstract: Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. Can generative models still do so when the answer must be rendered visually rather than written in text? To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0% overall accuracy, while open-source models achieve just ~ 1-11%, often near 0% on structured tasks. Overall, current T2I models remain far from competent at even elementary mathematical visual generation.

Title: Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment

Authors: Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.27987
Pdf URL: https://arxiv.org/pdf/2603.27987
Copy Paste: [[2603.27987]] Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment(https://arxiv.org/abs/2603.27987)
Keywords: diffusion
Abstract: The high cost and accessibility problem associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performances for low data volumes, and it extends well to high data volumes, where it nearly reduces the dataset size by half with no performance degradation.

Title: From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers

Authors: Nihal Sanjay Singh, Mazdak Mohseni-Rajaee, Shaila Niazi, Kerem Y. Camsari
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2603.27996
Pdf URL: https://arxiv.org/pdf/2603.27996
Copy Paste: [[2603.27996]] From Independent to Correlated Diffusion: Generalized Generative Modeling with Probabilistic Computers(https://arxiv.org/abs/2603.27996)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a powerful framework for generative tasks in deep learning. They decompose generative modeling into two computational primitives: deterministic neural-network evaluation and stochastic sampling. Current implementations usually place most computation in the neural network, but diffusion as a framework allows a broader range of choices for the stochastic transition kernel. Here, we generalize the stochastic sampling component by replacing independent noise injection with Markov chain Monte Carlo (MCMC) dynamics that incorporate known interaction structure. Standard independent diffusion is recovered as a special case when couplings are set to zero. By explicitly incorporating Ising couplings into the diffusion dynamics, the noising and denoising processes exploit spatial correlations representative of the target system. The resulting framework maps naturally onto probabilistic computers (p-computers) built from probabilistic bits (p-bits), which provide orders-of-magnitude advantages in sampling throughput and energy efficiency over GPUs. We demonstrate the approach on equilibrium states of the 2D ferromagnetic Ising model and the 3D Edwards-Anderson spin glass, showing that correlated diffusion produces samples in closer agreement with MCMC reference distributions than independent diffusion. More broadly, the framework shows that p-computers can enable new classes of diffusion algorithms that exploit structured probabilistic sampling for generative modeling.

Title: Diffusion Maps is not Dimensionality Reduction

Authors: Julio Candanedo, Alejandro Patiño
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28037
Pdf URL: https://arxiv.org/pdf/2603.28037
Copy Paste: [[2603.28037]] Diffusion Maps is not Dimensionality Reduction(https://arxiv.org/abs/2603.28037)
Keywords: diffusion
Abstract: Diffusion maps (DMAP) are often used as a dimensionality-reduction tool, but more precisely they provide a spectral representation of the intrinsic geometry rather than a complete charting method. To illustrate this distinction, we study a Swiss roll with known isometric coordinates and compare DMAP, Isomap, and UMAP across latent dimensions. For each representation, we fit an oracle affine readout to the ground-truth chart and measure reconstruction error. Isomap most efficiently recovers the low-dimensional chart, UMAP provides an intermediate tradeoff, and DMAP becomes accurate only after combining multiple diffusion modes. Thus the correct chart lies in the span of diffusion coordinates, but standard DMAP do not by themselves identify the appropriate combination.

Title: Seeing the Unseen: Rethinking Illicit Promotion Detection with In-Context Learning

Authors: Sangyi Wu, Junpu Guo, Xianghang Mi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.28043
Pdf URL: https://arxiv.org/pdf/2603.28043
Copy Paste: [[2603.28043]] Seeing the Unseen: Rethinking Illicit Promotion Detection with In-Context Learning(https://arxiv.org/abs/2603.28043)
Keywords: in-context
Abstract: Illicit online promotion is a persistent threat that evolves to evade detection. Existing moderation systems remain tethered to platform-specific supervision and static taxonomies, a reactive paradigm that struggles to generalize across domains or uncover novel threats. This paper presents a systematic study of In-Context Learning (ICL) as a unified framework for illicit promotion detection. Through rigorous analysis, we show that properly configured ICL achieves performance comparable to fine-tuned models using 22x fewer labeled examples. We demonstrate three key capabilities: (1) Generalization to unseen threats: ICL generalizes to new illicit categories without category-specific demonstrations, with a performance drop of less than 6% for most evaluated categories. (2) Autonomous discovery: A novel two-stage pipeline distills 2,900 free-form labels into coherent taxonomies, surfacing eight previously undocumented illicit categories such as usury and illegal immigration. (3) Cross-platform generalization: Deployed on 200,000 real-world samples from search engines and Twitter without adaptation, ICL achieves 92.6% accuracy. Furthermore, 61.8% of its uniquely flagged samples correspond to borderline or obfuscated content missed by existing detectors. Our findings position ICL as a new paradigm for content moderation, combining the precision of specialized classifiers with cross-platform generalization and autonomous threat discovery. By shifting to inference-time reasoning, ICL offers a path toward proactively adaptive moderation systems.

Title: Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

Authors: Zhen Zou, Xiaoxiao Ma, Mingde Yao, Jie Huang, LinJiang Huang, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28049
Pdf URL: https://arxiv.org/pdf/2603.28049
Copy Paste: [[2603.28049]] Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting(https://arxiv.org/abs/2603.28049)
Keywords: diffusion
Abstract: Autoregressive (AR)-Diffusion hybrid paradigms combine AR's structured semantic modeling with diffusion's high-fidelity synthesis, yet suffer from a dual speed bottleneck: the sequential AR stage and the iterative multi-step denoising of the diffusion vision decode stage. Existing methods address each in isolation without a unified principle design. We observe that the per-position \emph{prediction entropy} of continuous-space AR models naturally encodes spatially varying generation uncertainty, which simultaneously governing draft prediction quality in the AR stage and reflecting the corrective effort required by vision decoding stage, which is not fully explored before. Since entropy is inherently tied to both bottlenecks, it serves as a natural unifying signal for joint acceleration. In this work, we propose \textbf{Drift-AR}, which leverages entropy signal to accelerate both stages: 1) for AR acceleration, we introduce Entropy-Informed Speculative Decoding that align draft--target entropy distributions via a causal-normalized entropy loss, resolving the entropy mismatch that causes excessive draft rejection; 2) for visual decoder acceleration, we reinterpret entropy as the \emph{physical variance} of the initial state for an anti-symmetric drifting field -- high-entropy positions activate stronger drift toward the data manifold while low-entropy positions yield vanishing drift -- enabling single-step (1-NFE) decoding without iterative denoising or distillation. Moreover, both stages share the same entropy signal, which is computed once with no extra cost. Experiments on MAR, TransDiff, and NextStep-1 demonstrate 3.8--5.5$\times$ speedup with genuine 1-NFE decoding, matching or surpassing original quality. Code will be available at this https URL.

Title: Physics-Embedded Feature Learning for AI in Medical Imaging

Authors: Pulock Das, Al Amin, Kamrul Hasan, Rohan Thompson, Azubike D. Okpalaeze, Liang Hong
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2603.28057
Pdf URL: https://arxiv.org/pdf/2603.28057
Copy Paste: [[2603.28057]] Physics-Embedded Feature Learning for AI in Medical Imaging(https://arxiv.org/abs/2603.28057)
Keywords: diffusion
Abstract: Deep learning (DL) models have achieved strong performance in an intelligence healthcare setting, yet most existing approaches operate as black boxes and ignore the physical processes that govern tumor growth, limiting interpretability, robustness, and clinical trust. To address this limitation, we propose PhysNet, a physics-embedded DL framework that integrates tumor growth dynamics directly into the feature learning process of a convolutional neural network (CNN). Unlike conventional physics-informed methods that impose physical constraints only at the output level, PhysNet embeds a reaction diffusion model of tumor growth within intermediate feature representations of a ResNet backbone. The architecture jointly performs multi-class tumor classification while learning a latent tumor density field, its temporal evolution, and biologically meaningful physical parameters, including tumor diffusion and growth rates, through end-to-end training. This design is necessary because purely data-driven models, even when highly accurate or ensemble-based, cannot guarantee physically consistent predictions or provide insight into tumor behavior. Experimental results on a large brain MRI dataset demonstrate that PhysNet outperforms multiple state-of-the-art DL baselines, including MobileNetV2, VGG16, VGG19, and ensemble models, achieving superior classification accuracy and F1-score. In addition to improved performance, PhysNet produces interpretable latent representations and learned bio-physical parameters that align with established medical knowledge, highlighting physics-embedded representation learning as a practical pathway toward more trustworthy and clinically meaningful medical AI systems.

Title: From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing

Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28067
Pdf URL: https://arxiv.org/pdf/2603.28067
Copy Paste: [[2603.28067]] From Vessel Trajectories to Safety-Critical Encounter Scenarios: A Generative AI Framework for Autonomous Ship Digital Testing(https://arxiv.org/abs/2603.28067)
Keywords: generative
Abstract: Digital testing has emerged as a key paradigm for the development and verification of autonomous maritime navigation systems, yet the availability of realistic and diverse safety-critical encounter scenarios remains limited. Existing approaches either rely on handcrafted templates, which lack realism, or extract cases directly from historical data, which cannot systematically expand rare high-risk situations. This paper proposes a data-driven framework that converts large-scale Automatic Identification System (AIS) trajectories into structured safety-critical encounter scenarios. The framework combines generative trajectory modeling with automated encounter pairing and temporal parameterization to enable scalable scenario construction while preserving real traffic characteristics. To enhance trajectory realism and robustness under noisy AIS observations, a multi-scale temporal variational autoencoder is introduced to capture vessel motion dynamics across different temporal resolutions. Experiments on real-world maritime traffic flows demonstrate that the proposed method improves trajectory fidelity and smoothness, maintains statistical consistency with observed data, and enables the generation of diverse safety-critical encounter scenarios beyond those directly recorded. The resulting framework provides a practical pathway for building scenario libraries to support digital testing, benchmarking, and safety assessment of autonomous navigation and intelligent maritime traffic management systems. Code is available at this https URL.

Title: GEMS: Agent-Native Multimodal Generation with Memory and Skills

Authors: Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28088
Pdf URL: https://arxiv.org/pdf/2603.28088
Copy Paste: [[2603.28088]] GEMS: Agent-Native Multimodal Generation with Memory and Skills(https://arxiv.org/abs/2603.28088)
Keywords: generative
Abstract: Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose \textbf{GEMS} (Agent-Native Multimodal \textbf{GE}neration with \textbf{M}emory and \textbf{S}kills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.

Title: To View Transform or Not to View Transform: NeRF-based Pre-training Perspective

Authors: Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28090
Pdf URL: https://arxiv.org/pdf/2603.28090
Copy Paste: [[2603.28090]] To View Transform or Not to View Transform: NeRF-based Pre-training Perspective(https://arxiv.org/abs/2603.28090)
Keywords: self-supervised
Abstract: Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, which enhances 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pretraining to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation inherits conflicting priors; view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network for pre-training is discarded during downstream tasks, resulting in inefficient utilization of enhanced 3D representations through NeRF. In this paper, we propose a novel NeRF-Resembled Point-based 3D detector that can learn continuous 3D representation and thus avoid the misaligned priors from view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the tasks, inheriting the principle of continuous 3D representation learning and leading to greater potentials for both scene reconstruction and detection tasks. Experiments on nuScenes dataset demonstrate that our proposed approach significantly improves previous state-of-the-art methods, outperforming not only pretext scene reconstruction tasks but also downstream detection tasks.

Title: Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention

Authors: Seunghun Oh, Unsang Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28114
Pdf URL: https://arxiv.org/pdf/2603.28114
Copy Paste: [[2603.28114]] Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention(https://arxiv.org/abs/2603.28114)
Keywords: diffusion
Abstract: Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.

Title: SVGS: Single-View to 3D Object Editing via Gaussian Splatting

Authors: Pengcheng Xue, Yan Tian, Qiutao Song, Ziyi Wang, Linyang He, Weiping Ding, Mahmoud Hassaballah, Karen Egiazarian, Wei-Fa Yang, Leszek Rutkowski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28126
Pdf URL: https://arxiv.org/pdf/2603.28126
Copy Paste: [[2603.28126]] SVGS: Single-View to 3D Object Editing via Gaussian Splatting(https://arxiv.org/abs/2603.28126)
Keywords: diffusion
Abstract: Text-driven 3D scene editing has attracted considerable interest due to its convenience and user-friendliness. However, methods that rely on implicit 3D representations, such as Neural Radiance Fields (NeRF), while effective in rendering complex scenes, are hindered by slow processing speeds and limited control over specific regions of the scene. Moreover, existing approaches, including Instruct-NeRF2NeRF and GaussianEditor, which utilize multi-view editing strategies, frequently produce inconsistent results across different views when executing text instructions. This inconsistency can adversely affect the overall performance of the model, complicating the task of balancing the consistency of editing results with editing efficiency. To address these challenges, we propose a novel method termed Single-View to 3D Object Editing via Gaussian Splatting (SVGS), which is a single-view text-driven editing technique based on 3D Gaussian Splatting (3DGS). Specifically, in response to text instructions, we introduce a single-view editing strategy grounded in multi-view diffusion models, which reconstructs 3D scenes by leveraging only those views that yield consistent editing results. Additionally, we employ sparse 3D Gaussian Splatting as the 3D representation, which significantly enhances editing efficiency. We conducted a comparative analysis of SVGS against existing baseline methods across various scene settings, and the results indicate that SVGS outperforms its counterparts in both editing capability and processing speed, representing a significant advancement in 3D editing technology. For further details, please visit our project page at: this https URL.

Title: RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation

Authors: Chanseul Cho, Seokju Yun, Jeaseong Jeon, Seungjae Moon, Youngmin Ro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28142
Pdf URL: https://arxiv.org/pdf/2603.28142
Copy Paste: [[2603.28142]] RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation(https://arxiv.org/abs/2603.28142)
Keywords: foundation model
Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi-domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under-explored, with many existing methods focusing primarily on preserving pre-trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank-Revealing QR Decomposition (RRQR) to systematically exploit VFM's subspace structures and enhance LoRA's representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter's strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre-trained subspace structures through RRQR-based initialization leads to superior domain generalization performance. RecycleLoRA achieves state-of-the-art performance on both synthetic-to-real generalization and real-to-real generalization tasks without complex architectures or additional inference latency.

Title: ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models

Authors: Yuhuan Xie, Aoxuan Pan, Yi-Hua Huang, Chirui Chang, Peng Dai, Xin Yu, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28152
Pdf URL: https://arxiv.org/pdf/2603.28152
Copy Paste: [[2603.28152]] ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models(https://arxiv.org/abs/2603.28152)
Keywords: diffusion
Abstract: Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.

Title: ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

Authors: Bingchen Li, Zhixin Wang, Fan Li, Jiaqi Xu, Jiaming Guo, Renjing Pei, Xin Li, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28162
Pdf URL: https://arxiv.org/pdf/2603.28162
Copy Paste: [[2603.28162]] ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization(https://arxiv.org/abs/2603.28162)
Keywords: diffusion, generative
Abstract: Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

Title: ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining

Authors: Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28178
Pdf URL: https://arxiv.org/pdf/2603.28178
Copy Paste: [[2603.28178]] ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining(https://arxiv.org/abs/2603.28178)
Keywords: self-supervised
Abstract: 3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pre-training. The former relies heavily on predicate annotations, while the latter's predicate learning may be bypassed due to strong object priors. Consequently, they could not often provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose a Topological Layout Learning (ToLL) for 3DSG pretraining framework. In detail, we design an Anchor-Conditioned Topological Geometry Reasoning, with a GNN to recover the global layout of zero-centered subgraphs by the spatial priors from sparse anchors. This process is strictly modulated by predicate features, thereby enforcing the predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption, and enhancing representations via self-distillation. The extensive experiments on 3DSSG dataset demonstrate that our ToLL could improve representation quality, outperforming state-of-the-art baselines.

Title: \textit{Versteasch du mi?} Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language

Authors: Verena Platzgummer, John McCrae, Sina Ahmadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.28213
Pdf URL: https://arxiv.org/pdf/2603.28213
Copy Paste: [[2603.28213]] \textit{Versteasch du mi?} Computational and Socio-Linguistic Perspectives on GenAI, LLMs, and Non-Standard Language(https://arxiv.org/abs/2603.28213)
Keywords: generative
Abstract: The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.

Title: Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal

Authors: Kazuma Ikeda, Ryosei Hara, Rokuto Nagata, Ozora Sako.Zihao Ding, Takahiro Kado, Ibuki Fujioka, Taro Beppu, Mariko Isogawa, Kentaro Yoshioka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28224
Pdf URL: https://arxiv.org/pdf/2603.28224
Copy Paste: [[2603.28224]] Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal(https://arxiv.org/abs/2603.28224)
Keywords: self-supervised
Abstract: LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghosts), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal relies on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100x larger than existing annotated FWL datasets. Benefiting from this large-scale dataset, we establish a FWL-based baseline model for ghost detection and propose FWL-MAE, a masked autoencoder for efficient self-supervised representation learning on FWL data. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhances downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50x false positive reduction). The dataset and code is publicly available and can be accessed via the project page: this https URL

Title: Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring

Authors: Rahul Jaiswal, Joakim Hellum, Halvor Heiberg
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28225
Pdf URL: https://arxiv.org/pdf/2603.28225
Copy Paste: [[2603.28225]] Detecting the Unexpected: AI-Driven Anomaly Detection in Smart Bridge Monitoring(https://arxiv.org/abs/2603.28225)
Keywords: anomaly
Abstract: Bridges are critical components of national infrastructure and smart cities. Therefore, smart bridge monitoring is essential for ensuring public safety and preventing catastrophic failures or accidents. Traditional bridge monitoring methods rely heavily on human visual inspections, which are time-consuming and prone to subjectivity and error. This paper proposes an artificial intelligence (AI)-driven anomaly detection approach for smart bridge monitoring. Specifically, a simple machine learning (ML) model is developed using real-time sensor data collected by the iBridge sensor devices installed on a bridge in Norway. The proposed model is evaluated against different ML models. Experimental results demonstrate that the density-based spatial clustering of applications with noise (DBSCAN)-based model outperforms in accurately detecting the anomalous events (bridge accident). These findings indicate that the proposed model is well-suited for smart bridge monitoring and can enhance public safety by enabling the timely detection of unforeseen incidents.

Title: DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Authors: Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28251
Pdf URL: https://arxiv.org/pdf/2603.28251
Copy Paste: [[2603.28251]] DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning(https://arxiv.org/abs/2603.28251)
Keywords: diffusion
Abstract: Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.

Title: MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations

Authors: Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28253
Pdf URL: https://arxiv.org/pdf/2603.28253
Copy Paste: [[2603.28253]] MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations(https://arxiv.org/abs/2603.28253)
Keywords: diffusion
Abstract: Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10 to a certain degree.

Title: FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks

Authors: Gnankan Landry Regis N'guessan
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2603.28288
Pdf URL: https://arxiv.org/pdf/2603.28288
Copy Paste: [[2603.28288]] FI-KAN: Fractal Interpolation Kolmogorov-Arnold Networks(https://arxiv.org/abs/2603.28288)
Keywords: diffusion
Abstract: Kolmogorov-Arnold Networks (KAN) employ B-spline bases on a fixed grid, providing no intrinsic multi-scale decomposition for non-smooth function approximation. We introduce Fractal Interpolation KAN (FI-KAN), which incorporates learnable fractal interpolation function (FIF) bases from iterated function system (IFS) theory into KAN. Two variants are presented: Pure FI-KAN (Barnsley, 1986) replaces B-splines entirely with FIF bases; Hybrid FI-KAN (Navascues, 2005) retains the B-spline path and adds a learnable fractal correction. The IFS contraction parameters give each edge a differentiable fractal dimension that adapts to target regularity during training. On a Holder regularity benchmark ($\alpha \in [0.2, 2.0]$), Hybrid FI-KAN outperforms KAN at every regularity level (1.3x to 33x). On fractal targets, FI-KAN achieves up to 6.3x MSE reduction over KAN, maintaining 4.7x advantage at 5 dB SNR. On non-smooth PDE solutions (scikit-fem), Hybrid FI-KAN achieves up to 79x improvement on rough-coefficient diffusion and 3.5x on L-shaped domain corner singularities. Pure FI-KAN's complementary behavior, dominating on rough targets while underperforming on smooth ones, provides controlled evidence that basis geometry must match target regularity. A fractal dimension regularizer provides interpretable complexity control whose learned values recover the true fractal dimension of each target. These results establish regularity-matched basis design as a principled strategy for neural function approximation.

Title: DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis

Authors: Kun Tang, Xinquan Yang, Mianjie Zheng, Xuefen Liu, Xuguang Li, Xiaoqi Guo, Ruihan Chen, Linlin Shen, He Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28297
Pdf URL: https://arxiv.org/pdf/2603.28297
Copy Paste: [[2603.28297]] DinoDental: Benchmarking DINOv3 as a Unified Vision Encoder for Dental Image Analysis(https://arxiv.org/abs/2603.28297)
Keywords: self-supervised, foundation model
Abstract: The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state-of-the-art, self-supervised vision foundation model pre-trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off-the-shelf encoder for comprehensive dental image analysis without requiring domain-specific pre-training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model's transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine-tuning, and the parameter-efficient Low-Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary-sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.

Title: NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information

Authors: Qing Qing, Huafei Huang, Mingliang Hou, Renqiang Luo, Mohsen Guizani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28300
Pdf URL: https://arxiv.org/pdf/2603.28300
Copy Paste: [[2603.28300]] NeiGAD: Augmenting Graph Anomaly Detection via Spectral Neighbor Information(https://arxiv.org/abs/2603.28300)
Keywords: anomaly
Abstract: Graph anomaly detection (GAD) aims to identify irregular nodes or structures in attributed graphs. Neighbor information, which reflects both structural connectivity and attribute consistency with surrounding nodes, is essential for distinguishing anomalies from normal patterns. Although recent graph neural network (GNN)-based methods incorporate such information through message passing, they often fail to explicitly model its effect or interaction with attributes, limiting detection performance. This work introduces NeiGAD, a novel plug-and-play module that captures neighbor information through spectral graph analysis. Theoretical insights demonstrate that eigenvectors of the adjacency matrix encode local neighbor interactions and progressively amplify anomaly signals. Based on this, NeiGAD selects a compact set of eigenvectors to construct efficient and discriminative representations. Experiments on eight real-world datasets show that NeiGAD consistently improves detection accuracy and outperforms state-of-the-art GAD methods. These results demonstrate the importance of explicit neighbor modeling and the effectiveness of spectral analysis in anomaly detection. Code is available at: this https URL.

Title: Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data

Authors: Yuanqiao Zhang, Tiantian He, Yuan Gao, Yixin Wang, Yew-Soon Ong, Maoguo Gong, A.K. Qin, Hui Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28316
Pdf URL: https://arxiv.org/pdf/2603.28316
Copy Paste: [[2603.28316]] Taming the Instability: A Robust Second-Order Optimizer for Federated Learning over Non-IID Data(https://arxiv.org/abs/2603.28316)
Keywords: anomaly
Abstract: In this paper, we present Federated Robust Curvature Optimization (FedRCO), a novel second-order optimization framework designed to improve convergence speed and reduce communication cost in Federated Learning systems under statistical heterogeneity. Existing second-order optimization methods are often computationally expensive and numerically unstable in distributed settings. In contrast, FedRCO addresses these challenges by integrating an efficient approximate curvature optimizer with a provable stability mechanism. Specifically, FedRCO incorporates three key components: (1) a Gradient Anomaly Monitor that detects and mitigates exploding gradients in real-time, (2) a Fail-Safe Resilience protocol that resets optimization states upon numerical instability, and (3) a Curvature-Preserving Adaptive Aggregation strategy that safely integrates global knowledge without erasing the local curvature geometry. Theoretical analysis shows that FedRCO can effectively mitigate instability and prevent unbounded updates while preserving optimization efficiency. Extensive experiments show that FedRCO achieves superior robustness against diverse non-IID scenarios while achieving higher accuracy and faster convergence than both state-of-the-art first-order and second-order methods.

Title: Integrating Multimodal Large Language Model Knowledge into Amodal Completion

Authors: Heecheol Yun, Eunho Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28333
Pdf URL: https://arxiv.org/pdf/2603.28333
Copy Paste: [[2603.28333]] Integrating Multimodal Large Language Model Knowledge into Amodal Completion(https://arxiv.org/abs/2603.28333)
Keywords: generative
Abstract: With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates these guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

Title: AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation

Authors: Milton Zhou, Sizhong Qin, Yongzhi Li, Quan Chen, Peng Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28366
Pdf URL: https://arxiv.org/pdf/2603.28366
Copy Paste: [[2603.28366]] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation(https://arxiv.org/abs/2603.28366)
Keywords: foundation model
Abstract: Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation. However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end advertisement video editing framework based on multimodal discretization and controllable editing. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video-audio-text token space. Built upon a foundation model, we further develop a multimodal large language model for video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts the predicted token sequences into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable video creation.

Title: Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models

Authors: Tao Xia, Jiawei Liu, Yukun Zhang, Ting Liu, Wei Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28367
Pdf URL: https://arxiv.org/pdf/2603.28367
Copy Paste: [[2603.28367]] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models(https://arxiv.org/abs/2603.28367)
Keywords: diffusion, generative
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising family of generative models, enabling a wide range of downstream vision tasks such as text-guided image editing. By shifting the editing paradigm from noise manipulation in diffusion-based methods to token-level operations, VAR-based approaches achieve better background preservation and significantly faster inference. However, existing VAR-based editing methods still face two key challenges: accurately localizing editable tokens and maintaining structural consistency in the edited results. In this work, we propose a novel text-guided image editing framework rooted in an analysis of intermediate feature distributions within VAR models. First, we introduce a coarse-to-fine token localization strategy that can refine editable regions, balancing editing fidelity and background preservation. Second, we analyze the intermediate representations of VAR models and identify structure-related features, by which we design a simple yet effective feature injection mechanism to enhance structural consistency between the edited and source images. Third, we develop a reinforcement learning-based adaptive feature injection scheme that automatically learns scale- and layer-specific injection ratios to jointly optimize editing fidelity and structure preservation. Extensive experiments demonstrate that our method achieves superior structural consistency and editing quality compared with state-of-the-art approaches, across both local and global editing scenarios.

Title: EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation

Authors: Sravanth Kodavanti, Manjunath Arveti, Sowmya Vajrala, Srinivas Miriyala, Vikram N R
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28405
Pdf URL: https://arxiv.org/pdf/2603.28405
Copy Paste: [[2603.28405]] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation(https://arxiv.org/abs/2603.28405)
Keywords: diffusion, foundation model, generative
Abstract: Diffusion Transformers (DiT) have established a new state-of-the-art in high-fidelity image synthesis; however, their massive computational complexity and memory requirements hinder local deployment on resource-constrained edge devices. In this paper, we introduce EdgeDiT, a family of hardware-efficient generative transformers specifically engineered for mobile Neural Processing Units (NPUs), such as the Qualcomm Hexagon and Apple Neural Engine (ANE). By leveraging a hardware-aware optimization framework, we systematically identify and prune structural redundancies within the DiT backbone that are particularly taxing for mobile data-flows. Our approach yields a series of lightweight models that achieve a 20-30% reduction in parameters, a 36-46% decrease in FLOPs, and a 1.65-fold reduction in on-device latency without sacrificing the scaling advantages or the expressive capacity of the original transformer architecture. Extensive benchmarking demonstrates that EdgeDiT offers a superior Pareto-optimal trade-off between Frechet Inception Distance (FID) and inference latency compared to both optimized mobile U-Nets and vanilla DiT variants. By enabling responsive, private, and offline generative AI directly on-device, EdgeDiT provides a scalable blueprint for transitioning large-scale foundation models from high-end GPUs to the palm of the user.

Title: Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

Authors: Alkis Sygkounas, Amy Loutfi, Andreas Persson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.28416
Pdf URL: https://arxiv.org/pdf/2603.28416
Copy Paste: [[2603.28416]] Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models(https://arxiv.org/abs/2603.28416)
Keywords: generative
Abstract: Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor--critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.

Title: $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation

Authors: Linqian Fan, Peiqin Sun, Tiancheng Wen, Shun Lu, Chengru Song
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28460
Pdf URL: https://arxiv.org/pdf/2603.28460
Copy Paste: [[2603.28460]] $R_{dm}$: Re-conceptualizing Distribution Matching as a Reward for Diffusion Distillation(https://arxiv.org/abs/2603.28460)
Keywords: diffusion, generative
Abstract: Diffusion models achieve state-of-the-art generative performance but are fundamentally bottlenecked by their slow iterative sampling process. While diffusion distillation techniques enable high-fidelity few-step generation, traditional objectives often restrict the student's performance by anchoring it solely to the teacher. Recent approaches have attempted to break this ceiling by integrating Reinforcement Learning (RL), typically through a simple summation of distillation and RL objectives. In this work, we propose a novel paradigm by reconceptualizing distribution matching as a reward, denoted as $R_{dm}$. This unified perspective bridges the algorithmic gap between Diffusion Matching Distillation (DMD) and RL, providing several key benefits. (1) Enhanced optimization stability: we introduce Group Normalized Distribution Matching (GNDM), which adapts standard RL group normalization to stabilize $R_{dm}$ estimation. By leveraging group-mean statistics, GNDM establishes a more robust and effective optimization direction. (2) Seamless reward integration: our reward-centric formulation inherently supports adaptive weighting mechanisms, allowing flexible combination of DMD with external reward models. (3) Improved sampling efficiency: by aligning with RL principles, the framework readily incorporates importance sampling (IS), leading to a significant boost in sampling efficiency. Extensive experiments demonstrate that GNDM outperforms vanilla DMD, reducing the FID by 1.87. Furthermore, our multi-reward variant, GNDMR, surpasses existing baselines by achieving a strong balance between aesthetic quality and fidelity, reaching a peak HPS of 30.37 and a low FID-SD of 12.21. Overall, $R_{dm}$ provides a flexible, stable, and efficient framework for real-time high-fidelity synthesis. Code will be released upon publication.

Title: INSID3: Training-Free In-Context Segmentation with DINOv3

Authors: Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, Stefan Roth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28480
Pdf URL: https://arxiv.org/pdf/2603.28480
Copy Paste: [[2603.28480]] INSID3: Training-Free In-Context Segmentation with DINOv3(https://arxiv.org/abs/2603.28480)
Keywords: self-supervised, foundation model, in-context
Abstract: In-context segmentation (ICS) aims to segment arbitrary concepts, e.g., objects, parts, or personalized instances, given one annotated visual examples. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but harms generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concepts at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +7.5 % mIoU, while using 3x fewer parameters and without any mask or category-level supervision. Code is available at this https URL .

Title: ConceptWeaver: Weaving Disentangled Concepts with Flow

Authors: Jintao Chen, Aiming Hao, Xiaoqing Chen, Chengyu Bai, Chubin Chen, Yanxun Li, Jiahong Wu, Xiangxiang Chu, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28493
Pdf URL: https://arxiv.org/pdf/2603.28493
Copy Paste: [[2603.28493]] ConceptWeaver: Weaving Disentangled Concepts with Flow(https://arxiv.org/abs/2603.28493)
Keywords: generative
Abstract: Pre-trained flow-based models excel at synthesizing complex scenes yet lack a direct mechanism for disentangling and customizing their underlying concepts from one-shot real-world sources. To demystify this process, we first introduce a novel differential probing technique to isolate and analyze the influence of individual concept tokens on the velocity field over time. This investigation yields a critical insight: the generative process is not monolithic but unfolds in three distinct stages. An initial \textbf{Blueprint Stage} establishes low-frequency structure, followed by a pivotal \textbf{Instantiation Stage} where content concepts emerge with peak intensity and become naturally disentangled, creating an optimal window for manipulation. A final concept-insensitive refinement stage then synthesizes fine-grained details. Guided by this discovery, we propose \textbf{ConceptWeaver}, a framework for one-shot concept disentanglement. ConceptWeaver learns concept-specific semantic offsets from a single reference image using a stage-aware optimization strategy that aligns with the three-stage framework. These learned offsets are then deployed during inference via our novel ConceptWeaver Guidance (CWG) mechanism, which strategically injects them at the appropriate generative stage. Extensive experiments validate that ConceptWeaver enables high-fidelity, compositional synthesis and editing, demonstrating that understanding and leveraging the intrinsic, staged nature of flow models is key to unlocking precise, multi-granularity content manipulation.

Title: Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree

Authors: Fei Wu, Guanghao Ding, Zijian Niu, Zhenrui Wang, Lei Yang, Zhuosheng Zhang, Shilin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28508
Pdf URL: https://arxiv.org/pdf/2603.28508
Copy Paste: [[2603.28508]] Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree(https://arxiv.org/abs/2603.28508)
Keywords: generative
Abstract: The malicious use and widespread dissemination of AI-generated images pose a serious threat to the authenticity of digital content. Existing detection methods exploit low-level artifacts left by common manipulation steps within the generation pipeline, but they often lack generalization due to model-specific overfitting. Recently, researchers have resorted to Multimodal Large Language Models (MLLMs) for AIGC detection, leveraging their high-level semantic reasoning and broad generalization capabilities. While promising, MLLMs lack the fine-grained perceptual sensitivity to subtle generation artifacts, making them inadequate as standalone detectors. To address this issue, we propose a novel AI-generated image detection framework that synergistically integrates lightweight artifact-aware detectors with MLLMs via a fuzzy decision tree. The decision tree treats the outputs of basic detectors as fuzzy membership values, enabling adaptive fusion of complementary cues from semantic and perceptual perspectives. Extensive experiments demonstrate that the proposed method achieves state-of-the-art accuracy and strong generalization across diverse generative models.

Title: Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

Authors: Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2603.28532
Pdf URL: https://arxiv.org/pdf/2603.28532
Copy Paste: [[2603.28532]] Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework(https://arxiv.org/abs/2603.28532)
Keywords: foundation model
Abstract: Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

Title: "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents

Authors: Zifan Peng
Subjects: cs.CR, cs.ET, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2603.28551
Pdf URL: https://arxiv.org/pdf/2603.28551
Copy Paste: [[2603.28551]] "What Did It Actually Do?": Understanding Risk Awareness and Traceability for Computer-Use Agents(https://arxiv.org/abs/2603.28551)
Keywords: anomaly
Abstract: Personalized computer-use agents are rapidly moving from expert communities into mainstream use. Unlike conventional chatbots, these systems can install skills, invoke tools, access private resources, and modify local environments on users' behalf. Yet users often do not know what authority they have delegated, what the agent actually did during task execution, or whether the system has been safely removed afterward. We investigate this gap as a combined problem of risk understanding and post-hoc auditability, using OpenClaw as a motivating case. We first build a multi-source corpus of the OpenClaw ecosystem, including incidents, advisories, malicious-skill reports, news coverage, tutorials, and social-media narratives. We then conduct an interview study to examine how users and practitioners understand skills, autonomy, privilege, persistence, and uninstallation. Our findings suggest that participants often recognized these systems as risky in the abstract, but lacked concrete mental models of what skills can do, what resources agents can access, and what changes may remain after execution or removal. Motivated by these findings, we propose AgentTrace, a traceability framework and prototype interface for visualizing agent actions, touched resources, permission history, provenance, and persistent side effects. A scenario-based evaluation suggests that traceability-oriented interfaces can improve understanding of agent behavior, support anomaly detection, and foster more calibrated trust.

Title: Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation

Authors: Yoann Boget, Alexandros Kalousis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28572
Pdf URL: https://arxiv.org/pdf/2603.28572
Copy Paste: [[2603.28572]] Unrestrained Simplex Denoising for Discrete Data. A Non-Markovian Approach Applied to Graph Generation(https://arxiv.org/abs/2603.28572)
Keywords: diffusion, generative
Abstract: Denoising models such as Diffusion or Flow Matching have recently advanced generative modeling for discrete structures, yet most approaches either operate directly in the discrete state space, causing abrupt state changes. We introduce simplex denoising, a simple yet effective generative framework that operates on the probability simplex. The key idea is a non-Markovian noising scheme in which, for a given clean data point, noisy representations at different times are conditionally independent. While preserving the theoretical guarantees of denoising-based generative models, our method removes unnecessary constraints, thereby improving performance and simplifying the formulation. Empirically, \emph{unrestrained simplex denoising} surpasses strong discrete diffusion and flow-matching baselines across synthetic and real-world graph benchmarks. These results highlight the probability simplex as an effective framework for discrete generative modeling.

Title: ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection

Authors: Haojing Chen, Yutong Li, Zhihang Liu, Tao Tan, Haoyu Bian, Qiuju Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28584
Pdf URL: https://arxiv.org/pdf/2603.28584
Copy Paste: [[2603.28584]] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection(https://arxiv.org/abs/2603.28584)
Keywords: diffusion, generative
Abstract: Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) remains challenging due to complex backgrounds, low contrast, irregular object shapes, and large variations in object scale. Existing discriminative methods directly regress saliency maps, while recent diffusion-based generative approaches suffer from stochastic sampling and high computational cost. In this paper, we propose ORSIFlow, a saliency-guided rectified flow framework that reformulates ORSI-SOD as a deterministic latent flow generation problem. ORSIFlow performs saliency mask generation in a compact latent space constructed by a frozen variational autoencoder, enabling efficient inference with only a few steps. To enhance saliency awareness, we design a Salient Feature Discriminator for global semantic discrimination and a Salient Feature Calibrator for precise boundary refinement. Extensive experiments on multiple public benchmarks show that ORSIFlow achieves state-of-the-art performance with significantly improved efficiency. Codes are available at: this https URL.

Title: Unsafe2Safe: Controllable Image Anonymization for Downstream Utility

Authors: Mih Dinh, SouYoung Jin
Subjects: cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28605
Pdf URL: https://arxiv.org/pdf/2603.28605
Copy Paste: [[2603.28605]] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility(https://arxiv.org/abs/2603.28605)
Keywords: diffusion
Abstract: Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision-language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy, and Utility dimensions. Across MS-COCO, Caltech101, and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.

Title: TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark

Authors: Hannes Mareen, Dimitrios Karageorgiou, Paschalis Giakoumoglou, Peter Lambert, Symeon Papadopoulos, Glenn Van Wallendael
Subjects: cs.CV, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2603.28613
Pdf URL: https://arxiv.org/pdf/2603.28613
Copy Paste: [[2603.28613]] TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark(https://arxiv.org/abs/2603.28613)
Keywords: generative
Abstract: Generative AI has made text-guided inpainting a powerful image editing tool, but at the same time a growing challenge for media forensics. Existing benchmarks, including our text-guided inpainting forgery (TGIF) dataset, show that image forgery localization (IFL) methods can localize manipulations in spliced images but struggle not in fully regenerated (FR) images, while synthetic image detection (SID) methods can detect fully regenerated images but cannot perform localization. With new generative inpainting models emerging and the open problem of localization in FR images remaining, updated datasets and benchmarks are needed. We introduce TGIF2, an extended version of TGIF, that captures recent advances in text-guided inpainting and enables a deeper analysis of forensic robustness. TGIF2 augments the original dataset with edits generated by FLUX.1 models, as well as with random non-semantic masks. Using the TGIF2 dataset, we conduct a forensic evaluation spanning IFL and SID, including fine-tuning IFL methods on FR images and generative super-resolution attacks. Our experiments show that both IFL and SID methods degrade on FLUX.1 manipulations, highlighting limited generalization. Additionally, while fine-tuning improves localization on FR images, evaluation with random non-semantic masks reveals object bias. Furthermore, generative super-resolution significantly weakens forensic traces, demonstrating that common image enhancement operations can undermine current forensic pipelines. In summary, TGIF2 provides an updated dataset and benchmark, which enables new insights into the challenges posed by modern inpainting and AI-based image enhancements. TGIF2 is available at this https URL.

Title: Interpretable Ensemble Learning for Network Traffic Anomaly Detection: A SHAP-based Explainable AI Framework for Embedded Systems Security

Authors: Wanru Shao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.28654
Pdf URL: https://arxiv.org/pdf/2603.28654
Copy Paste: [[2603.28654]] Interpretable Ensemble Learning for Network Traffic Anomaly Detection: A SHAP-based Explainable AI Framework for Embedded Systems Security(https://arxiv.org/abs/2603.28654)
Keywords: anomaly
Abstract: Network security threats in embedded systems pose significant challenges to critical infrastructure protection. This paper presents a comprehensive framework combining ensemble learning methods with explainable artificial intelligence (XAI) techniques for robust anomaly detection in network traffic. We evaluate multiple machine learning models including Random Forest, Gradient Boosting, Support Vector Machines, and ensemble methods on a real-world network traffic dataset containing 19 features derived from packet-level and frequency domain characteristics. Our experimental results demonstrate that ensemble methods achieve superior performance, with Random Forest attaining 90% accuracy and an AUC of 0.617 on validation data. Furthermore, we employ SHAP (SHapley Additive exPlanations) analysis to provide interpretable insights into model predictions, revealing that packet_count_5s,inter_arrival_time, and spectral_entropy are the most influential features for anomaly detection. The integration of XAI techniques enhances model trustworthiness and facilitates deployment in security-critical embedded systems where interpretability is paramount.

Title: Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries

Authors: Md Raz, Venkata Sai Charan Putrevu, Meet Udeshi, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.28655
Pdf URL: https://arxiv.org/pdf/2603.28655
Copy Paste: [[2603.28655]] Safeguarding LLMs Against Misuse and AI-Driven Malware Using Steganographic Canaries(https://arxiv.org/abs/2603.28655)
Keywords: generative
Abstract: AI-powered malware increasingly exploits cloud-hosted generative-AI services and large language models (LLMs) as analysis engines for reconnaissance and code generation. Simultaneously, enterprise uploads expose sensitive documents to third-party AI vendors. Both threats converge at the AI service ingestion boundary, yet existing defenses focus on endpoints and network perimeters, leaving organizations with limited visibility once plaintext reaches an LLM service. To address this, we present a framework based on steganographic canary files: realistic documents carrying cryptographically derived identifiers embedded via complementary encoding channels. A pre-ingestion filter extracts and verifies these identifiers before LLM processing, enabling passive, format-agnostic detection without semantic classification. We support two modes of operation where Mode A marks existing sensitive documents with layered symbolic encodings (whitespace substitution, zero-width character insertion, homoglyph substitution), while Mode B generates synthetic canary documents using linguistic steganography (arithmetic coding over GPT-2), augmented with compatible symbolic layers. We model increasing document pre-processing and adversarial capability for both modes via a four-tier transport-transform taxonomy: All methods achieve 100% identifier recovery under benign and sanitization workflows (Tiers 1-2). The hybrid Mode B maintains 97% through targeted adversarial transforms (Tier 3). An end-to-end case study against an LLM-orchestrated ransomware pipeline confirms that both modes detect and block canary-bearing uploads before file encryption begins. To our knowledge, this is the first framework to systematically combine symbolic and linguistic text steganography into layered canary documents for detecting unauthorized LLM processing, evaluated against a transport-threat taxonomy tailored to AI malware.

Title: Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure

Authors: Chao Yin, Hongzhe Yue, Qing Han, Difeng Hu, Zhenyu Liang, Fangzhou Lin, Bing Sun, Boyu Wang, Mingkai Li, Wei Yao, Jack C.P. Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28660
Pdf URL: https://arxiv.org/pdf/2603.28660
Copy Paste: [[2603.28660]] Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure(https://arxiv.org/abs/2603.28660)
Keywords: foundation model
Abstract: Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification--core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%--a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at this https URL.

Title: DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Authors: Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28713
Pdf URL: https://arxiv.org/pdf/2603.28713
Copy Paste: [[2603.28713]] DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing(https://arxiv.org/abs/2603.28713)
Keywords: diffusion, in-context
Abstract: Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves GenEval (0.72) for image generation and ImgEdit (4.11) for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising processing to just 4 steps, enabling our DreamLite could generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.

Title: Stepwise Credit Assignment for GRPO on Flow-Matching Models

Authors: Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.28718
Pdf URL: https://arxiv.org/pdf/2603.28718
Copy Paste: [[2603.28718]] Stepwise Credit Assignment for GRPO on Flow-Matching Models(https://arxiv.org/abs/2603.28718)
Keywords: diffusion
Abstract: Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.

Title: See it to Place it: Evolving Macro Placements with Vision-Language Models

Authors: Ikechukwu Uchendu, Swati Goel, Karly Hou, Ebrahim Songhori, Kuang-Huei Lee, Joe Wenjie Jiang, Vijay Janapa Reddi, Vincent Zhuang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.28733
Pdf URL: https://arxiv.org/pdf/2603.28733
Copy Paste: [[2603.28733]] See it to Place it: Evolving Macro Placements with Vision-Language Models(https://arxiv.org/abs/2603.28733)
Keywords: foundation model
Abstract: We propose using Vision-Language Models (VLMs) for macro placement in chip floorplanning, a complex optimization task that has recently shown promising advancements through machine learning methods. Because human designers rely heavily on spatial reasoning to arrange components on the chip canvas, we hypothesize that VLMs with strong visual reasoning abilities can effectively complement existing learning-based approaches. We introduce VeoPlace (Visual Evolutionary Optimization Placement), a novel framework that uses a VLM, without any fine-tuning, to guide the actions of a base placer by constraining them to subregions of the chip canvas. The VLM proposals are iteratively optimized through an evolutionary search strategy with respect to resulting placement quality. On open-source benchmarks, VeoPlace outperforms the best prior learning-based approach on 9 of 10 benchmarks with peak wirelength reductions exceeding 32%. We further demonstrate that VeoPlace generalizes to analytical placers, improving DREAMPlace performance on all 8 evaluated benchmarks with gains up to 4.3%. Our approach opens new possibilities for electronic design automation tools that leverage foundation models to solve complex physical design problems.

Title: On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Authors: Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.28762
Pdf URL: https://arxiv.org/pdf/2603.28762
Copy Paste: [[2603.28762]] On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers(https://arxiv.org/abs/2603.28762)
Keywords: diffusion, generative
Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.

Title: PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Authors: Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28763
Pdf URL: https://arxiv.org/pdf/2603.28763
Copy Paste: [[2603.28763]] PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models(https://arxiv.org/abs/2603.28763)
Keywords: diffusion
Abstract: Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.

Title: Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds

Authors: N Alex Cayco Gajic, Arthur Pellegrino
Subjects: cs.LG, cs.AI, math.DG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.28764
Pdf URL: https://arxiv.org/pdf/2603.28764
Copy Paste: [[2603.28764]] Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds(https://arxiv.org/abs/2603.28764)
Keywords: diffusion
Abstract: Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.

Title: HandX: Scaling Bimanual Motion and Interaction Generation

Authors: Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.28766
Pdf URL: https://arxiv.org/pdf/2603.28766
Copy Paste: [[2603.28766]] HandX: Scaling Bimanual Motion and Interaction Generation(https://arxiv.org/abs/2603.28766)
Keywords: diffusion
Abstract: Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior, finger articulation, contact timing, and inter-hand coordination, and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.