2026-03-04

Title: Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

Authors: Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Dai Tao, Shu-Tao Xia
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.02220
Pdf URL: https://arxiv.org/pdf/2603.02220
Copy Paste: [[2603.02220]] Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting(https://arxiv.org/abs/2603.02220)
Keywords: generative
Abstract: Time series forecasting (TSF) remains a challenging problem due to the intricate entanglement of intraperiod fluctuations and interperiod trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a continuous latent surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art performance.
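
Editor's sketch: to make the grid-boundary issue concrete, the snippet below folds a 1D series into a period-phase grid. It is not the TimeGS code; the function name `fold_into_periods` and the toy period of 4 are illustrative assumptions.

```python
def fold_into_periods(series, period):
    """Fold a 1D series into rows of one period each (row = period index, column = phase)."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    return [series[i:i + period] for i in range(0, len(series), period)]


series = list(range(12))  # toy series: the value at each step is its time index
grid = fold_into_periods(series, period=4)

# Time steps 3 and 4 are chronological neighbors, yet on the grid they sit
# at opposite ends of consecutive rows: (row 0, col 3) and (row 1, col 0).
last_of_row0 = grid[0][-1]
first_of_row1 = grid[1][0]
```

A spatial operator with a small window centered on (row 0, col 3) never sees (row 1, col 0), which is the kind of severed continuity the abstract describes.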

Title: Generalized Discrete Diffusion with Self-Correction

Authors: Linxuan Wang, Ziyi Wang, Yikun Bai, Wei Deng, Guang Lin, Qifan Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02230
Pdf URL: https://arxiv.org/pdf/2603.02230
Copy Paste: [[2603.02230]] Generalized Discrete Diffusion with Self-Correction(https://arxiv.org/abs/2603.02230)
Keywords: diffusion
Abstract: Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.

Title: CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Authors: Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02236
Pdf URL: https://arxiv.org/pdf/2603.02236
Copy Paste: [[2603.02236]] CUDABench: Benchmarking LLMs for Text-to-CUDA Generation(https://arxiv.org/abs/2603.02236)
Keywords: generative
Abstract: Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of text-to-CUDA generation. Furthermore, given the hardware-specific and performance-critical features of GPU programming, accurately assessing the performance of LLM-generated GPU programs is nontrivial. In this work, we introduce CUDABench, a comprehensive benchmark designed to evaluate the text-to-CUDA capabilities of LLMs. First, we construct CUDABench-Set, which covers a Breadth-Depth-Difficulty evaluation space across diverse application domains, including artificial intelligence, scientific computing, and data analytics. Furthermore, we propose CUDABench-Score and a Generative Verification Pipeline that assess (1) compilation correctness, (2) functional consistency through execution-based verification, and (3) a novel roofline-based metric, Performance-Score. Benchmarking state-of-the-art LLMs reveals insightful findings and challenges of text-to-CUDA, such as a notable mismatch between high compilation success rates and low functional correctness, a lack of domain-specific algorithmic knowledge, and suboptimal utilization of GPU hardware resources. Our benchmark is available at this https URL.
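
Editor's sketch: the abstract does not define Performance-Score, so the snippet below shows only the standard roofline bound that a roofline-based metric would plausibly build on. The function names and the hardware numbers are hypothetical.

```python
def roofline_bound(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s under the roofline model.

    arithmetic_intensity is in FLOP per byte moved; mem_bandwidth is in bytes/s.
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def fraction_of_roofline(achieved_flops, peak_flops, mem_bandwidth, arithmetic_intensity):
    """How close a kernel gets to its roofline bound (1.0 means it reaches the bound)."""
    return achieved_flops / roofline_bound(peak_flops, mem_bandwidth, arithmetic_intensity)


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 10e12
BANDWIDTH = 1e12

# A kernel doing 2 FLOP per byte is memory-bound: its ceiling is 2 TFLOP/s.
bound = roofline_bound(PEAK, BANDWIDTH, arithmetic_intensity=2.0)
score = fraction_of_roofline(1e12, PEAK, BANDWIDTH, arithmetic_intensity=2.0)
```

Normalizing by the roofline bound rather than by peak compute avoids penalizing kernels that are inherently memory-bound.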

Title: CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Authors: Zhihao Shi, Kejia Yin, Weilin Wan, Yuhongze Zhou, Yuanhao Yu, Xinxin Zuo, Qiang Sun, Juwei Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02256
Pdf URL: https://arxiv.org/pdf/2603.02256
Copy Paste: [[2603.02256]] CamDirector: Towards Long-Term Coherent Video Trajectory Editing(https://arxiv.org/abs/2603.02256)
Keywords: diffusion
Abstract: Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework with two components. First, it explicitly aggregates information across the entire source video via a hybrid warping scheme: static regions are progressively fused into a world cache and then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement. Second, it processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

Title: PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis

Authors: Jeet Bandhu Lahiri, Parshva Runwal, Arvasu Kulkarni, Mahir Jain, Aditya Ray Mishra, Siddharth Panwar, Sandeep Singh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02268
Pdf URL: https://arxiv.org/pdf/2603.02268
Copy Paste: [[2603.02268]] PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis(https://arxiv.org/abs/2603.02268)
Keywords: foundation model
Abstract: EEG foundation models are typically pretrained on narrow-source clinical archives and evaluated on benchmarks from the same ecosystem, leaving unclear whether representations encode neural physiology or recording-distribution artifacts. We introduce PRISM (Population Representative Invariant Signal Model), a masked autoencoder ablated along two axes -- pretraining population and downstream adaptation -- with architecture and preprocessing fixed. We compare a narrow-source EU/US corpus (TUH + PhysioNet) against a geographically diverse pool augmented with multi-center South Asian clinical recordings across multiple EEG systems. Three findings emerge. First, narrow-source pretraining yields stronger linear probes on distribution-matched benchmarks, while diverse pretraining produces more adaptable representations under fine-tuning -- a trade-off invisible under single-protocol evaluation. Trained on three source corpora, PRISM matches or outperforms REVE (92 datasets, 60,000+ hours) on the majority of tasks, demonstrating that targeted diversity can substitute for indiscriminate scale and that dataset count is a confounding variable in model comparison. Second, on a clinically challenging and previously untested task -- distinguishing epilepsy from diagnostic mimickers via interictal EEG -- the diverse checkpoint outperforms the narrow-source checkpoint by +12.3 pp balanced accuracy, the largest gap across all evaluations. Third, systematic inconsistencies between EEG-Bench and EEG-FM-Bench reverse model rankings on identical datasets by up to 24 pp; we identify six concrete sources including split construction, checkpoint selection, segment length, and normalization, showing these factors compound non-additively.

Title: Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network

Authors: Binon Teji, Subhajit Bandyopadhyay, Swarup Roy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02273
Pdf URL: https://arxiv.org/pdf/2603.02273
Copy Paste: [[2603.02273]] Graph Attention Based Prioritization of Disease Responsible Genes from Multimodal Alzheimer's Network(https://arxiv.org/abs/2603.02273)
Keywords: diffusion, generative
Abstract: Prioritizing disease-associated genes is central to understanding the molecular mechanisms of complex disorders such as Alzheimer's disease (AD). Traditional network-based approaches rely on static centrality measures and often fail to capture cross-modal biological heterogeneity. We propose NETRA (Node Evaluation through Transformer-based Representation and Attention), a multimodal graph transformer framework that replaces heuristic centrality metrics with attention-driven relevance scoring. Using AD as a case study, gene regulatory networks are independently constructed from microarray, single-cell RNA-seq, and single-nucleus RNA-seq data. Random-walk sequences derived from these networks are used to train a BERT-based model for learning global gene embeddings, while modality-specific gene expression profiles are compressed using variational autoencoders. These representations are integrated with auxiliary biological networks, including protein-protein interactions, Gene Ontology semantic similarity, and diffusion-based gene similarity, into a unified multimodal graph. A graph transformer assigns NETRA scores that quantify gene relevance in a disease-specific and context-aware manner. Gene set enrichment analysis shows that NETRA achieves a normalized enrichment score of about 3.9 for the Alzheimer's disease pathway, substantially outperforming classical centrality measures and diffusion models. Top-ranked genes enrich multiple neurodegenerative pathways, recover a known late-onset AD susceptibility locus at chr12q13, and reveal conserved cross-disease gene modules. The framework preserves biologically realistic heavy-tailed network topology and is readily extensible to other complex disorders.

Title: Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

Authors: Xiaoyu Luo, Wenrui Yu, Qiongxiu Li, Johannes Bjerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02333
Pdf URL: https://arxiv.org/pdf/2603.02333
Copy Paste: [[2603.02333]] Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects(https://arxiv.org/abs/2603.02333)
Keywords: diffusion
Abstract: Autoregressive language models (ARMs) have been shown to memorize and occasionally reproduce training data verbatim, raising concerns about privacy and copyright liability. Diffusion language models (DLMs) have recently emerged as a competitive alternative, yet their memorization behavior remains largely unexplored due to fundamental differences in generation dynamics. To address this gap, we present a systematic theoretical and empirical characterization of memorization in DLMs. We propose a generalized probabilistic extraction framework that unifies prefix-conditioned decoding and diffusion-based generation under arbitrary masking patterns and stochastic sampling trajectories. Theorem 4.3 establishes a monotonic relationship between sampling resolution and memorization: increasing resolution strictly increases the probability of exact training data extraction, implying that autoregressive decoding corresponds to a limiting case of diffusion-based generation in which the sampling resolution is set to its maximum. Extensive experiments across model scales and sampling strategies validate our theoretical predictions. Under aligned prefix-conditioned evaluations, we further demonstrate that DLMs exhibit substantially lower memorization-based leakage of personally identifiable information (PII) compared to ARMs.

Title: Preconditioned Score and Flow Matching

Authors: Shadab Ahamed, Eshed Gal, Simon Ghyselincks, Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.02337
Pdf URL: https://arxiv.org/pdf/2603.02337
Copy Paste: [[2603.02337]] Preconditioned Score and Flow Matching(https://arxiv.org/abs/2603.02337)
Keywords: diffusion, generative
Abstract: Flow matching and score-based diffusion train vector fields under intermediate distributions $p_t$, whose geometry can strongly affect their optimization. We show that the covariance $\Sigma_t$ of $p_t$ governs optimization bias: when $\Sigma_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while systematically under-optimizing low-variance modes, leading to learning that plateaus at suboptimal weights. We formalize this effect in analytically tractable settings and propose reversible, label-conditional preconditioning maps that reshape the geometry of $p_t$ by improving the conditioning of $\Sigma_t$ without altering the underlying generative model. Rather than accelerating early convergence, preconditioning primarily mitigates optimization stagnation by enabling continued progress along previously suppressed directions. Across MNIST latent flow matching and additional high-resolution datasets, we empirically track conditioning diagnostics and distributional metrics and show that preconditioning consistently yields better-trained models by avoiding suboptimal plateaus.
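
Editor's sketch: the snippet below shows what "improving the conditioning of $\Sigma_t$ with a reversible map" can look like in the simplest case. Per-axis rescaling is a stand-in chosen for illustration, not the paper's preconditioning maps, and the toy points are axis-aligned so that the covariance is diagonal and its condition number is the ratio of per-axis variances.

```python
from statistics import pvariance


def condition_number(points):
    """Ratio of largest to smallest per-axis variance (valid for a diagonal covariance)."""
    variances = [pvariance(axis) for axis in zip(*points)]
    return max(variances) / min(variances)


def rescale(points):
    """Reversible per-axis rescaling: divide each coordinate by its standard deviation."""
    stds = [pvariance(axis) ** 0.5 for axis in zip(*points)]
    scaled = [tuple(x / s for x, s in zip(p, stds)) for p in points]
    return scaled, stds


# Toy data: variance 100 along the first axis, 0.01 along the second.
points = [(-10.0, -0.1), (10.0, 0.1), (-10.0, 0.1), (10.0, -0.1)]

before = condition_number(points)   # about 1e4: badly conditioned
scaled, stds = rescale(points)
after = condition_number(scaled)    # about 1: well conditioned

# The map is reversible: multiplying by the stored stds recovers the data.
restored = [tuple(x * s for x, s in zip(p, stds)) for p in scaled]
```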

Title: Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris

Authors: Haochuan Kevin Wang
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.02348
Pdf URL: https://arxiv.org/pdf/2603.02348
Copy Paste: [[2603.02348]] Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris(https://arxiv.org/abs/2603.02348)
Keywords: diffusion
Abstract: We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
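
Editor's sketch: feasibility-constrained sampling via logit masking can be illustrated as follows. This is a generic masked softmax, not the paper's planner; the logits and the validity mask are made-up values.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over logits where invalid entries are forced to probability zero."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    top = max(masked)
    if top == float("-inf"):
        raise ValueError("at least one action must be valid")
    exps = [math.exp(x - top) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 3.0]        # hypothetical denoiser outputs for 4 placements
valid = [True, False, True, False]   # hypothetical feasibility mask

unconstrained = masked_softmax(logits, [True] * len(logits))
constrained = masked_softmax(logits, valid)

# Probability mass the unconstrained sampler would spend on invalid placements.
invalid_mass = sum(p for p, ok in zip(unconstrained, valid) if not ok)
```

After masking, the remaining probability is renormalized over valid placements only, so every sampled candidate is feasible by construction.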

Title: MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

Authors: Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, Nandita Vijaykumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02351
Pdf URL: https://arxiv.org/pdf/2603.02351
Copy Paste: [[2603.02351]] MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry(https://arxiv.org/abs/2603.02351)
Keywords: foundation model
Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.

Title: Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Authors: Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02406
Pdf URL: https://arxiv.org/pdf/2603.02406
Copy Paste: [[2603.02406]] Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles(https://arxiv.org/abs/2603.02406)
Keywords: self-supervised, generative
Abstract: Generative models have recently advanced de novo protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) Existing methods cannot jointly learn protein geometry and design tasks, where pretraining can be a solution; (2) Current pretraining methods mostly rely on local, non-rigid atomic representations for property prediction downstream tasks, limiting global geometric understanding for protein generation tasks; and (3) Existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce RigidSSL (Rigidity-Aware Self-Supervised Learning), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: this https URL.

Title: DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting

Authors: Rui-Feng Wang, Daniel Petti, Yue Chen, Changying Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02419
Pdf URL: https://arxiv.org/pdf/2603.02419
Copy Paste: [[2603.02419]] DINOv3 Visual Representations for Blueberry Perception Toward Robotic Harvesting(https://arxiv.org/abs/2603.02419)
Keywords: self-supervised, foundation model
Abstract: Vision Foundation Models trained via large-scale self-supervised learning have demonstrated strong generalization in visual perception; however, their practical role and performance limits in agricultural settings remain insufficiently understood. This work evaluates DINOv3 as a frozen backbone for blueberry robotic harvesting-related visual tasks, including fruit and bruise segmentation, as well as fruit and cluster detection. Under a unified protocol with lightweight decoders, segmentation benefits consistently from stable patch-level representations and scales with backbone size. In contrast, detection is constrained by target scale variation, patch discretization, and localization compatibility. The failure of cluster detection highlights limitations in modeling relational targets defined by spatial aggregation. Overall, DINOv3 is best viewed not as an end-to-end task model, but as a semantic backbone whose effectiveness depends on downstream spatial modeling aligned with fruit-scale and aggregation structures, providing guidance for blueberry robotic harvesting. Code and dataset will be available upon acceptance.

Title: Spectral Regularization for Diffusion Models

Authors: Satish Chandran, Nicolas Roque dos Santos, Yunshu Wu, Greg Ver Steeg, Evangelos Papalexakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02447
Pdf URL: https://arxiv.org/pdf/2603.02447
Copy Paste: [[2603.02447]] Spectral Regularization for Diffusion Models(https://arxiv.org/abs/2603.02447)
Keywords: diffusion
Abstract: Diffusion models are typically trained using pointwise reconstruction objectives that are agnostic to the spectral and multi-scale structure of natural signals. We propose a loss-level spectral regularization framework that augments standard diffusion training with differentiable Fourier- and wavelet-domain losses, without modifying the diffusion process, model architecture, or sampling procedure. The proposed regularizers act as soft inductive biases that encourage appropriate frequency balance and coherent multi-scale structure in generated samples. Our approach is compatible with DDPM, DDIM, and EDM formulations and introduces negligible computational overhead. Experiments on image and audio generation demonstrate consistent improvements in sample quality, with the largest gains observed on higher-resolution, unconditional datasets where fine-scale structure is most challenging to model.
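
Editor's sketch: the abstract does not give the form of the Fourier-domain loss, so the snippet below shows one plausible shape of "pointwise loss plus spectral regularizer" on a 1D signal. The magnitude-matching term, the weight of 0.1, and all function names are assumptions; the code only evaluates the loss value, whereas training would use an autodiff framework.

```python
import cmath


def dft_magnitudes(signal):
    """Magnitudes of the naive discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def spectrally_regularized_loss(pred, target, weight=0.1):
    """Pointwise MSE plus a weighted penalty on mismatched Fourier magnitudes."""
    pointwise = mse(pred, target)
    spectral = mse(dft_magnitudes(pred), dft_magnitudes(target))
    return pointwise + weight * spectral


target = [1.0, -1.0, 1.0, -1.0]   # pure highest-frequency signal
pred = [0.0, 0.0, 0.0, 0.0]       # a prediction that misses that frequency entirely
loss = spectrally_regularized_loss(pred, target)
```

Because the extra term lives only in the loss, the diffusion process, architecture, and sampler stay untouched, matching the "loss-level" framing in the abstract.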

Title: NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

Authors: Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02522
Pdf URL: https://arxiv.org/pdf/2603.02522
Copy Paste: [[2603.02522]] NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining(https://arxiv.org/abs/2603.02522)
Keywords: self-supervised
Abstract: Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.

Title: Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics

Authors: Kwanyoung Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02531
Pdf URL: https://arxiv.org/pdf/2603.02531
Copy Paste: [[2603.02531]] Bridging Diffusion Guidance and Anderson Acceleration via Hopfield Dynamics(https://arxiv.org/abs/2603.02531)
Keywords: diffusion, generative
Abstract: Classifier-Free Guidance (CFG) has significantly enhanced the generative quality of diffusion models by extrapolating between conditional and unconditional outputs. However, its high inference cost and limited applicability to distilled or single-step models have shifted research focus toward attention-space extrapolation. While these methods offer computational efficiency, their theoretical underpinnings remain elusive. In this work, we establish a foundational framework for attention-space extrapolation by modeling attention dynamics as fixed-point iterations within Modern Hopfield Networks. We demonstrate that the extrapolation effect in attention space constitutes a special case of Anderson Acceleration applied to these dynamics. Building on this insight and the weak contraction property, we propose Geometry Aware Attention Guidance (GAG). By decomposing attention updates into parallel and orthogonal components relative to the guidance direction, GAG stabilizes the acceleration process and maximizes guidance efficiency. Our plug-and-play method seamlessly integrates with existing frameworks while significantly improving generation quality.
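
Editor's sketch: two operations named in the abstract can be written down generically. The first is the usual CFG extrapolation between unconditional and conditional outputs; the second splits an update into components parallel and orthogonal to a guidance direction. How GAG reweights those components is not stated in the abstract, so no such rule is shown; the function names and vectors are illustrative.

```python
def cfg_extrapolate(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def split_parallel_orthogonal(update, direction):
    """Decompose `update` into parts parallel and orthogonal to `direction`."""
    dot_ud = sum(u * d for u, d in zip(update, direction))
    dot_dd = sum(d * d for d in direction)
    if dot_dd == 0:
        raise ValueError("direction must be non-zero")
    coef = dot_ud / dot_dd
    parallel = [coef * d for d in direction]
    orthogonal = [u - p for u, p in zip(update, parallel)]
    return parallel, orthogonal


guided = cfg_extrapolate(cond=[1.0, 2.0], uncond=[0.0, 0.0], scale=2.0)
parallel, orthogonal = split_parallel_orthogonal([3.0, 4.0], [1.0, 0.0])
```

With `scale=1.0` the extrapolation returns the conditional output unchanged; larger scales push further along the conditional-minus-unconditional direction.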

Title: On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

Authors: Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02546
Pdf URL: https://arxiv.org/pdf/2603.02546
Copy Paste: [[2603.02546]] On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding(https://arxiv.org/abs/2603.02546)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose a Generation-Assisted Discriminative (GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.

Title: CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think

Authors: Junzhe Shen, Jieru Zhao, Ziwei He, Zhouhan Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02547
Pdf URL: https://arxiv.org/pdf/2603.02547
Copy Paste: [[2603.02547]] CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think(https://arxiv.org/abs/2603.02547)
Keywords: diffusion, generative
Abstract: We study why continuous diffusion language models (DLMs) have lagged behind discrete diffusion approaches despite their appealing continuous generative dynamics. Under a controlled token-recovery study, we identify token rounding, the final projection from denoised embeddings to tokens, as a primary bottleneck. Building on these insights, we propose CoDAR (Continuous Diffusion with Contextual AutoRegressive Decoder), a two-stage framework that keeps diffusion entirely continuous in an embedding space while learning a strong, context-conditional discretizer: an autoregressive Transformer decoder that cross-attends to the denoised embedding sequence and performs contextualized rounding to tokens. Experiments on LM1B and OpenWebText demonstrate that CoDAR substantially improves generation quality over latent diffusion and becomes competitive with strong discrete DLMs, while exposing a simple decoder-temperature knob to navigate the fluency-diversity trade-off.

Title: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation

Authors: Chonghua Lv, Dong Zhao, Shuang Wang, Dou Quan, Ning Huyan, Nicu Sebe, Zhun Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02554
Pdf URL: https://arxiv.org/pdf/2603.02554
Copy Paste: [[2603.02554]] Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation(https://arxiv.org/abs/2603.02554)
Keywords: foundation model
Abstract: Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation-to-foundation (F2F) and +10.6% in foundation-to-local (F2L) distillation. The code will be available at this https URL.

Title: Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving

Authors: Tianze Zhu, Yinuo Wang, Wenjun Zou, Tianyi Zhang, Likun Wang, Letian Tao, Feihong Zhang, Yao Lyu, Shengbo Eben Li
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2603.02613
Pdf URL: https://arxiv.org/pdf/2603.02613
Copy Paste: [[2603.02613]] Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving(https://arxiv.org/abs/2603.02613)
Keywords: diffusion, generative
Abstract: Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms the baselines, diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC), while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on the standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.
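
Digest note: the Langevin step mentioned in the abstract can be illustrated with a generic update that moves an action along the Q-function gradient and adds Gaussian noise. The sketch below is not the DACER-F implementation; the scalar action, the function names, `step_size`, `temperature`, and the toy quadratic Q are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, Optional


def langevin_refine(
    action: float,
    grad_q: Callable[[float], float],
    step_size: float = 0.1,
    temperature: float = 0.0,
    n_steps: int = 50,
    rng: Optional[random.Random] = None,
) -> float:
    """Generic Langevin update targeting a density proportional to exp(Q / temperature).

    Hypothetical helper: each step follows the Q-gradient and adds
    Gaussian noise scaled by sqrt(2 * step_size * temperature).
    """
    rng = rng or random.Random(0)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    a = action
    for _ in range(n_steps):
        a = a + step_size * grad_q(a) + noise_scale * rng.gauss(0.0, 1.0)
    return a


def toy_grad_q(a: float) -> float:
    """Gradient of the toy critic Q(a) = -(a - 1)^2 (an assumption, not from the paper)."""
    return -2.0 * (a - 1.0)


# With temperature 0 the update is plain gradient ascent on Q.
refined = langevin_refine(0.0, toy_grad_q)
```

A positive `temperature` keeps some exploratory spread around high-Q actions, which is one plausible reading of the balance between Q-value information and exploration described above.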

Title: Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

Authors: Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02619
Pdf URL: https://arxiv.org/pdf/2603.02619
Copy Paste: [[2603.02619]] Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild(https://arxiv.org/abs/2603.02619)
Keywords: diffusion, generative
Abstract: Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, which is our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset that was constructed from an existing human motion dataset and a pose-conditioned video generative model. Constructed from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution compared to existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks. Project page: this https URL.

Title: Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

Authors: Kaifang Long, Lianbo Ma, Jiaqi Liu, Liming Liu, Guoyang Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02629
Pdf URL: https://arxiv.org/pdf/2603.02629
Copy Paste: [[2603.02629]] Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective(https://arxiv.org/abs/2603.02629)
Keywords: anomaly
Abstract: The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module: the former is dedicated to disentangling inter-object feature coupling, preventing spurious feature interference between objects; the latter filters out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD.

Title: Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Authors: Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.02650
Pdf URL: https://arxiv.org/pdf/2603.02650
Copy Paste: [[2603.02650]] Improving Diffusion Planners by Self-Supervised Action Gating with Energies(https://arxiv.org/abs/2603.02650)
Keywords: diffusion, self-supervised
Abstract: Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
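
Digest note: the re-ranking idea in the abstract (an energy from latent prediction error, combined with a value estimate) can be sketched as below. This is a minimal illustration, not the SAGE code: the squared-error energy, the linear `value - weight * energy` combination, and all function and parameter names are assumptions.

```python
from typing import Sequence


def latent_energy(z_pred: Sequence[float], z_next: Sequence[float]) -> float:
    """Squared error between a predicted latent and the encoded next-state latent."""
    return sum((p - q) ** 2 for p, q in zip(z_pred, z_next))


def rerank(values: Sequence[float], energies: Sequence[float], weight: float = 1.0) -> int:
    """Return the index of the candidate maximising value - weight * energy.

    The linear penalty is an assumed combination rule, chosen for simplicity.
    """
    scores = [v - weight * e for v, e in zip(values, energies)]
    return max(range(len(scores)), key=scores.__getitem__)


# Candidate 0 has the highest value but is dynamically inconsistent (high energy).
values = [1.0, 0.9, 0.5]
energies = [0.8, 0.1, 0.0]
best = rerank(values, energies)
```

Setting `weight` to 0 recovers plain value-based selection, so the penalty can be layered on an existing planner without retraining it.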

Title: Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Authors: Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02655
Pdf URL: https://arxiv.org/pdf/2603.02655
Copy Paste: [[2603.02655]] Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches(https://arxiv.org/abs/2603.02655)
Keywords: in-context
Abstract: Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
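
Digest note: the dynamic interval-based strategy schedules the next prediction according to how long the previous utterance is expected to take. The sketch below illustrates that scheduling loop only; the word-rate duration estimate, the `min_gap` floor, and the `generate` callback standing in for a prompted MLLM are hypothetical choices, not details from the paper.

```python
from typing import Callable, List, Tuple


def estimate_duration(utterance: str, words_per_second: float = 2.5) -> float:
    """Rough speaking time of an utterance (assumed word-rate heuristic)."""
    return len(utterance.split()) / words_per_second


def dynamic_interval_decode(
    generate: Callable[[float], str],
    video_length: float,
    min_gap: float = 1.0,
    words_per_second: float = 2.5,
) -> List[Tuple[float, str]]:
    """Query `generate` at times spaced by the previous utterance's estimated duration.

    An empty string from `generate` is treated as a pause; the clock then
    advances by `min_gap` so the loop always makes progress.
    """
    t, timeline = 0.0, []
    while t < video_length:
        utterance = generate(t)
        timeline.append((t, utterance))
        t += max(min_gap, estimate_duration(utterance, words_per_second))
    return timeline


def fake_commentator(t: float) -> str:
    """Stand-in for the model: speaks once at the start, then stays silent."""
    return "the red car takes the lead" if t == 0.0 else ""


timeline = dynamic_interval_decode(fake_commentator, video_length=5.0)
```

A fixed-interval variant would replace the last line of the loop with a constant increment.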

Title: DREAM: Where Visual Understanding Meets Text-to-Image Generation

Authors: Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02667
Pdf URL: https://arxiv.org/pdf/2603.02667
Copy Paste: [[2603.02667]] DREAM: Where Visual Understanding Meets Text-to-Image Generation(https://arxiv.org/abs/2603.02667)
Keywords: generative
Abstract: Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives, while learning strong visual representations. DREAM is built on two key techniques: During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.

Title: ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT

Authors: Yong Eun Choi, Hyoung Suk Park, Kiwan Jeon, Hyun-Cheol Park, Sung Ho Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02691
Pdf URL: https://arxiv.org/pdf/2603.02691
Copy Paste: [[2603.02691]] ReCo-Diff: Residual-Conditioned Deterministic Sampling for Cold Diffusion in Sparse-View CT(https://arxiv.org/abs/2603.02691)
Keywords: diffusion
Abstract: Cold and generalized diffusion models have recently shown strong potential for sparse-view CT reconstruction by explicitly modeling deterministic degradation processes. However, existing sampling strategies often rely on ad hoc sampling controls or fixed schedules, which remain sensitive to error accumulation and sampling instability. We propose ReCo-Diff, a residual-conditioned diffusion framework that leverages observation residuals through residual-conditioned self-guided sampling. At each sampling step, ReCo-Diff first produces a null (unconditioned) baseline reconstruction and then conditions subsequent predictions on the observation residual between the predicted image and the measured sparse-view input. This residual-driven guidance provides continuous, measurement-aware correction while preserving a deterministic sampling schedule, without requiring heuristic interventions. Experimental results demonstrate that ReCo-Diff consistently outperforms existing cold diffusion sampling baselines, achieving higher reconstruction accuracy, improved stability, and enhanced robustness under severe sparsity.

Title: FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution

Authors: Aro Kim, Myeongjin Jang, Chaewon Moon, Youngjin Shin, Jinwoo Jeong, Sang-hyo Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02692
Pdf URL: https://arxiv.org/pdf/2603.02692
Copy Paste: [[2603.02692]] FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution(https://arxiv.org/abs/2603.02692)
Keywords: diffusion
Abstract: Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration. The source code will be released at: this https URL.

Title: Sensory-Aware Sequential Recommendation via Review-Distilled Representations

Authors: Yeo Chan Yoon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02709
Pdf URL: https://arxiv.org/pdf/2603.02709
Copy Paste: [[2603.02709]] Sensory-Aware Sequential Recommendation via Review-Distilled Representations(https://arxiv.org/abs/2603.02709)
Keywords: generative
Abstract: We propose a novel framework for sensory-aware sequential recommendation that enriches item representations with linguistically extracted sensory attributes from product reviews. Our approach, ASEGR (Attribute-based Sensory Enhanced Generative Recommendation), introduces a two-stage pipeline in which a large language model is first fine-tuned as a teacher to extract structured sensory attribute-value pairs, such as "color: matte black" and "scent: vanilla", from unstructured review text. The extracted structures are then distilled into a compact student transformer that produces fixed-dimensional sensory embeddings for each item. These embeddings encode experiential semantics in a reusable form and are incorporated into standard sequential recommender architectures as additional item-level representations. We evaluate our method on four Amazon domains and integrate the learned sensory embeddings into representative sequential recommendation models, including SASRec, BERT4Rec, and BSARec. Across domains, sensory-enhanced models consistently outperform their identifier-based counterparts, indicating that linguistically grounded sensory representations provide complementary signals to behavioral interaction patterns. Qualitative analysis further shows that the extracted attributes align closely with human perceptions of products, enabling interpretable connections between natural language descriptions and recommendation behavior. Overall, this work demonstrates that sensory attribute distillation offers a principled and scalable way to bridge information extraction and sequential recommendation through structured semantic representation learning.

Title: MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration

Authors: Lingshun Kong, Jiawei Zhang, Zhengpeng Duan, Xiaohe Wu, Yueqi Yang, Xiaotao Wang, Dongqing Zou, Lei Lei, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02710
Pdf URL: https://arxiv.org/pdf/2603.02710
Copy Paste: [[2603.02710]] MiM-DiT: MoE in MoE with Diffusion Transformers for All-in-One Image Restoration(https://arxiv.org/abs/2603.02710)
Keywords: diffusion
Abstract: All-in-one image restoration is challenging because different degradation types, such as haze, blur, noise, and low-light, impose diverse requirements on restoration strategies, making it difficult for a single model to handle them effectively. In this paper, we propose a unified image restoration framework that integrates a dual-level Mixture-of-Experts (MoE) architecture with a pretrained diffusion model. The framework operates at two levels: the Inter-MoE layer adaptively combines expert groups to handle major degradation types, while the Intra-MoE layer further selects specialized sub-experts to address fine-grained variations within each type. This design enables the model to achieve coarse-grained adaptation across diverse degradation categories while performing fine-grained modulation for specific intra-class variations, ensuring high specialization in handling complex, real-world corruptions. Extensive experiments demonstrate that the proposed method performs favorably against state-of-the-art approaches on multiple image restoration tasks.

Title: CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model

Authors: Waqas Ahmed, Dean Diepeveen, Ferdous Sohel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02743
Pdf URL: https://arxiv.org/pdf/2603.02743
Copy Paste: [[2603.02743]] CoShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model(https://arxiv.org/abs/2603.02743)
Keywords: diffusion
Abstract: Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.

Title: Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

Authors: Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02760
Pdf URL: https://arxiv.org/pdf/2603.02760
Copy Paste: [[2603.02760]] Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration(https://arxiv.org/abs/2603.02760)
Keywords: diffusion
Abstract: Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens in the entire generated sequence, given the full context. This method enables more efficient and reliable quality assessment by leveraging token regeneration probabilities, facilitating both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework, which adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE is positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of the proposed DiSE.
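
Digest note: a regeneration-based confidence score of the kind described can be sketched as the average log-probability a model assigns to its own generated tokens when shown the full sequence. The code below is only an illustration: the per-position probability dictionaries stand in for a dLLM forward pass, and the mean-log-probability aggregation, the `floor` constant, and all names are assumptions.

```python
import math
from typing import Dict, List, Sequence


def regeneration_confidence(
    token_ids: Sequence[int],
    position_probs: List[Dict[int, float]],
    floor: float = 1e-12,
) -> float:
    """Mean log-probability of regenerating each generated token.

    position_probs[i] maps token id -> probability at position i, as would be
    returned by a model conditioned on the full generated sequence
    (hypothetical interface).
    """
    log_probs = [
        math.log(max(position_probs[i].get(tok, 0.0), floor))
        for i, tok in enumerate(token_ids)
    ]
    return sum(log_probs) / len(log_probs)


# Two toy outputs: the model is far more certain about the first one.
confident = regeneration_confidence([5, 9], [{5: 0.9}, {9: 0.8}])
uncertain = regeneration_confidence([5, 9], [{5: 0.3}, {9: 0.2}])
```

A length controller could, for example, compare such scores across candidate lengths, though the abstract does not specify the exact rule.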

Title: Scores Know Bobs Voice: Speaker Impersonation Attack

Authors: Chanwoo Hwang, Sunpill Kim, Yong Kiam Tan, Tianchi Liu, Seunghun Paik, Dongsoo Kim, Mondal Soumik, Khin Mi Mi Aung, Jae Hong Seo
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02781
Pdf URL: https://arxiv.org/pdf/2603.02781
Copy Paste: [[2603.02781]] Scores Know Bobs Voice: Speaker Impersonation Attack(https://arxiv.org/abs/2603.02781)
Keywords: generative
Abstract: Advances in deep learning have enabled the widespread deployment of speaker recognition systems (SRSs), yet they remain vulnerable to score-based impersonation attacks. Existing attacks that operate directly on raw waveforms require a large number of queries due to the difficulty of optimizing in high-dimensional audio spaces. Latent-space optimization within generative models offers improved efficiency, but these latent spaces are shaped by data distribution matching and do not inherently capture speaker-discriminative geometry. As a result, optimization trajectories often fail to align with the adversarial direction needed to maximize victim scores. To address this limitation, we propose an inversion-based generative attack framework that explicitly aligns the latent space of the synthesis model with the discriminative feature space of SRSs. We first analyze the requirements of an inverse model for score-based attacks and introduce a feature-aligned inversion strategy that geometrically synchronizes latent representations with speaker embeddings. This alignment ensures that latent updates directly translate into score improvements. Moreover, it enables new attack paradigms, including subspace-projection-based attacks, which were previously infeasible due to the absence of a faithful feature-to-audio mapping. Experiments show that our method significantly improves query efficiency, achieving competitive attack success rates with on average 10x fewer queries than prior approaches. In particular, the enabled subspace-projection-based attack attains up to 91.65% success using only 50 queries. These findings establish feature-aligned inversion as a key tool for evaluating the robustness of modern SRSs against score-based impersonation threats.

Title: Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

Authors: Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D'Amato, Clément Grisi, Luc Builtjes, Joeran S. Bosma, Judith Lefkes, Rianne A. Weber, James A. Meakin, Thomas Koopman, Anne Mickan, Mathias Prokop, Ewoud J. Smit, Geert Litjens, Jeroen van der Laak, Bram van Ginneken, Maarten de Rooij, Henkjan Huisman, Colin Jacobs, Francesco Ciompi, Alessa Hering (and on behalf of the UNICORN consortium)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02790
Pdf URL: https://arxiv.org/pdf/2603.02790
Copy Paste: [[2603.02790]] Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language(https://arxiv.org/abs/2603.02790)
Keywords: foundation model
Abstract: Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via this http URL.

Title: ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Authors: Douglass Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02805
Pdf URL: https://arxiv.org/pdf/2603.02805
Copy Paste: [[2603.02805]] ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink(https://arxiv.org/abs/2603.02805)
Keywords: self-supervised
Abstract: Digital ink, the coordinate stream captured from stylus or touch input, lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations require large vocabularies, face out-of-vocabulary issues, and underperform vectors on recognition. We propose ScribeTokens, a tokenization that decomposes pen movement into unit pixel steps. Together with two pen-state tokens, this fixed 10-token base vocabulary suffices to represent any digital ink and enables aggressive BPE compression. On handwritten text generation, ScribeTokens dramatically outperforms vectors (17.33% vs. 70.29% CER), showing tokens are far more effective for generation. On recognition, ScribeTokens is the only token representation to outperform vectors without pretraining. We further introduce next-ink-token prediction as a self-supervised pretraining strategy, which consistently improves recognition across all token-based models and accelerates convergence by up to 83x. With pretraining, ScribeTokens achieves the best recognition results across all representations on both datasets (8.27% CER on IAM, 9.83% on DeepWriting).
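
Digest note: one way to picture a 10-token ink vocabulary is eight unit-step direction tokens plus two pen-state tokens. The sketch below makes that concrete under stated assumptions: the compass token names, the `DOWN`/`UP` pen tokens, the Bresenham-style stepping between integer pixel coordinates, and the y-up convention are all hypothetical and may differ from the paper's scheme.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]

# Eight unit steps in the 8-neighbourhood (assumed token names, y pointing up).
DIRS = {
    (1, 0): "E", (1, 1): "NE", (0, 1): "N", (-1, 1): "NW",
    (-1, 0): "W", (-1, -1): "SW", (0, -1): "S", (1, -1): "SE",
}


def _sign(v: int) -> int:
    return (v > 0) - (v < 0)


def unit_steps(p: Point, q: Point) -> List[Point]:
    """Bresenham-style walk from p to q, returned as a list of unit steps."""
    (x, y), (x1, y1) = p, q
    dx, dy = abs(x1 - x), abs(y1 - y)
    sx, sy = _sign(x1 - x), _sign(y1 - y)
    err, steps = dx - dy, []
    while (x, y) != (x1, y1):
        e2, step_x, step_y = 2 * err, 0, 0
        if e2 > -dy:
            err, x, step_x = err - dy, x + sx, sx
        if e2 < dx:
            err, y, step_y = err + dx, y + sy, sy
        steps.append((step_x, step_y))
    return steps


def tokenize(strokes: Sequence[Sequence[Point]], origin: Point = (0, 0)) -> List[str]:
    """Emit direction tokens for every move and DOWN/UP around each stroke."""
    tokens, pos = [], origin
    for stroke in strokes:
        tokens += [DIRS[s] for s in unit_steps(pos, stroke[0])]  # pen-up travel
        tokens.append("DOWN")
        pos = stroke[0]
        for point in stroke[1:]:
            tokens += [DIRS[s] for s in unit_steps(pos, point)]
            pos = point
        tokens.append("UP")
    return tokens


tokens = tokenize([[(0, 0), (2, 0)], [(2, 1), (2, 2)]])
```

Because every movement reduces to the same ten symbols, no stroke can fall outside the vocabulary, and repeated runs such as `E E E` are natural candidates for BPE merges.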

Title: Toward Early Quality Assessment of Text-to-Image Diffusion Models

Authors: Huanlei Guo, Hongxin Wei, Bingyi Jing
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02829
Pdf URL: https://arxiv.org/pdf/2603.02829
Copy Paste: [[2603.02829]] Toward Early Quality Assessment of Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.02829)
Keywords: diffusion, generative
Abstract: Recent text-to-image (T2I) diffusion and flow-matching models can produce highly realistic images from natural language prompts. In practical scenarios, T2I systems are often run in a "generate-then-select" mode: many seeds are sampled and only a few images are kept for use. However, this pipeline is highly resource-intensive since each candidate requires tens to hundreds of denoising steps, and evaluation metrics such as CLIPScore and ImageReward are post-hoc. In this work, we address this inefficiency by introducing Probe-Select, a plug-in module that enables efficient evaluation of image quality within the generation process. We observe that certain intermediate denoiser activations, even at early timesteps, encode a stable coarse structure (object layout and spatial arrangement) that strongly correlates with final image fidelity. Probe-Select exploits this property by predicting final quality scores directly from early activations, allowing unpromising seeds to be terminated early. Across diffusion and flow-matching backbones, our experiments show that early evaluation at only 20% of the trajectory accurately ranks candidate seeds and enables selective continuation. This strategy reduces sampling cost by over 60% while improving the quality of the retained images, demonstrating that early structural signals can effectively guide selective generation without altering the underlying generative model. Code is available at this https URL.
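
Digest note: the cost argument behind early seed selection can be worked through with simple arithmetic. In the sketch below, every seed is denoised up to the probe point, and only the top-ranked seeds are run to completion. The function names, the example scores, and the choice to ignore the probe's own overhead are assumptions; the numbers are illustrative and are not results from the paper.

```python
from typing import Dict, List


def select_seeds(early_scores: Dict[int, float], keep: int) -> List[int]:
    """Return the `keep` seeds with the highest predicted final quality."""
    return sorted(early_scores, key=early_scores.get, reverse=True)[:keep]


def relative_cost(n_seeds: int, keep: int, probe_frac: float) -> float:
    """Denoising cost relative to running every seed to completion.

    All seeds pay `probe_frac` of a trajectory; only kept seeds pay the rest.
    Probe overhead is ignored (a simplifying assumption).
    """
    return (n_seeds * probe_frac + keep * (1.0 - probe_frac)) / n_seeds


# Hypothetical early-probe scores for three seeds.
kept = select_seeds({11: 0.3, 42: 0.9, 7: 0.6}, keep=2)

# Probing 10 seeds at 20% of the trajectory and continuing 2 of them.
cost = relative_cost(n_seeds=10, keep=2, probe_frac=0.2)
```

Under these assumptions the example costs 36% of full generation; the saving grows as the kept fraction shrinks and falls as the probe point moves later.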

Title: Adapting Time Series Foundation Models through Data Mixtures

Authors: Thomas L. Lee, Edoardo M. Ponti, Amos Storkey
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.02840
Pdf URL: https://arxiv.org/pdf/2603.02840
Copy Paste: [[2603.02840]] Adapting Time Series Foundation Models through Data Mixtures(https://arxiv.org/abs/2603.02840)
Keywords: foundation model
Abstract: Time series foundation models (TSFMs) have become increasingly popular for zero-shot forecasting. However, for a new time series domain not fully covered by the pretraining set, performance can suffer. Therefore, when a practitioner cares about a new domain and has access to a set of related datasets, the question arises: how best to fine-tune a TSFM to improve zero-shot forecasting? A typical approach to this type of problem is to fine-tune a LoRA module on all datasets or separately on each dataset. Tuning a separate module on each dataset allows for the specialisation of the TSFM to different types of data distribution, by selecting differing combinations of per-dataset modules for different time series contexts. However, we find that using per-dataset modules might not be optimal, since a time series dataset can contain data from several types of distributions, i.e. sub-domains. This can be due to the distribution shifting or to different dimensions of the time series having differing distributions. Hence, we propose MixFT, which re-divides the data using Bayesian mixtures into sets that best represent the sub-domains present in the data, and fine-tunes separately on each of these sets. This re-division of the data ensures that each set is more homogeneous, leading to fine-tuned modules focused on specific sub-domains. Our experiments show that MixFT performs better than both per-dataset methods and fine-tuning a single module on all the data. This suggests that by re-partitioning the data to represent sub-domains we can better specialise TSFMs to improve zero-shot forecasting.

Title: A Browser-based Open Source Assistant for Multimodal Content Verification

Authors: Rosanna Milner, Michael Foster, Olesya Razuvayevskaya, Ian Roberts, Valentin Porcellini, Denis Teyssou, Kalina Bontcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02842
Pdf URL: https://arxiv.org/pdf/2603.02842
Copy Paste: [[2603.02842]] A Browser-based Open Source Assistant for Multimodal Content Verification(https://arxiv.org/abs/2603.02842)
Keywords: generative
Abstract: Disinformation and false content produced by generative AI pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media information. While there is an abundance of NLP models for detecting credibility signals such as persuasion techniques, subjectivity, or machine-generated text, such methods often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the VERIFICATION ASSISTANT, a browser-based tool designed to bridge this gap. The VERIFICATION ASSISTANT, a core component of the widely adopted VERIFICATION PLUGIN (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, delivering actionable credibility signals, estimating AI-generated content, and providing other verification guidance in a clear, easy-to-digest format. This paper showcases the tool architecture, its integration of multiple NLP services, and its real-world application to detecting disinformation.

Title: DSBA: Dynamic Stealthy Backdoor Attack with Collaborative Optimization in Self-Supervised Learning

Authors: Jiayao Wang, Mohammad Maruf Hasan, Yiping Zhang, Xiaoying Lei, Jiale Zhang, Qilin Wu, Junwu Zhu, Dongfang Zhao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2603.02849
Pdf URL: https://arxiv.org/pdf/2603.02849
Copy Paste: [[2603.02849]] DSBA: Dynamic Stealthy Backdoor Attack with Collaborative Optimization in Self-Supervised Learning(https://arxiv.org/abs/2603.02849)
Keywords: self-supervised
Abstract: Self-Supervised Learning (SSL) has emerged as a significant paradigm in representation learning thanks to its ability to learn without extensive labeled data, its strong generalization capabilities, and its potential for privacy preservation. However, recent research reveals that SSL models are also vulnerable to backdoor attacks. Existing backdoor attack methods in the SSL context commonly suffer from issues such as high detectability of triggers, feature entanglement, and pronounced out-of-distribution properties in poisoned samples, all of which compromise attack effectiveness and stealthiness. To address this, we propose a Dynamic Stealthy Backdoor Attack (DSBA) backed by a new technique we term Collaborative Optimization. This method decouples the attack process into two collaborative optimization layers: the outer-layer optimization trains a backdoor encoder responsible for global feature space remodeling, aiming to achieve precise backdoor implantation while preserving core functionality; meanwhile, the inner-layer optimization employs a dynamically optimized generator to adaptively produce optimally concealed triggers for individual samples, achieving coordinated concealment across feature space and visual space. We also introduce multiple loss functions to dynamically balance attack performance and stealthiness, in which we employ an adaptive weight scheduling mechanism to enhance training stability. Extensive experiments on various mainstream SSL algorithms and five public datasets demonstrate that: (i) DSBA significantly enhances Attack Success Rate (ASR) and stealthiness while maintaining downstream task accuracy; and (ii) DSBA exhibits superior robustness against existing mainstream defense methods.

Title: SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

Authors: Xinjie Zhu, Zijing Zhao, Hui Jin, Qingxiao Guo, Yilong Ma, Yunhao Wang, Xiaobing Guo, Weifeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02882
Pdf URL: https://arxiv.org/pdf/2603.02882
Copy Paste: [[2603.02882]] SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion(https://arxiv.org/abs/2603.02882)
Keywords: diffusion
Abstract: Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. Beyond post-processing watermarks which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we propose to generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit-accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at this https URL.

Title: SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers

Authors: Wonsuk Jang, Thierry Tambe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02883
Pdf URL: https://arxiv.org/pdf/2603.02883
Copy Paste: [[2603.02883]] SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers(https://arxiv.org/abs/2603.02883)
Keywords: diffusion
Abstract: Diffusion Transformers (DiT) achieve strong video generation quality, but their memory and compute costs hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality under high activation variation and the need to preserve semantic/temporal coherence. We propose SemanticDialect, which advances recent block-wise mixed-format quantization (selecting a per-block optimal format, a dialect, from multiple candidates, a formatbook) by scaling the formatbook with lookup tables for quantization error and quantized values, enabling efficient per-block format selection and quantization at low online cost. We also introduce activation decomposition that reduces quantization error by re-quantizing and adding back residual errors, with attention-guided salient token selection. We further propose semantic-aware dialect assignment (SeDA) to improve quantized value consistency by sharing a sub-formatbook among semantically correlated tokens. Experiments on video DiT (VDiT) models show that SemanticDialect outperforms prior VDiT quantization methods and fine-grained block-wise format baselines, while approaching FP16 quality on Open-Sora 2.0.

Title: Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

Authors: Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, Ronggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02893
Pdf URL: https://arxiv.org/pdf/2603.02893
Copy Paste: [[2603.02893]] Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting(https://arxiv.org/abs/2603.02893)
Keywords: diffusion
Abstract: 3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating a back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g. flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

Title: ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

Authors: Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02897
Pdf URL: https://arxiv.org/pdf/2603.02897
Copy Paste: [[2603.02897]] ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization(https://arxiv.org/abs/2603.02897)
Keywords: generative
Abstract: Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and also delivers over 10 times faster encoding and decoding compared with MS-ILLM on GPUs for efficiency.
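
Illustrative note (not the authors' code): the stage-by-stage residual encoding described in the abstract can be sketched in plain Python as below. The random codebooks, the helper names (`make_codebooks`, `rvq_encode`, `rvq_decode`), and the zero codeword added to every stage are assumptions made for this toy example only; a real codec would learn its codebooks and operate on latent features rather than a 3-dimensional vector.

```python
import random


def sq_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_codebooks(num_stages, size, dim, seed=0):
    """Hypothetical random codebooks; later stages use smaller codewords.

    Each codebook contains the zero vector (an assumption of this sketch),
    so a stage can never increase the residual norm.
    """
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for book in codebooks:
        best = min(range(len(book)), key=lambda i: sq_error(book[i], residual))
        indices.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of indices yields a coarser preview."""
    out = [0.0] * len(codebooks[0][0])
    for i, book in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, book[i])]
    return out


codebooks = make_codebooks(num_stages=4, size=16, dim=3)
x = [0.3, -0.7, 0.5]
indices = rvq_encode(x, codebooks)
# Reconstruction error after decoding 0, 1, ..., 4 stages of the bitstream.
errors = [sq_error(x, rvq_decode(indices[:k], codebooks))
          for k in range(len(indices) + 1)]
print(indices, errors)
```

Decoding only a prefix of `indices` gives the preview-from-partial-data behaviour: in this sketch the error never increases as stages are added, because each stage may fall back to the zero codeword.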

Title: Eliciting Numerical Predictive Distributions of LLMs Without Autoregression

Authors: Julianna Piskorz, Katarzyna Kobalczyk, Mihaela van der Schaar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02913
Pdf URL: https://arxiv.org/pdf/2603.02913
Copy Paste: [[2603.02913]] Eliciting Numerical Predictive Distributions of LLMs Without Autoregression(https://arxiv.org/abs/2603.02913)
Keywords: in-context
Abstract: Large Language Models (LLMs) have recently been successfully applied to regression tasks -- such as time series forecasting and tabular prediction -- by leveraging their in-context learning abilities. However, their autoregressive decoding process may be ill-suited to continuous-valued outputs, where obtaining predictive distributions over numerical targets requires repeated sampling, leading to high computational cost and inference time. In this work, we investigate whether distributional properties of LLM predictions can be recovered without explicit autoregressive generation. To this end, we study a set of regression probes trained to predict statistical functionals (e.g., mean, median, quantiles) of the LLM's numerical output distribution directly from its internal representations. Our results suggest that LLM embeddings carry informative signals about summary statistics of their predictive distributions, including the numerical uncertainty. This investigation opens up new questions about how LLMs internally encode uncertainty in numerical tasks, and about the feasibility of lightweight alternatives to sampling-based approaches for uncertainty-aware numerical predictions.

Title: Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Authors: Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02919
Pdf URL: https://arxiv.org/pdf/2603.02919
Copy Paste: [[2603.02919]] Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers(https://arxiv.org/abs/2603.02919)
Keywords: diffusion
Abstract: Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity from given text descriptions involving motion. However, how Video DiTs convert motion words into video remains insufficiently understood. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.

Title: GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights

Authors: Qiming He, Jing Li, Tian Guan, Yifei Ma, Zimo Zhao, Yanxia Wang, Hongjing Chen, Yingming Xu, Shuang Ge, Yexing Zhang, Yizhi Wang, Xinrui Chen, Lianghui Zhu, Yiqing Liu, Qingxia Hou, Shuyan Zhao, Xiaoqin Wang, Lili Ma, Peizhen Hu, Qiang Huang, Zihan Wang, Zhiyuan Shen, Junru Cheng, Siqi Zeng, Jiurun Chen, Zhen Song, Chao He, Zhe Wang, Yonghong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02926
Pdf URL: https://arxiv.org/pdf/2603.02926
Copy Paste: [[2603.02926]] GloPath: An Entity-Centric Foundation Model for Glomerular Lesion Assessment and Clinicopathological Insights(https://arxiv.org/abs/2603.02926)
Keywords: self-supervised, foundation model
Abstract: Glomerular pathology is central to the diagnosis and prognosis of renal diseases, yet the heterogeneity of glomerular morphology and fine-grained lesion patterns remain challenging for current AI approaches. We present GloPath, an entity-centric foundation model trained on over one million glomeruli extracted from 14,049 renal biopsy specimens using multi-scale and multi-view self-supervised learning. GloPath addresses two major challenges in nephropathology: glomerular lesion assessment and clinicopathological insights discovery. For lesion assessment, GloPath was benchmarked across three independent cohorts on 52 tasks, including lesion recognition, grading, few-shot classification, and cross-modality diagnosis, outperforming state-of-the-art methods in 42 tasks (80.8%). In the large-scale real-world study, it achieved an ROC-AUC of 91.51% for lesion recognition, demonstrating strong robustness in routine clinical settings. For clinicopathological insights, GloPath systematically revealed statistically significant associations between glomerular morphological parameters and clinical indicators across 224 morphology-clinical variable pairs, demonstrating its capacity to connect tissue-level pathology with patient-level outcomes. Together, these results position GloPath as a scalable and interpretable platform for glomerular lesion assessment and clinicopathological discovery, representing a step toward clinically translatable AI in renal pathology.

Title: TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Authors: Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, JinQiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02929
Pdf URL: https://arxiv.org/pdf/2603.02929
Copy Paste: [[2603.02929]] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval(https://arxiv.org/abs/2603.02929)
Keywords: generative
Abstract: Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

Title: Contextual Latent World Models for Offline Meta Reinforcement Learning

Authors: Mohammadreza Nakheai, Aidan Scannell, Kevin Luck, Joni Pajarinen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02935
Pdf URL: https://arxiv.org/pdf/2603.02935
Copy Paste: [[2603.02935]] Contextual Latent World Models for Offline Meta Reinforcement Learning(https://arxiv.org/abs/2603.02935)
Keywords: self-supervised
Abstract: Offline meta-reinforcement learning seeks to learn policies that generalize across related tasks from fixed datasets. Context-based methods infer a task representation from transition histories, but learning effective task representations without supervision remains a challenge. In parallel, latent world models have demonstrated strong self-supervised representation learning through temporal consistency. We introduce contextual latent world models, which condition latent world models on inferred task representations and train them jointly with the context encoder. This enforces task-conditioned temporal consistency, yielding task representations that capture task-dependent dynamics rather than merely discriminating between tasks. Our method learns more expressive task representations and significantly improves generalization to unseen tasks across MuJoCo, Contextual-DeepMind Control, and Meta-World benchmarks.

Title: TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration

Authors: Benlei Cui, Shaoxuan He, Bukun Huang, Zhizeng Ye, Yunyun Sun, Longtao Huang, Hui Xue, Yang Yang, Jingqun Tang, Zhou Zhao, Haiwen Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02943
Pdf URL: https://arxiv.org/pdf/2603.02943
Copy Paste: [[2603.02943]] TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration(https://arxiv.org/abs/2603.02943)
Keywords: diffusion
Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases. To address these challenges, we propose Trajectory-Consistent Padé approximation, a feature prediction framework grounded in Padé approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Padé incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Padé. For instance, TC-Padé achieves 2.88x acceleration on FLUX.1-dev and 1.72x on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.
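
Illustrative note (not the TC-Padé method itself): the general contrast between a polynomial (Taylor) extrapolator and a rational (Padé) one can be seen on a textbook function. The sketch below compares the second-order Taylor polynomial of ln(1 + x) with its [1/1] Padé approximant, both of which agree with the function through the x^2 term. The choice of ln(1 + x) and the function names are assumptions made for illustration; the paper applies Padé approximation to cached diffusion features, which is not reproduced here.

```python
import math


def taylor2(x):
    """Second-order Taylor polynomial of ln(1 + x) around 0: x - x^2 / 2."""
    return x - x * x / 2.0


def pade11(x):
    """[1/1] Pade approximant of ln(1 + x) around 0: x / (1 + x / 2)."""
    return x / (1.0 + x / 2.0)


# Compare absolute errors as the evaluation point moves away from 0,
# loosely analogous to a larger interval between sampling steps.
for x in (0.1, 1.0, 2.0):
    exact = math.log1p(x)
    print(f"x={x}: taylor_err={abs(taylor2(x) - exact):.4f}, "
          f"pade_err={abs(pade11(x) - exact):.4f}")
```

For this function the rational form degrades more gracefully as x grows, which is the kind of behaviour the abstract attributes to rational functions in general terms.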

Title: Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention

Authors: Wensheng Wu, Zheming Lu, Ziqian Lu, Zewei He, Xuecheng Sun, Zhao Wang, Jungong Han, Yunlong Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02964
Pdf URL: https://arxiv.org/pdf/2603.02964
Copy Paste: [[2603.02964]] Improving Anomaly Detection with Foundation-Model Synthesis and Wavelet-Domain Attention(https://arxiv.org/abs/2603.02964)
Keywords: foundation model, anomaly
Abstract: Industrial anomaly detection faces significant challenges due to the scarcity of anomalous samples and the complexity of real-world anomalies. In this paper, we propose a foundation model-based anomaly synthesis pipeline (FMAS) that generates highly realistic anomalous samples without fine-tuning or class-specific training. Motivated by the distinct frequency-domain characteristics of anomalies, we introduce a Wavelet Domain Attention Module (WDAM), which exploits adaptive sub-band processing to enhance anomaly feature extraction. The combination of FMAS and WDAM significantly improves anomaly detection sensitivity while maintaining computational efficiency. Comprehensive experiments on MVTec AD and VisA datasets demonstrate that WDAM, as a plug-and-play module, achieves substantial performance gains against existing baselines.

Title: Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

Authors: Ertunc Erdil, Nico Schulthess, Guney Tombak, Ender Konukoglu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02974
Pdf URL: https://arxiv.org/pdf/2603.02974
Copy Paste: [[2603.02974]] Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection(https://arxiv.org/abs/2603.02974)
Keywords: anomaly
Abstract: DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from "normal" images and model them independently, ignoring spatial and neighborhood relationships between patches. This implicitly assumes that self-attention and positional encodings sufficiently encode contextual information within each patch embedding. In addition, the normative distribution is often modeled as memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network and enables fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements. Code is available at the project page: this https URL.

Title: Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients

Authors: Tian-Shuang Wu, Shen-Huan Lyu, Ning Chen, Yi-Xiao He, Bing Tang, Baoliu Ye, Qingfu Zhang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2603.03007
Pdf URL: https://arxiv.org/pdf/2603.03007
Copy Paste: [[2603.03007]] Breaking the Prototype Bias Loop: Confidence-Aware Federated Contrastive Learning for Highly Imbalanced Clients(https://arxiv.org/abs/2603.03007)
Keywords: generative
Abstract: Local class imbalance and data heterogeneity across clients often trap prototype-based federated contrastive learning in a prototype bias loop: biased local prototypes induced by imbalanced data are aggregated into biased global prototypes, which are repeatedly reused as contrastive anchors, accumulating errors across communication rounds. To break this loop, we propose Confidence-Aware Federated Contrastive Learning (CAFedCL), a novel framework that improves the prototype aggregation mechanism and strengthens the contrastive alignment guided by prototypes. CAFedCL employs a confidence-aware aggregation mechanism that leverages predictive uncertainty to downweight high-variance local prototypes. In addition, generative augmentation for minority classes and geometric consistency regularization are integrated to stabilize the structure between classes. From a theoretical perspective, we provide an expectation-based analysis showing that our aggregation reduces estimation variance, thereby bounding global prototype drift and ensuring convergence. Extensive experiments under varying levels of class imbalance and data heterogeneity demonstrate that CAFedCL consistently outperforms representative federated baselines in both accuracy and client fairness.
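
Illustrative note (an assumption, not CAFedCL's actual rule): one simple way to downweight high-variance local prototypes during aggregation is inverse-variance weighting, sketched below. The function name `aggregate_prototypes`, the scalar per-client variance, and the `eps` stabilizer are all hypothetical; the abstract does not specify the weighting formula the paper uses.

```python
def aggregate_prototypes(prototypes, variances, eps=1e-8):
    """Inverse-variance weighted mean of per-client prototypes for one class.

    prototypes: list of equal-length vectors, one per client.
    variances:  one scalar uncertainty estimate per client (assumed given).
    """
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    dim = len(prototypes[0])
    return [sum(w * p[d] for w, p in zip(weights, prototypes)) / total
            for d in range(dim)]


# Two clients disagree; the second reports much higher predictive variance.
local_prototypes = [[1.0, 0.0], [0.0, 1.0]]
client_variances = [0.1, 0.9]
global_prototype = aggregate_prototypes(local_prototypes, client_variances)
print(global_prototype)
```

In this toy case the aggregated prototype stays close to the low-variance client, whereas a plain average would sit halfway between the two.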

Title: BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology

Authors: Xiaojing Guo, Jiatai Lin, Yumian Jia, Jingqi Huang, Zeyan Xu, Weidong Li, Longfei Wang, Jingjing Chen, Qin Li, Weiwei Wang, Lifang Cui, Wen Yue, Zhiqiang Cheng, Xiaolong Wei, Jianzhong Yu, Xia Jin, Baizhou Li, Honghong Shen, Jing Li, Chunlan Li, Yanfen Cui, Yi Dai, Yiling Yang, Xiaolong Qian, Liu Yang, Yang Yang, Guangshen Gao, Yaqing Li, Lili Zhai, Chenying Liu, Tianhua Zhang, Zhenwei Shi, Cheng Lu, Xingchen Zhou, Jing Xu, Miaoqing Zhao, Fang Mei, Jiaojiao Zhou, Ning Mao, Fangfang Liu, Chu Han, Zaiyi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03030
Pdf URL: https://arxiv.org/pdf/2603.03030
Copy Paste: [[2603.03030]] BRIGHT: A Collaborative Generalist-Specialist Foundation Model for Breast Pathology(https://arxiv.org/abs/2603.03030)
Keywords: foundation model
Abstract: Generalist pathology foundation models (PFMs), pretrained on large-scale multi-organ datasets, have demonstrated remarkable predictive capabilities across diverse clinical applications. However, their proficiency on the full spectrum of clinically essential tasks within a specific organ system remains an open question due to the lack of large-scale validation cohorts for a single organ as well as the absence of a tailored training paradigm that can effectively translate broad histomorphological knowledge into the organ-specific expertise required for specialist-level interpretation. In this study, we propose BRIGHT, the first PFM specifically designed for breast pathology, trained on approximately 210 million histopathology tiles from over 51,000 breast whole-slide images derived from a cohort of over 40,000 patients across 19 hospitals. BRIGHT employs a collaborative generalist-specialist framework to capture both universal and organ-specific features. To comprehensively evaluate the performance of PFMs on breast oncology, we curate the largest multi-institutional cohorts to date for downstream task development and evaluation, comprising over 25,000 WSIs across 10 hospitals. The validation cohorts cover the full spectrum of breast pathology across 24 distinct clinical tasks spanning diagnosis, biomarker prediction, treatment response and survival prediction. Extensive experiments demonstrate that BRIGHT outperforms three leading generalist PFMs, achieving state-of-the-art (SOTA) performance in 21 of 24 internal validation tasks and in 5 of 10 external validation tasks with excellent heatmap interpretability. By evaluating on large-scale validation cohorts, this study not only demonstrates BRIGHT's clinical utility in breast oncology but also validates a collaborative generalist-specialist paradigm, providing a scalable template for developing PFMs on a specific organ system.

Title: Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection

Authors: Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03095
Pdf URL: https://arxiv.org/pdf/2603.03095
Copy Paste: [[2603.03095]] Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection(https://arxiv.org/abs/2603.03095)
Keywords: generative
Abstract: Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.

Title: MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

Authors: Jun Yeong Park, JunYoung Seo, Minji Kang, Yu Rang Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03101
Pdf URL: https://arxiv.org/pdf/2603.03101
Copy Paste: [[2603.03101]] MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection(https://arxiv.org/abs/2603.03101)
Keywords: anomaly
Abstract: The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. The code is available at this https URL.
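Illustrative sketch (not from the paper): the abstract names a simplex equiangular tight frame (ETF) as the target geometry for expert outputs but does not spell out the loss. The snippet below only builds the standard simplex ETF for k vectors and checks its defining property, namely unit norms and pairwise cosine -1/(k-1). The function names and the choice k = 4 are assumptions made for illustration.

```python
def simplex_etf(k):
    """Return k unit vectors in R^k whose pairwise cosine is -1/(k-1)."""
    scale = (k / (k - 1)) ** 0.5
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (dot(u, u) ** 0.5 * dot(v, v) ** 0.5)

# Hypothetical number of experts, chosen only for the demonstration.
k = 4
frame = simplex_etf(k)
pairwise = [cosine(frame[i], frame[j]) for i in range(k) for j in range(i + 1, k)]
norms = [dot(v, v) ** 0.5 for v in frame]
```

A loss that pulls expert outputs toward this configuration would penalize deviations of their pairwise cosines from -1/(k-1); how MoECLIP formulates that penalty is not stated in the abstract.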

Title: AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis

Authors: Maryam Heidari (1), Nantheera Anantrasirichai (1), Steven Walker (2), Rahul Bhatnagar (2), Alin Achim (1) ((1) University of Bristol, UK, (2) Bristol Medical School, University of Bristol, UK)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03125
Pdf URL: https://arxiv.org/pdf/2603.03125
Copy Paste: [[2603.03125]] AWDiff: An a trous wavelet diffusion model for lung ultrasound image synthesis(https://arxiv.org/abs/2603.03125)
Keywords: diffusion, foundation model, generative
Abstract: Lung ultrasound (LUS) is a safe and portable imaging modality, but the scarcity of data limits the development of machine learning methods for image interpretation and disease monitoring. Existing generative augmentation methods, such as Generative Adversarial Networks (GANs) and diffusion models, often lose subtle diagnostic cues due to resolution reduction, particularly B-lines and pleural irregularities. We propose A trous Wavelet Diffusion (AWDiff), a diffusion-based augmentation framework that integrates the a trous wavelet transform to preserve fine-scale structures while avoiding destructive downsampling. In addition, semantic conditioning with BioMedCLIP, a vision-language foundation model trained on large-scale biomedical corpora, enforces alignment with clinically meaningful labels. On a LUS dataset, AWDiff achieved lower distortion and higher perceptual quality compared to existing methods, demonstrating both structural fidelity and clinical diversity.

Title: Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Authors: Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03143
Pdf URL: https://arxiv.org/pdf/2603.03143
Copy Paste: [[2603.03143]] Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing(https://arxiv.org/abs/2603.03143)
Keywords: diffusion, foundation model
Abstract: Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.

Title: Information Routing in Atomistic Foundation Models: How Equivariance Creates Linearly Disentangled Representations

Authors: Joshua Steier
Subjects: cs.LG, cs.AI, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2603.03155
Pdf URL: https://arxiv.org/pdf/2603.03155
Copy Paste: [[2603.03155]] Information Routing in Atomistic Foundation Models: How Equivariance Creates Linearly Disentangled Representations(https://arxiv.org/abs/2603.03155)
Keywords: foundation model
Abstract: What do atomistic foundation models encode in their intermediate representations, and how is that information organized? We introduce Composition Projection Decomposition (CPD), which uses QR projection to linearly remove composition signal from learned representations and probes the geometric residual. Across eight models from five architectural families on QM9 molecules and Materials Project crystals, we find a disentanglement gradient: tensor product equivariant architectures (MACE) produce representations where geometry is almost fully linearly accessible after composition removal ($R^2_{\text{geom}} = 0.782$ for HOMO-LUMO gap), while handcrafted descriptors (ANI-2x) entangle the same information nonlinearly ($R^2_{\text{geom}} = -0.792$ under Ridge; $R^2 = +0.784$ under MLP). MACE routes target-specific signal through irreducible representation channels -- dipole to $L = 1$, HOMO-LUMO gap to $L = 0$ -- a pattern not observed in ViSNet's vector-scalar architecture under the same probe. We show that gradient boosted tree probes on projected residuals are systematically inflated, recovering $R^2 = 0.68$--$0.95$ on a purely compositional target, and recommend linear probes as the primary metric. Linearly disentangled representations are more sample-efficient under linear probing, suggesting a practical advantage for equivariant architectures beyond raw prediction accuracy.
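Illustrative sketch (not from the paper): the abstract describes CPD as using a QR projection to linearly remove composition signal from a representation and then probing the residual. A minimal version of that projection step, under assumptions, might look as follows. Gram-Schmidt is used here in place of a library QR routine because it spans the same column space as the Q factor. The toy vectors (`ones`, `carbon_count`, `geometry`) and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns, tol=1e-12):
    """Modified Gram-Schmidt over the composition columns (same span as Q in QR)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # skip columns that are linearly dependent
            basis.append([vi / norm for vi in v])
    return basis

def remove_composition(feature, composition_columns):
    """Return the part of `feature` orthogonal to the composition columns."""
    residual = list(feature)
    for q in orthonormal_basis(composition_columns):
        c = dot(q, residual)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return residual

# Toy data over four samples (hypothetical values).
ones = [1.0, 1.0, 1.0, 1.0]           # intercept column
carbon_count = [1.0, 2.0, 3.0, 4.0]   # a composition descriptor
geometry = [1.0, -1.0, -1.0, 1.0]     # signal orthogonal to both columns above

# One representation dimension that mixes composition and geometry.
feature = [3.0 * o + 2.0 * c + 0.5 * g
           for o, c, g in zip(ones, carbon_count, geometry)]
residual = remove_composition(feature, [ones, carbon_count])
```

In this toy case the residual equals `0.5 * geometry`, since the composition part is removed exactly. A linear probe would then be fit on such residuals to measure how much geometric signal remains linearly accessible.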

Title: Kling-MotionControl Technical Report

Authors: Kling Team: Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Kang He, Xu He, Jingyun Hua, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Fan Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Tiancheng Wen, Zhiyong Wu, Haoxian Zhang, Runze Zhao, Yuanxing Zhang, Yan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03160
Pdf URL: https://arxiv.org/pdf/2603.03160
Copy Paste: [[2603.03160]] Kling-MotionControl Technical Report(https://arxiv.org/abs/2603.03160)
Keywords: generative
Abstract: Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.

Title: MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

Authors: Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03192
Pdf URL: https://arxiv.org/pdf/2603.03192
Copy Paste: [[2603.03192]] MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization(https://arxiv.org/abs/2603.03192)
Keywords: foundation model
Abstract: Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.

Title: Understanding and Mitigating Dataset Corruption in LLM Steering

Authors: Cullen Anderson, Narmeen Oozeer, Foad Namjoo, Remy Ogasawara, Amirali Abdullah, Jeff M. Phillips
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.03206
Pdf URL: https://arxiv.org/pdf/2603.03206
Copy Paste: [[2603.03206]] Understanding and Mitigating Dataset Corruption in LLM Steering(https://arxiv.org/abs/2603.03206)
Keywords: generative
Abstract: Contrastive steering has been shown to be a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
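Illustrative sketch (not from the paper): the abstract describes learning a steering direction from examples with and without a trait, with a mean computation as the key step, and swapping that step for a robust estimator. The snippet below shows the idea on toy 2-D activations. The coordinate-wise median is used purely as a simple stand-in; it is not the robust mean estimator the authors evaluate. All values and function names are hypothetical.

```python
from statistics import mean, median

def column_stat(rows, stat):
    """Apply `stat` to each coordinate across equal-length vectors."""
    return [stat(col) for col in zip(*rows)]

def steering_direction(with_trait, without_trait, stat=mean):
    """Difference between the per-coordinate centers of the two activation sets."""
    pos = column_stat(with_trait, stat)
    neg = column_stat(without_trait, stat)
    return [p - n for p, n in zip(pos, neg)]

def steer(activation, direction, alpha):
    """Shift an activation along the steering direction with strength alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical values).
with_trait = [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1], [1.0, 0.0], [1.0, 0.0]]
without_trait = [[0.0, 0.0], [0.1, 0.1], [-0.1, -0.1], [0.0, 0.0], [0.0, 0.0]]

clean = steering_direction(with_trait, without_trait)

# Replace one positive example with an outlier to mimic dataset corruption.
corrupted = with_trait[:-1] + [[1.0, 50.0]]
naive = steering_direction(corrupted, without_trait, stat=mean)
robust = steering_direction(corrupted, without_trait, stat=median)
```

With the plain mean, the single outlier drags the second coordinate of the direction far from its clean value, whereas the median-based direction stays close to the clean one. This mirrors, in a very small setting, the kind of effect the abstract attributes to robust mean estimation.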

Title: Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning

Authors: Adam Watts (1), Andrew Jeon (1), Destry Newton (1), Ryan Bowering (2) ((1) Los Alamos National Laboratory, (2) University of Rochester)
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2603.03229
Pdf URL: https://arxiv.org/pdf/2603.03229
Copy Paste: [[2603.03229]] Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning(https://arxiv.org/abs/2603.03229)
Keywords: generative
Abstract: The shock response spectrum (SRS) is widely used to characterize the response of single-degree-of-freedom (SDOF) systems to transient accelerations. Because the mapping from acceleration time history to SRS is nonlinear and many-to-one, reconstructing time-domain signals from a target spectrum is inherently ill-posed. Conventional approaches address this problem through iterative optimization, typically representing signals as sums of exponentially decayed sinusoids, but these methods are computationally expensive and constrained by predefined basis functions. We propose a conditional variational autoencoder (CVAE) that learns a data-driven inverse mapping from SRS to acceleration time series. Once trained, the model generates signals consistent with prescribed target spectra without requiring iterative optimization. Experiments demonstrate improved spectral fidelity relative to classical techniques, strong generalization to unseen spectra, and inference speeds three to six orders of magnitude faster. These results establish deep generative modeling as a scalable and efficient approach for inverse SRS reconstruction.

Title: On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics

Authors: Mikhail Osipov
Subjects: cs.LG, math.NA, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2603.03238
Pdf URL: https://arxiv.org/pdf/2603.03238
Copy Paste: [[2603.03238]] On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics(https://arxiv.org/abs/2603.03238)
Keywords: diffusion
Abstract: We investigate geometric regularization strategies for learned latent representations in encoder--decoder reduced-order models. In a fixed experimental setting for the advection--diffusion--reaction (ADR) equation, we model latent dynamics using a neural ODE and evaluate four regularization approaches applied during autoencoder pre-training: (a) near-isometry regularization of the decoder Jacobian, (b) a stochastic decoder gain penalty based on random directional gains, (c) a second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Across multiple seeds, we find that (a)--(c) often produce latent representations that make subsequent latent-dynamics training with a frozen autoencoder more difficult, especially for long-horizon rollouts, even when they improve local decoder smoothness or related sensitivity proxies. In contrast, (d) consistently improves conditioning-related diagnostics of the learned latent dynamics and tends to yield better rollout performance. We discuss the hypothesis that, in this setting, the downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness.

Title: COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Authors: Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03239
Pdf URL: https://arxiv.org/pdf/2603.03239
Copy Paste: [[2603.03239]] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design(https://arxiv.org/abs/2603.03239)
Keywords: diffusion, generative
Abstract: Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: this http URL

Title: UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03241
Pdf URL: https://arxiv.org/pdf/2603.03241
Copy Paste: [[2603.03241]] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?(https://arxiv.org/abs/2603.03241)
Keywords: generative
Abstract: Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

Title: Using Learning Progressions to Guide AI Feedback for Science Learning

Authors: Xin Xia (1), Nejla Yuruk (2), Yun Wang (1), Xiaoming Zhai (1) ((1) University of Georgia, (2) Gazi University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03249
Pdf URL: https://arxiv.org/pdf/2603.03249
Copy Paste: [[2603.03249]] Using Learning Progressions to Guide AI Feedback for Science Learning(https://arxiv.org/abs/2603.03249)
Keywords: generative
Abstract: Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa for estimable dimensions ranging from .66 to .88. Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
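Illustrative sketch (not from the paper): the abstract reports percent agreement and Cohen's kappa between two coders, and notes that kappa was only reported for "estimable" dimensions. The snippet below computes both statistics with the standard definitions. It also shows one way kappa can fail to be estimable: when both coders use a single category throughout, chance agreement is 1 and the ratio is undefined. The ratings are made-up values, not study data.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same code."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary codes from two coders on ten feedback items.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

In this toy case the coders agree on 9 of 10 items, chance agreement is 0.5, and kappa works out to 0.8. That gap between agreement and kappa is why high percent agreement can coexist with a noticeably lower kappa.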

Title: DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction

Authors: Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03265
Pdf URL: https://arxiv.org/pdf/2603.03265
Copy Paste: [[2603.03265]] DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction(https://arxiv.org/abs/2603.03265)
Keywords: diffusion, generative
Abstract: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: this https URL

Title: LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03269
Pdf URL: https://arxiv.org/pdf/2603.03269
Copy Paste: [[2603.03269]] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory(https://arxiv.org/abs/2603.03269)
Keywords: foundation model
Abstract: Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.

Title: Beyond Language Modeling: An Exploration of Multimodal Pretraining

Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03276
Pdf URL: https://arxiv.org/pdf/2603.03276
Copy Paste: [[2603.03276]] Beyond Language Modeling: An Exploration of Multimodal Pretraining(https://arxiv.org/abs/2603.03276)
Keywords: diffusion, foundation model
Abstract: The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.

Title: CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Authors: Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03281
Pdf URL: https://arxiv.org/pdf/2603.03281
Copy Paste: [[2603.03281]] CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance(https://arxiv.org/abs/2603.03281)
Keywords: diffusion, generative
Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: this https URL
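Illustrative sketch (not from the paper): the abstract reads vanilla CFG as a proportional controller in which the conditional-unconditional discrepancy is the error signal and the guidance scale is a fixed gain. The snippet below writes the commonly used CFG combination in that form. The paper's exact parameterization may differ, and the sliding mode variant (SMC-CFG) is not reproduced here. Function names and the toy velocities are assumptions.

```python
def cfg_velocity(v_uncond, v_cond, gain):
    """Vanilla CFG as P-control: unconditional velocity plus gain times the error."""
    error = [c - u for c, u in zip(v_cond, v_uncond)]
    return [u + gain * e for u, e in zip(v_uncond, error)]

def euler_step(x, velocity, dt):
    """One explicit Euler step of the first-order flow dx/dt = velocity."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]

# Toy 2-D velocities (hypothetical values).
v_uncond = [0.0, 1.0]
v_cond = [1.0, 1.0]

unguided = cfg_velocity(v_uncond, v_cond, gain=0.0)     # equals v_uncond
conditional = cfg_velocity(v_uncond, v_cond, gain=1.0)  # equals v_cond
amplified = cfg_velocity(v_uncond, v_cond, gain=3.0)    # extrapolates past v_cond

x_next = euler_step([0.0, 0.0], amplified, dt=0.5)
```

A gain above 1 extrapolates beyond the conditional prediction along the error direction. That linear amplification is the behavior the abstract associates with overshooting at large guidance scales.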

Title: MIBURI: Towards Expressive Interactive Gesture Synthesis

Authors: M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Subjects: cs.CV, cs.GR, cs.HC
Abstract URL: https://arxiv.org/abs/2603.03282
Pdf URL: https://arxiv.org/pdf/2603.03282
Copy Paste: [[2603.03282]] MIBURI: Towards Expressive Interactive Gesture Synthesis(https://arxiv.org/abs/2603.03282)
Keywords: generative
Abstract: Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures when compared against recent baselines. We urge the reader to explore demo videos on this https URL.

Title: Utonia: Toward One Encoder for All Point Clouds

Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.03283
Pdf URL: https://arxiv.org/pdf/2603.03283
Copy Paste: [[2603.03283]] Utonia: Toward One Encoder for All Point Clouds(https://arxiv.org/abs/2603.03283)
Keywords: self-supervised, foundation model
Abstract: We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.