2026-03-03

Title: StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser

Authors: Jintao Zhang, Zirui Liu, Mingyue Cheng, Xianquan Wang, Zhiding Liu, Qi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00037
Pdf URL: https://arxiv.org/pdf/2603.00037
Copy Paste: [[2603.00037]] StaTS: Spectral Trajectory Schedule Learning for Adaptive Time Series Forecasting with Frequency Guided Denoiser(https://arxiv.org/abs/2603.00037)
Keywords: diffusion
Abstract: Diffusion models have been used for probabilistic time series forecasting and show strong potential. However, fixed noise schedules often produce intermediate states that are hard to invert and a terminal state that deviates from the near-noise assumption. Meanwhile, prior methods rely on time-domain conditioning and seldom model schedule-induced spectral degradation, which limits structure recovery across noise levels. We propose StaTS, a diffusion model for probabilistic time series forecasting that learns the noise schedule and the denoiser through alternating updates. StaTS includes a Spectral Trajectory Scheduler (STS) that learns a data-adaptive noise schedule with spectral regularization to improve structural preservation and stepwise invertibility, and a Frequency Guided Denoiser (FGD) that estimates schedule-induced spectral distortion and uses it to modulate denoising strength for heterogeneous restoration across diffusion steps and variables. A two-stage training procedure stabilizes the coupling between schedule learning and denoiser optimization. Experiments on multiple real-world benchmarks show consistent gains while maintaining strong performance with fewer sampling steps. Our code is available at this https URL.
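
Illustration (not from the paper): the abstract's point about a terminal state that "deviates from the near-noise assumption" can be seen with a fixed schedule. The sketch below assumes a DDPM-style linear beta schedule with hypothetical endpoints `beta_start=1e-4` and `beta_end=2e-2`; it does not implement StaTS's learned schedule. It computes the terminal signal-to-noise ratio for a short and a long chain.

```python
def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Fixed linear schedule: beta_t evenly spaced from beta_start to beta_end.

    The endpoint values are illustrative assumptions, not taken from the paper.
    """
    if num_steps == 1:
        return [beta_end]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): fraction of signal variance kept at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def terminal_snr(num_steps):
    """Signal-to-noise ratio alpha_bar_T / (1 - alpha_bar_T) at the final step."""
    alpha_bar_final = cumulative_alpha_bar(linear_beta_schedule(num_steps))[-1]
    return alpha_bar_final / (1.0 - alpha_bar_final)


if __name__ == "__main__":
    for num_steps in (50, 1000):
        print(num_steps, terminal_snr(num_steps))
```

With these assumed endpoints, the 50-step chain ends with more signal than noise, while the 1000-step chain ends close to pure noise. This is the kind of mismatch a learned, data-adaptive schedule is meant to avoid.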

Title: Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Authors: Peiyuan Zhang, Matthew Noto, Wenxuan Tan, Chengquan Jiang, Will Lin, Wei Zhou, Hao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00040
Pdf URL: https://arxiv.org/pdf/2603.00040
Copy Paste: [[2603.00040]] Attn-QAT: 4-Bit Attention With Quantization-Aware Training(https://arxiv.org/abs/2603.00040)
Keywords: diffusion
Abstract: Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware training (QAT) for attention. We find that "drop-in" QAT, which naively combines an FP4 forward pass with a high-precision Flash Attention (FA)-style backward pass, leads to training instability. We identify two key principles for stable FP4 attention: (1) matching low-precision recomputation of attention scores in the backward pass, and (2) resolving implicit precision assumptions in FA's gradient calculation. Based on these insights, we propose Attn-QAT and implement fused Triton kernels for training as well as FP4 inference kernels. Across diffusion and language models, Attn-QAT recovers the quality drop from FP4 attention without explicit outlier-mitigation heuristics used in prior FP4 attention, and delivers up to a 1.5x speedup on an RTX 5090. Video demos can be found at this https URL.
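
Illustration (not from the paper): a minimal sketch of why FP4's small dynamic range clashes with heavy-tailed activations. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single per-tensor scale chosen from the largest magnitude. Both are simplifying assumptions; the function name `fake_quantize_fp4` is hypothetical and this is not the Attn-QAT kernel.

```python
import math

# Representable magnitudes of an E2M1 4-bit float (sign handled separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(values):
    """Quantize-dequantize a list of floats onto a scaled E2M1 grid.

    Assumption: one scale per tensor, set so the largest magnitude maps to 6.0.
    Ties between two grid points go to the smaller magnitude.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    out = []
    for v in values:
        magnitude = abs(v) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - magnitude))
        out.append(math.copysign(nearest * scale, v))
    return out


if __name__ == "__main__":
    print(fake_quantize_fp4([0.1, -0.2, 0.3]))        # well-behaved values
    print(fake_quantize_fp4([0.1, -0.2, 0.3, 12.0]))  # one outlier dominates the scale
```

In the second call the single outlier sets the scale, so every small value rounds to zero. This is the failure mode that outlier-mitigation heuristics target and that the abstract says Attn-QAT handles through training instead.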

Title: Breaking the Factorization Barrier in Diffusion Language Models

Authors: Ian Li, Zilei Shao, Benjie Wang, Rose Yu, Guy Van den Broeck, Anji Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00045
Pdf URL: https://arxiv.org/pdf/2603.00045
Copy Paste: [[2603.00045]] Breaking the Factorization Barrier in Diffusion Language Models(https://arxiv.org/abs/2603.00045)
Keywords: diffusion
Abstract: Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade-off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. Code available at: this https URL
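
Illustration (not from the paper): the "factorization barrier" can be reproduced on a toy two-token distribution. The sketch below uses a made-up joint distribution and hypothetical helper names. It compares the true joint with the product of its per-position marginals, which is what a fully factorized output layer can represent at best.

```python
from collections import defaultdict
from itertools import product

# Toy joint distribution over two token positions (illustrative data only).
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint_dist):
    """Per-position marginal distributions of a joint over token tuples."""
    num_positions = len(next(iter(joint_dist)))
    margs = [defaultdict(float) for _ in range(num_positions)]
    for tokens, prob in joint_dist.items():
        for position, token in enumerate(tokens):
            margs[position][token] += prob
    return margs


def factorized(joint_dist):
    """Product of marginals: the distribution obtained when positions are sampled independently."""
    margs = marginals(joint_dist)
    result = {}
    for combo in product(*(m.items() for m in margs)):
        tokens = tuple(token for token, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        result[tokens] = prob
    return result


if __name__ == "__main__":
    for tokens, prob in sorted(factorized(joint).items()):
        print(tokens, prob, "true:", joint.get(tokens, 0.0))
```

Independent sampling puts a quarter of the mass on each incoherent pair, such as ("New", "Angeles"), which the true joint never produces. This is the incoherence the abstract attributes to factorization.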

Title: BiJEPA: Bi-directional Joint Embedding Predictive Architecture for Symmetric Representation Learning

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00049
Pdf URL: https://arxiv.org/pdf/2603.00049
Copy Paste: [[2603.00049]] BiJEPA: Bi-directional Joint Embedding Predictive Architecture for Symmetric Representation Learning(https://arxiv.org/abs/2603.00049)
Keywords: self-supervised
Abstract: Self-Supervised Learning (SSL) has shifted from pixel-level reconstruction to latent space prediction, spearheaded by the Joint Embedding Predictive Architecture (JEPA). While effective, standard JEPA models typically rely on a uni-directional prediction mechanism (e.g., Context → Target), potentially neglecting the informative signal in the inverse relationship and degrading performance. In this work, we propose BiJEPA, a Bi-Directional Joint Embedding Predictive Architecture that enforces cycle-consistent predictability between data segments. We address the inherent instability of symmetric prediction (representation explosion) by introducing a critical norm regularization mechanism on the representation vectors. We evaluate BiJEPA on three distinct modalities: synthetic periodic signals, chaotic Lorenz attractor trajectories, and high-dimensional image data (MNIST). Our results demonstrate that BiJEPA achieves stable convergence without collapse, captures the semantic structure of chaotic systems, and learns robust temporal and spatial representations capable of generation and generalisation, offering a more holistic approach to representation learning.

Title: Knowledge-guided generative surrogate modeling for high-dimensional design optimization under scarce data

Authors: Bingran Wang, Seongha Jeong, Sebastiaan P. C. van Schie, Dongyeon Han, Jaeho Min, John T. Hwang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00052
Pdf URL: https://arxiv.org/pdf/2603.00052
Copy Paste: [[2603.00052]] Knowledge-guided generative surrogate modeling for high-dimensional design optimization under scarce data(https://arxiv.org/abs/2603.00052)
Keywords: generative
Abstract: Surrogate models are widely used in mechanical design and manufacturing process optimization, where high-fidelity computational models may be unavailable or prohibitively expensive. Their effectiveness, however, is often limited by data scarcity, as purely data-driven surrogates struggle to achieve high predictive accuracy in such situations. Subject matter experts (SMEs) frequently possess valuable domain knowledge about functional relationships, yet few surrogate modeling techniques can systematically integrate this information with limited data. We address this challenge with RBF-Gen, a knowledge-guided surrogate modeling framework that combines scarce data with domain knowledge. This method constructs a radial basis function (RBF) space with more centers than training samples and leverages the null space via a generator network, inspired by the principle of maximum information preservation. The introduced latent variables provide a principled mechanism to encode structural relationships and distributional priors during training, thereby guiding the surrogate toward physically meaningful solutions. Numerical studies demonstrate that RBF-Gen significantly outperforms standard RBF surrogates on 1D and 2D structural optimization problems in data-scarce settings, and achieves superior predictive accuracy on a real-world semiconductor manufacturing dataset. These results highlight the potential of combining limited experimental data with domain expertise to enable accurate and practical surrogate modeling in mechanical and process design problems.

Title: M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection

Authors: Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00055
Pdf URL: https://arxiv.org/pdf/2603.00055
Copy Paste: [[2603.00055]] M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection(https://arxiv.org/abs/2603.00055)
Keywords: anomaly
Abstract: Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero-shot paradigm, they still tend to produce high-confidence yet unreliable decisions in fine-grained and structurally complex industrial scenarios, and lack effective self-corrective mechanisms. To address this issue, we propose M3-AD, a unified reflection-aware multimodal framework for industrial anomaly detection. M3-AD comprises two complementary data resources: M3-AD-FT, designed for reflection-aligned fine-tuning, and M3-AD-Bench, designed for systematic cross-category evaluation, together providing a foundation for reflection-aware learning and reliability assessment. Building upon this foundation, we propose RA-Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self-correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3-AD-Bench demonstrate that RA-Monitor outperforms multiple open-source and commercial MLLMs in zero-shot anomaly detection and anomaly analysis tasks. Code will be released at this https URL.

Title: VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation

Authors: Takumi Hachimine, Yuhwan Kwon, Cheng-Yu Kuo, Tomoya Yamanokuchi, Takamitsu Matsubara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00116
Pdf URL: https://arxiv.org/pdf/2603.00116
Copy Paste: [[2603.00116]] VoxelDiffusionCut: Non-destructive Internal-part Extraction via Iterative Cutting and Structure Estimation(https://arxiv.org/abs/2603.00116)
Keywords: diffusion, generative
Abstract: Non-destructive extraction of target internal parts, such as batteries and motors, by cutting surrounding structures is crucial at recycling and disposal sites. However, the diversity of products and the lack of information on disassembly procedures make it challenging to decide where to cut. This study explores a method for non-destructive extraction of a target internal part that iteratively estimates the internal structure from observed cutting surfaces and formulates cutting plans based on the estimation results. A key requirement is to estimate the probability of the target part's presence from partial observations. However, learning conditional generative models for this task is challenging: the high dimensionality of 3D shape representations makes learning difficult, and conventional models (e.g., conditional variational autoencoders) often fail to capture multi-modal predictive uncertainty due to mode collapse, resulting in overconfident predictions. To address these issues, we propose VoxelDiffusionCut, which iteratively estimates the internal structure represented as voxels using a diffusion model and plans cuts for non-destructive extraction of the target internal part based on the estimation results. Voxel representation allows the model to predict only attributes at fixed grid positions, i.e., types of constituent parts, making learning more tractable. The diffusion model completes the voxel representation conditioned on observed cutting surfaces, capturing uncertainty in unobserved regions to avoid erroneous cuts. Experimental results in simulation suggest that the proposed method can estimate internal structures from observed cutting surfaces and enable non-destructive extraction of the target internal part by leveraging the estimated uncertainty.

Title: NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence

Authors: Aman Ulla
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.00122
Pdf URL: https://arxiv.org/pdf/2603.00122
Copy Paste: [[2603.00122]] NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence(https://arxiv.org/abs/2603.00122)
Keywords: generative
Abstract: Document extraction is an important step before retrieval-augmented generation (RAG), knowledge bases, and downstream generative AI can work. It turns unstructured documents such as PDFs and scans into structured text and layout-aware representations. We introduce NovaLAD, a comprehensive document parsing system that integrates two concurrent YOLO object detection models, one for element detection and one for layout detection, with rule-based grouping and optional vision-language enhancement. An incoming page image is first passed through both models in parallel. The element model finds semantic content such as titles, headers, text, tables, and images, and the layout model finds structural regions such as layout_box, column_group, multi_column, and row_group. A key design decision is to first send each image or figure through an image classifier (ViT) that decides whether it is relevant. Only relevant images are then submitted to the Vision LLM for title, summary, and structured information, which cuts down on noise and cost. NovaLAD is built for speed: it runs on CPU, employs parallel execution for detection, classification, OCR, and conversion, and generates several output formats, including structured JSON, Markdown, RAG-ready text, and knowledge graphs. We test on the DP-Bench benchmark (upstage/dp-bench) and obtain 96.49% TEDS and 98.51% NID, which is better than both commercial and open-source parsers. This paper explains the extraction process, the architecture, the data flow, and how NovaLAD is made both accurate and usable without a GPU.

Title: You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

Authors: Kairan Zhao, Eleni Triantafillou, Peter Triantafillou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00133
Pdf URL: https://arxiv.org/pdf/2603.00133
Copy Paste: [[2603.00133]] You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.00133)
Keywords: diffusion, generative
Abstract: Generative models have been shown to "memorize" certain training data, producing verbatim or near-verbatim copies of training images, which may raise privacy concerns or constitute copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for cross-attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.

Title: Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion

Authors: Sathwik Karnik, Juyeop Kim, Sanmi Koyejo, Jong-Seok Lee, Somil Bansal
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00140
Pdf URL: https://arxiv.org/pdf/2603.00140
Copy Paste: [[2603.00140]] Steering Away from Memorization: Reachability-Constrained Reinforcement Learning for Text-to-Image Diffusion(https://arxiv.org/abs/2603.00140)
Keywords: diffusion
Abstract: Text-to-image diffusion models often memorize training data, revealing a fundamental failure to generalize beyond the training set. Current mitigation strategies typically sacrifice image quality or prompt alignment to reduce memorization. To address this, we propose Reachability-Aware Diffusion Steering (RADS), an inference-time framework that prevents memorization while preserving generation fidelity. RADS models the diffusion denoising process as a dynamical system and applies concepts from reachability analysis to approximate the "backward reachable tube"--the set of intermediate states that inevitably evolve into memorized samples. We then formulate mitigation as a constrained reinforcement learning (RL) problem, where a policy learns to steer the trajectory away from memorization via minimal perturbations in the caption embedding space. Empirical evaluations show that RADS achieves a superior Pareto frontier between generation diversity (SSCD), quality (FID), and alignment (CLIP) compared to state-of-the-art baselines. Crucially, RADS provides robust mitigation without modifying the diffusion backbone, offering a plug-and-play solution for safe generation. Our website is available at: this https URL.

Title: GrapHist: Graph Self-Supervised Learning for Histopathology

Authors: Sevda Öğüt, Cédric Vincent-Cuaz, Natalia Dubljevic, Carlos Hurtado, Vaishnavi Subramanian, Pascal Frossard, Dorina Thanou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00143
Pdf URL: https://arxiv.org/pdf/2603.00143
Copy Paste: [[2603.00143]] GrapHist: Graph Self-Supervised Learning for Histopathology(https://arxiv.org/abs/2603.00143)
Keywords: self-supervised
Abstract: Self-supervised vision models have achieved notable success in digital pathology. However, their domain-agnostic transformer architectures are not originally designed to account for fundamental biological elements of histopathology images, namely cells and their complex interactions. In this work, we hypothesize that biologically-informed modeling of tissues as cell graphs offers more efficient representation learning. Thus, we introduce GrapHist, a novel graph-based self-supervised learning framework for histopathology, which learns generalizable and structurally-informed embeddings that enable diverse downstream tasks. GrapHist integrates masked autoencoders and heterophilic graph neural networks that are explicitly designed to capture the heterogeneity of tumor microenvironments. We pre-train GrapHist on a large collection of 11 million cell graphs derived from breast tissues and evaluate its transferability across in- and out-of-domain benchmarks. Our results show that GrapHist achieves competitive performance compared to its vision-based counterparts in slide-, region-, and cell-level tasks, while requiring four times fewer parameters. It also drastically outperforms fully-supervised graph models on cancer subtyping tasks. Finally, we also release five graph-based digital pathology datasets used in our study at this https URL, establishing the first large-scale graph benchmark in this field. Our code is available at this https URL.

Title: Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Authors: Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00144
Pdf URL: https://arxiv.org/pdf/2603.00144
Copy Paste: [[2603.00144]] Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation(https://arxiv.org/abs/2603.00144)
Keywords: diffusion
Abstract: Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.

Title: Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction

Authors: Zhihao Li, Shengwei Dong, Chuang Yi, Junxuan Gao, Zhilu Lai, Zhiqiang Liu, Wei Wang, Guangtao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00149
Pdf URL: https://arxiv.org/pdf/2603.00149
Copy Paste: [[2603.00149]] Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction(https://arxiv.org/abs/2603.00149)
Keywords: diffusion
Abstract: Existing image super-resolution (SR) and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid SR with ReMD (Residual-Multigrid Diffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a multigrid residual correction: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a multi-wavelet basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency inside the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR. Our code is available at this https URL.

Title: Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!

Authors: Zihang Zou, Boqing Gong, Liqiang Wang
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2603.00150
Pdf URL: https://arxiv.org/pdf/2603.00150
Copy Paste: [[2603.00150]] Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!(https://arxiv.org/abs/2603.00150)
Keywords: diffusion
Abstract: In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. We demonstrate how modern neural models (e.g., diffusion models) can replicate copyrighted images, even when protected by advanced watermarking techniques. To expose vulnerabilities in copyright protection and facilitate future research, we propose a general approach to neural plagiarism that can either forge replicas of copyrighted data or introduce copyright ambiguity. Our method, based on "anchors and shims", employs inverse latents as anchors and finds shim perturbations that gradually deviate the anchor latents, thereby evading watermark or copyright detection. By applying perturbations to the cross-attention mechanism at different timesteps, our approach induces varying degrees of semantic modification in copyrighted images, enabling it to bypass protections ranging from visible trademarks and signatures to invisible watermarks. Notably, our method is a purely gradient-based search that requires no additional training or fine-tuning. Experiments on MS-COCO and real-world copyrighted images show that diffusion models can replicate copyrighted images, underscoring the urgent need for countermeasures against neural plagiarism.

Title: DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops

Authors: Boyang Deng, Yuzhen Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00160
Pdf URL: https://arxiv.org/pdf/2603.00160
Copy Paste: [[2603.00160]] DINOv3 Meets YOLO26 for Weed Detection in Vegetable Crops(https://arxiv.org/abs/2603.00160)
Keywords: self-supervised
Abstract: Developing robust models for precision vegetable weeding is currently constrained by the scarcity of large-scale, annotated weed-crop datasets. To address this limitation, this study proposes a foundational crop-weed detection model by integrating heterogeneous datasets and leveraging self-supervised learning. A total of 618,642 crop-weed images were initially collected and subsequently refined to 199,388 filtered images for fine-tuning a DINOv3 vision transformer (ViT-small) through a sequential curation strategy. The fine-tuned DINOv3 backbone was then integrated into YOLO26, serving either as a primary backbone or part of a dual-backbone architecture. A feature alignment loss was introduced in the dual backbone framework to enhance feature fusion with minimal computational overhead. Experimental results show that the proposed DINOv3-finetuned ViT-small-based YOLO26-large achieved up to a +5.4% mAP50 gain on in-domain images collected in the 2025 season. Moreover, it demonstrated strong cross-domain generalization with mAP50 improvements of +14.0% on the 2021-2023 season dataset and +11.9% on the 2024 season dataset, compared to the standard YOLO26-large. Although the DINOv3-YOLO26-large model has 45.6% more parameters and a 2.9x increase in inference latency, it maintains real-time performance at ~28.5 frames per second (fps). The curated dataset and software programs developed in this study will be made publicly available.

Title: Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Authors: Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00166
Pdf URL: https://arxiv.org/pdf/2603.00166
Copy Paste: [[2603.00166]] Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?(https://arxiv.org/abs/2603.00166)
Keywords: generative
Abstract: Recent advances in generative AI have demonstrated remarkable ability to produce high-quality content. However, these models often exhibit a "Paradox of Simplicity": while they can render intricate landscapes, they often fail at simple, deterministic tasks. To address this, we formalize Obedience as the ability to align with instructions and establish a hierarchical grading system ranging from basic semantic alignment to pixel-level systemic precision, which provides a unified paradigm for incorporating and categorizing existing literature. Then, we conduct case studies to identify common obedience gaps, revealing how generative priors often override logical constraints. To evaluate high-level obedience, we present VIOLIN (VIsual Obedience Level-4 EvaluatIoN), the first benchmark focused on pure color generation across six variants. Extensive experiments on SOTA models reveal fundamental obedience limitations and further exploratory insights. By establishing this framework, we aim to draw more attention to AI Obedience and encourage deeper exploration to bridge this gap.

Title: Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

Authors: Simo Ryu, Chunghwan Han
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00173
Pdf URL: https://arxiv.org/pdf/2603.00173
Copy Paste: [[2603.00173]] Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model(https://arxiv.org/abs/2603.00173)
Keywords: foundation model
Abstract: We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, μP parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and μP hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.

Title: Infinite Self-Attention

Authors: Giorgio Roffo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00175
Pdf URL: https://arxiv.org/pdf/2603.00175
Copy Paste: [[2603.00175]] Infinite Self-Attention(https://arxiv.org/abs/2603.00175)
Keywords: diffusion
Abstract: The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to per-head dimension dh (independent of sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096 by 4096 and inference at 9216 by 9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224 by 224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2 point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13x better throughput and energy than equal-depth ViT) and is the only tested model to complete 9216 by 9216 inference without out-of-memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine 0.985).
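
Illustration (not from the paper): the discounted Neumann series mentioned in the abstract can be checked numerically on a tiny example. The sketch below assumes a hand-written 3-token row-stochastic attention matrix and a discount `gamma = 0.5`; all names and values are hypothetical. It sums the series term by term and verifies that the result matches the inverse of `I - gamma * A`, which has the form of the fundamental matrix of an absorbing Markov chain.

```python
def matmul(a, b):
    """Plain dense matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(attn, gamma, num_terms):
    """Return sum_{k=0}^{num_terms-1} (gamma * attn)^k, accumulating multi-hop terms."""
    n = len(attn)
    scaled = [[gamma * x for x in row] for row in attn]
    total = identity(n)
    power = identity(n)
    for _ in range(num_terms - 1):
        power = matmul(power, scaled)  # next hop: (gamma * attn)^k
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


# Toy row-stochastic attention matrix over three tokens (illustrative values).
attn = [[0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.1, 0.8]]
gamma = 0.5

series = neumann_sum(attn, gamma, 60)

# If series == (I - gamma * attn)^{-1}, this product is the identity matrix.
i_minus_ga = [[(1.0 if i == j else 0.0) - gamma * attn[i][j] for j in range(3)]
              for i in range(3)]
check = matmul(i_minus_ga, series)

# Column sums: discounted visits to each token, summed over all start tokens.
discounted_visits = [sum(series[i][j] for i in range(3)) for j in range(3)]

if __name__ == "__main__":
    print(check)
    print(discounted_visits)
```

Reading the column sums as a centrality score is one possible interpretation under these assumptions; the paper's exact weighting may differ.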

Title: NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces

Authors: Jiwoo Kim, Swarajh Mehta, Hao-Lun Hsu, Hyunwoo Ryu, Yudong Liu, Miroslav Pajic
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00180
Pdf URL: https://arxiv.org/pdf/2603.00180
Copy Paste: [[2603.00180]] NNiT: Width-Agnostic Neural Network Generation with Structurally Aligned Weight Spaces(https://arxiv.org/abs/2603.00180)
Keywords: diffusion, generative
Abstract: Generative modeling of neural network parameters is often tied to architectures because standard parameter representations rely on known weight-matrix dimensions. Generation is further complicated by permutation symmetries that allow networks to model similar input-output functions while having widely different, unaligned parameterizations. In this work, we introduce Neural Network Diffusion Transformers (NNiTs), which generate weights in a width-agnostic manner by tokenizing weight matrices into patches and modeling them as locally structured fields. We establish that Graph HyperNetworks (GHNs) with a convolutional neural network (CNN) decoder structurally align the weight space, creating the local correlation necessary for patch-based processing. Focusing on MLPs, where permutation symmetry is especially apparent, NNiT generates fully functional networks across a range of architectures. Our approach jointly models discrete architecture tokens and continuous weight patches within a single sequence model. On ManiSkill3 robotics tasks, NNiT achieves >85% success on architecture topologies unseen during training, while baseline approaches fail to generalize.

Title: Engineering FAIR Privacy-preserving Applications that Learn Histories of Disease

Authors: Ines N. Duarte, Praphulla M. S. Bhawsar, Lee K. Mason, Jeya Balaji Balasubramanian, Daniel E. Russ, Arlindo L. Oliveira, Jonas S. Almeida
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.00181
Pdf URL: https://arxiv.org/pdf/2603.00181
Copy Paste: [[2603.00181]] Engineering FAIR Privacy-preserving Applications that Learn Histories of Disease(https://arxiv.org/abs/2603.00181)
Keywords: generative
Abstract: A recent report on "Learning the natural history of human disease with generative transformers" created an opportunity to assess the engineering challenge of delivering user-facing Generative AI applications in privacy-sensitive domains. The application of these models, particularly for personalized healthcare tasks like predicting individual morbidity risk, is typically constrained by data privacy concerns. This project was accordingly designed as an in-browser model deployment exercise (an "App") testing the architectural boundaries of client-side inference generation (no downloads or installations). We relied exclusively on the documentation provided in the reference report to develop the model, specifically testing the "R" component of the FAIR data principles: Findability, Accessibility, Interoperability, and Reusability. The successful model deployment, leveraging ONNX and a custom JavaScript SDK, establishes a secure, high-performance architectural blueprint for the future of private generative AI in medicine.

Title: Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1

Authors: Abhinav Munagala
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00184
Pdf URL: https://arxiv.org/pdf/2603.00184
Copy Paste: [[2603.00184]] Zero-Shot and Supervised Bird Image Segmentation Using Foundation Models: A Dual-Pipeline Approach with Grounding DINO 1.5, YOLOv11, and SAM 2.1(https://arxiv.org/abs/2603.00184)
Keywords: foundation model
Abstract: Bird image segmentation remains a challenging task in computer vision due to extreme pose diversity, complex plumage patterns, and variable lighting conditions. This paper presents a dual-pipeline framework for binary bird image segmentation leveraging 2025 foundation models. We introduce two operating modes built upon Segment Anything Model 2.1 (SAM 2.1) as a shared frozen backbone: (1) a zero-shot pipeline using Grounding DINO 1.5 to detect birds via the text prompt "bird" before prompting SAM 2.1 with bounding boxes, requiring no labelled bird data; and (2) a supervised pipeline that fine-tunes YOLOv11 on the CUB-200-2011 dataset for high-precision detection, again prompting SAM 2.1 for pixel-level masks. The segmentation model is never retrained for new species or domains. On CUB-200-2011 (11,788 images, 200 species), the supervised pipeline achieves IoU 0.912, Dice 0.954, and F1 0.953, outperforming all prior baselines including SegFormer-B2 (IoU 0.842) by +7.0 percentage points. The zero-shot pipeline achieves IoU 0.831 using only a text prompt, the first such result reported on this benchmark. We demonstrate that prompt-based foundation model pipelines outperform task-specific end-to-end trained segmentation networks, while requiring only lightweight detector fine-tuning (~1 hour) for domain adaptation. Complete PyTorch implementation, dataset preparation scripts, and trained weights are publicly available.
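
For readers comparing the IoU and Dice figures above, the following minimal sketch shows how the two metrics are commonly computed for a pair of binary masks. The masks and the function name are made up for illustration, and the paper's exact evaluation protocol (for example, how scores are averaged over images) is not stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Compute IoU and Dice for two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    # Two empty masks are treated as a perfect match in this sketch.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_area + target_area) if (pred_area + target_area) else 1.0
    return iou, dice


# Hypothetical 8-pixel masks.
pred = [1, 1, 1, 0, 0, 1, 0, 0]
target = [1, 1, 0, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)

# For a single mask pair, Dice is a fixed function of IoU.
dice_from_iou = 2 * iou / (1 + iou)
```

Applying the same per-mask identity to the reported IoU of 0.912 gives 2 * 0.912 / 1.912, about 0.954, which matches the reported Dice. Metrics averaged over many images need not satisfy the identity exactly.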

Title: ThreatFormer-IDS: Robust Transformer Intrusion Detection with Zero-Day Generalization and Explainable Attribution

Authors: Srikumar Nayak
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00185
Pdf URL: https://arxiv.org/pdf/2603.00185
Copy Paste: [[2603.00185]] ThreatFormer-IDS: Robust Transformer Intrusion Detection with Zero-Day Generalization and Explainable Attribution(https://arxiv.org/abs/2603.00185)
Keywords: self-supervised
Abstract: Intrusion detection in IoT and industrial networks requires models that can detect rare attacks at low false-positive rates while remaining reliable under evolving traffic and limited labels. Existing IDS solutions often report strong in-distribution accuracy, but they may degrade when evaluated on future traffic, unseen (zero-day) attack families, or adversarial feature manipulations, and many systems provide limited evidence to support analyst triage. To address these gaps, we propose ThreatFormer-IDS, a Transformer-based sequence modeling framework that converts flow records into time-ordered windows and learns contextual representations for robust intrusion screening. The method combines (i) weighted supervised learning for imbalanced detection, (ii) masked self-supervised learning to improve representation stability under drift and sparse labels, (iii) PGD-based adversarial training with scale-normalized perturbations to strengthen resilience against feature-level evasion, and (iv) Integrated Gradients attribution to highlight influential time steps and features for each alert. On the ToN IoT benchmark with chronological evaluation, ThreatFormer-IDS achieves AUC-ROC 0.994, AUC-PR 0.956, and Recall@1%FPR 0.910, outperforming strong tree-based and sequence baselines. Under a zero-day protocol with held-out attack families, it maintains superior generalization (AUC-PR 0.721, Recall@1%FPR 0.783). Robustness tests further show slower degradation in AUC-PR as the adversarial budget increases, confirming improved stability under bounded perturbations. Overall, ThreatFormer-IDS provides a unified, deployment-oriented IDS pipeline that balances detection quality, zero-day behavior, robustness, and explainability.
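
The Recall@1%FPR metric reported above can be sketched as follows: choose the most permissive alert threshold that keeps the false-positive rate at or below 1%, then measure recall at that threshold. The scores, labels, and function name are hypothetical, and the paper may break ties or interpolate differently.

```python
def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Recall at the lowest threshold whose false-positive rate is <= max_fpr.

    An example raises an alert when its score is strictly above the threshold.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(negatives))  # negatives allowed to alert
    # Using the (allowed_fp + 1)-th highest negative score as the threshold
    # means at most allowed_fp negatives lie strictly above it.
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    return sum(1 for s in positives if s > threshold) / len(positives)


# Hypothetical data: 100 benign flows and 5 attack flows.
neg_scores = [i / 200 for i in range(100)]      # 0.0 .. 0.495
pos_scores = [0.9, 0.8, 0.493, 0.3, 0.1]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

recall = recall_at_fpr(scores, labels, max_fpr=0.01)
```

In this toy data one benign flow (score 0.495) is allowed to alert, so the threshold sits at 0.49 and three of the five attacks are recovered.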

Title: OSF: On Pre-training and Scaling of Sleep Foundation Models

Authors: Zitao Shuai, Zongzhe Xu, David Yang, Wei Wang, Yuzhe Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00190
Pdf URL: https://arxiv.org/pdf/2603.00190
Copy Paste: [[2603.00190]] OSF: On Pre-training and Scaling of Sleep Foundation Models(https://arxiv.org/abs/2603.00190)
Keywords: self-supervised, foundation model
Abstract: Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build general-purpose foundation models (FMs) for sleep physiology, but the field lacks an in-depth understanding of the pre-training process and scaling patterns that lead to more generalizable sleep FMs. To fill this gap, we curate a massive corpus of 166,500 hours of sleep recordings from nine public sources and establish SleepBench, a comprehensive, fully open-source benchmark. Leveraging SleepBench, we systematically evaluate four families of self-supervised pre-training objectives and uncover three critical findings: (1) existing FMs fail to generalize to missing channels at inference; (2) channel-invariant feature learning is essential for pre-training; and (3) scaling sample size, model capacity, and multi-source data mixture consistently improves downstream performance. With an enhanced pre-training and scaling recipe, we introduce OSF, a family of sleep FMs that achieves state-of-the-art performance across nine datasets on diverse sleep and disease prediction tasks. Further analysis of OSF also reveals intriguing properties in sample efficiency, hierarchical aggregation, and cross-dataset scaling.

Title: SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models

Authors: Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.00194
Pdf URL: https://arxiv.org/pdf/2603.00194
Copy Paste: [[2603.00194]] SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models(https://arxiv.org/abs/2603.00194)
Keywords: diffusion, generative
Abstract: The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation. This design transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.

Title: TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models

Authors: Daniel Nobrega Medeiros
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00206
Pdf URL: https://arxiv.org/pdf/2603.00206
Copy Paste: [[2603.00206]] TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models(https://arxiv.org/abs/2603.00206)
Keywords: generative
Abstract: Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: https://doi.org/10.57967/hf/7904).

Title: Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection

Authors: Brianna D'Urso, Tahmid Hasan Sakib, Syed Rafay Hasan, Terry N. Guo
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.00217
Pdf URL: https://arxiv.org/pdf/2603.00217
Copy Paste: [[2603.00217]] Physical Evaluation of Naturalistic Adversarial Patches for Camera-Based Traffic-Sign Detection(https://arxiv.org/abs/2603.00217)
Keywords: generative
Abstract: This paper studies how well Naturalistic Adversarial Patches (NAPs) transfer to a physical traffic sign setting when the detector is trained on a customized dataset for an autonomous vehicle (AV) environment. We construct a composite dataset, CompGTSRB, by pasting traffic sign instances from the German Traffic Sign Recognition Benchmark (GTSRB) onto undistorted backgrounds captured from the target platform. CompGTSRB is used to train a YOLOv5 model and generate patches using a Generative Adversarial Network (GAN) with latent space optimization, following existing NAP methods. We carried out a series of experiments on our Quanser QCar testbed utilizing the front CSI camera provided in QCar. Across configurations, which vary distance, patch size, and patch placement, NAPs reduce the detector's STOP class confidence. These results, along with a detailed step-by-step methodology, indicate the utility of the CompGTSRB dataset and the proposed systematic physical protocols for credible patch evaluation. The research further motivates work on defenses that address localized patch corruption in embedded perception pipelines.

Title: Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors

Authors: Xuanshuo Fu, Lei Kang, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00337
Pdf URL: https://arxiv.org/pdf/2603.00337
Copy Paste: [[2603.00337]] Diffusion-Based Low-Light Image Enhancement with Color and Luminance Priors(https://arxiv.org/abs/2603.00337)
Keywords: diffusion
Abstract: Low-light images often suffer from low contrast, noise, and color distortion, degrading visual quality and impairing downstream vision tasks. We propose a novel conditional diffusion framework for low-light image enhancement that incorporates a Structured Control Embedding Module (SCEM). SCEM decomposes a low-light image into four informative components: illumination, illumination-invariant features, shadow priors, and color-invariant cues. These components serve as control signals that condition a U-Net-based diffusion model trained with a simplified noise-prediction loss. Thus, the proposed SCEM-equipped diffusion method enforces structured enhancement guided by physical priors. In experiments, our model is trained only on the LOLv1 dataset and evaluated without fine-tuning on LOLv2-real, LSRW, DICM, MEF, and LIME. The method achieves state-of-the-art performance in quantitative and perceptual metrics, demonstrating strong generalization across benchmarks. this https URL.

Title: Distribution-Aware Companding Quantization of Large Language Models

Authors: Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00364
Pdf URL: https://arxiv.org/pdf/2603.00364
Copy Paste: [[2603.00364]] Distribution-Aware Companding Quantization of Large Language Models(https://arxiv.org/abs/2603.00364)
Keywords: generative
Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
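
The multi-token setup described in this abstract (n output heads, each predicting one of the next n tokens) can be sketched in a few lines. The token IDs, per-head probabilities, and function names are invented for illustration, and summing the per-head cross-entropies is an assumption about how the heads are combined, since the abstract does not give the exact loss.

```python
import math


def multi_token_targets(tokens, n):
    """For each position t, pair the current token with the next n tokens.

    Head i (0-based) is trained to predict tokens[t + 1 + i]. Positions
    without n future tokens are skipped in this sketch.
    """
    return [(tokens[t], tokens[t + 1:t + 1 + n])
            for t in range(len(tokens) - n)]


def multi_head_loss(head_probs, targets):
    """Sum of per-head cross-entropies (assumed combination rule).

    head_probs[i] maps a token ID to the probability head i assigns to it.
    """
    return sum(-math.log(head_probs[i][tok]) for i, tok in enumerate(targets))


# Hypothetical token sequence and n = 2 heads.
tokens = [5, 8, 2, 9, 4, 7]
pairs = multi_token_targets(tokens, n=2)

# Hypothetical head outputs at position 0, where the targets are [8, 2].
head_probs = [{8: 0.5, 2: 0.5}, {2: 0.25, 9: 0.75}]
loss = multi_head_loss(head_probs, pairs[0][1])
```

With n = 1 the targets reduce to ordinary next-token prediction, which is why the abstract frames the extra heads as an auxiliary task on a shared trunk.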

Title: DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography

Authors: Yujia Wu, Shuoqi Chen, Shiru Wang, Yucheng Tang, Petr Bruza, Geoffrey P. Luke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00382
Pdf URL: https://arxiv.org/pdf/2603.00382
Copy Paste: [[2603.00382]] DiffSOS: Acoustic Conditional Diffusion Model for Speed-of-Sound Reconstruction in Ultrasound Computed Tomography(https://arxiv.org/abs/2603.00382)
Keywords: diffusion, generative
Abstract: Accurate Speed-of-Sound (SoS) reconstruction from acoustic waveforms is a cornerstone of ultrasound computed tomography (USCT), enabling quantitative velocity mapping that reveals subtle anatomical details and pathological variations often invisible in conventional imaging. However, practical utility is hindered by the limitations of existing algorithms; traditional Full Waveform Inversion (FWI) is computationally intensive, while current deep learning approaches tend to produce oversmoothed results lacking fine details. We propose DiffSOS, a conditional diffusion model that directly maps acoustic waveforms to SoS maps. Our framework employs a specialized acoustic ControlNet to strictly ground the denoising process in physical wave measurements. To ensure structural consistency, we optimize a hybrid loss function that integrates noise prediction, spatial reconstruction, and noise frequency content. To accelerate inference, we employ stochastic Denoising Diffusion Implicit Model (DDIM) sampling, achieving near real-time reconstruction with only 10 steps. Crucially, we exploit the stochastic generative nature of our framework to estimate pixel-wise uncertainty, providing a measure of reliability that is often absent in deterministic approaches. Evaluated on the OpenPros USCT benchmark, DiffSOS significantly outperforms state-of-the-art networks, achieving an average Multi-scale Structural Similarity of 0.957. Our approach provides high-fidelity SoS maps with a principled measure of confidence, facilitating safer and faster clinical interpretation.
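
One common way to turn stochastic sampling into a pixel-wise uncertainty map is to draw several reconstructions for the same input and take the per-pixel spread. The sketch below assumes the standard deviation across samples as the uncertainty measure; the abstract does not specify the estimator, and the sample values (speeds in m/s) are made up.

```python
import statistics


def pixelwise_mean_std(samples):
    """Return per-pixel mean and standard deviation across sampled maps.

    samples: list of equally sized flat "images" (lists of floats), one per
    stochastic sampling run for the same input.
    """
    per_pixel = list(zip(*samples))  # group the values of each pixel together
    mean_map = [statistics.fmean(values) for values in per_pixel]
    std_map = [statistics.pstdev(values) for values in per_pixel]
    return mean_map, std_map


# Three hypothetical 3-pixel speed-of-sound reconstructions.
samples = [
    [1500.0, 1540.0, 1480.0],
    [1500.0, 1560.0, 1480.0],
    [1500.0, 1520.0, 1480.0],
]
mean_map, std_map = pixelwise_mean_std(samples)

# The pixel where the samples disagree most is the least reliable one.
most_uncertain = max(range(len(std_map)), key=lambda i: std_map[i])
```

Here only the middle pixel varies between runs, so it is the only one with non-zero uncertainty.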

Title: TENG-BC: Unified Time-Evolving Natural Gradient for Neural PDE Solvers with General Boundary Conditions

Authors: Hongjie Jiang, Di Luo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00397
Pdf URL: https://arxiv.org/pdf/2603.00397
Copy Paste: [[2603.00397]] TENG-BC: Unified Time-Evolving Natural Gradient for Neural PDE Solvers with General Boundary Conditions(https://arxiv.org/abs/2603.00397)
Keywords: diffusion
Abstract: Accurately solving time-dependent partial differential equations (PDEs) with neural networks remains challenging due to long-time error accumulation and the difficulty of enforcing general boundary conditions. We introduce TENG-BC, a high-precision neural PDE solver based on the Time-Evolving Natural Gradient, designed to perform under general boundary constraints. At each time step, TENG-BC performs a boundary-aware optimization that jointly enforces interior dynamics and boundary conditions, accommodating Dirichlet, Neumann, Robin, and mixed types within a unified framework. This formulation admits a natural-gradient interpretation, enabling stable time evolution without delicate penalty tuning. Across benchmarks over diffusion, transport, and nonlinear PDEs with various boundary conditions, TENG-BC achieves solver-level accuracy under comparable sampling budgets, outperforming conventional solvers and physics-informed neural network (PINN) baselines.

Title: Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models

Authors: Hulingxiao He, Zhi Tan, Yuxin Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00431
Pdf URL: https://arxiv.org/pdf/2603.00431
Copy Paste: [[2603.00431]] Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models(https://arxiv.org/abs/2603.00431)
Keywords: foundation model
Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at this https URL.

Title: TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis

Authors: Hui Wan, Libin Lan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00433
Pdf URL: https://arxiv.org/pdf/2603.00433
Copy Paste: [[2603.00433]] TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis(https://arxiv.org/abs/2603.00433)
Keywords: foundation model
Abstract: Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF placed fifth, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.

Title: Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling

Authors: Xueyang Li, Yunzhong Lou, Yu Song, Xiangdong Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00439
Pdf URL: https://arxiv.org/pdf/2603.00439
Copy Paste: [[2603.00439]] Mamba-CAD: State Space Model For 3D Computer-Aided Design Generative Modeling(https://arxiv.org/abs/2603.00439)
Keywords: self-supervised, generative
Abstract: Computer-Aided Design (CAD) generative modeling has strong, long-standing applications in industry. Recently, the parametric CAD sequence, as the design logic of an object, has been widely mined by sequence models. However, industrial CAD models, especially component objects, are fine-grained and complex, requiring a longer parametric CAD sequence to define. To address this problem, we introduce Mamba-CAD, a self-supervised generative modeling approach for complex industrial CAD models that can handle longer parametric CAD sequences. Specifically, we first design an encoder-decoder framework based on a Mamba architecture and pair it with a CAD reconstruction task for pre-training to model the latent representation of CAD models; we then use the learned representation to guide a generative adversarial network to produce fake representations of CAD models, which are finally recovered into parametric CAD sequences via the decoder of Mamba-CAD. To train Mamba-CAD, we further create a new dataset consisting of 77,078 CAD models with longer parametric CAD sequences. Comprehensive experiments are conducted to demonstrate the effectiveness of our model under various evaluation metrics, especially in the generation length of valid parametric CAD sequences. The code and dataset can be obtained from this https URL.

Title: SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment

Authors: Zhuoran Zhao, Xianghao Kong, Linlin Yang, Zheng Wei, Pan Hui, Anyi Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00443
Pdf URL: https://arxiv.org/pdf/2603.00443
Copy Paste: [[2603.00443]] SesaHand: Enhancing 3D Hand Reconstruction via Controllable Generation with Semantic and Structural Alignment(https://arxiv.org/abs/2603.00443)
Keywords: generative
Abstract: Recent studies on 3D hand reconstruction have demonstrated the effectiveness of synthetic training data to improve estimation performance. However, most methods rely on game engines to synthesize hand images, which often lack diversity in textures and environments, and fail to include crucial components like arms or interacting objects. Generative models are promising alternatives to generate diverse hand images, but still suffer from misalignment issues. In this paper, we present SesaHand, which enhances controllable hand image generation from both semantic and structural alignment perspectives for 3D hand reconstruction. Specifically, for semantic alignment, we propose a pipeline with Chain-of-Thought inference to extract human behavior semantics from image captions generated by the Vision-Language Model. This semantics suppresses human-irrelevant environmental details and ensures sufficient human-centric contexts for hand image generation. For structural alignment, we introduce hierarchical structural fusion to integrate structural information with different granularity for feature refinement to better align the hand and the overall human body in generated images. We further propose a hand structure attention enhancement method to efficiently enhance the model's attention on hand regions. Experiments demonstrate that our method not only outperforms prior work in generation performance but also improves 3D hand reconstruction with the generated hand images.

Title: Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

Authors: Xi Wang, Wenbo Lu, Shengjie Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00454
Pdf URL: https://arxiv.org/pdf/2603.00454
Copy Paste: [[2603.00454]] Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training(https://arxiv.org/abs/2603.00454)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

Title: Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution

Authors: Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00458
Pdf URL: https://arxiv.org/pdf/2603.00458
Copy Paste: [[2603.00458]] Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution(https://arxiv.org/abs/2603.00458)
Keywords: diffusion
Abstract: While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this through condensing generation into one single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher DOVE equipped with 3D spatio-temporal attentions, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone, augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces complexity by 95% in parameters and achieves an 8x acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.

Title: DreamWorld: Unified World Modeling in Video Generation

Authors: Boming Tan, Xiangdong Zhang, Ning Liao, Yuqing Zhang, Shaofeng Zhang, Xue Yang, Qi Fan, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00466
Pdf URL: https://arxiv.org/pdf/2603.00466
Copy Paste: [[2603.00466]] DreamWorld: Unified World Modeling in Video Generation(https://arxiv.org/abs/2603.00466)
Keywords: foundation model
Abstract: Despite impressive progress in video generation, existing models remain limited to surface-level plausibility, lacking a coherent and unified understanding of the world. Prior approaches typically incorporate only a single form of world-related knowledge or rely on rigid alignment strategies to introduce additional knowledge. However, aligning with a single form of world knowledge is insufficient to constitute a world model, which requires jointly modeling multiple heterogeneous dimensions (e.g., physical commonsense, 3D and temporal consistency). To address this limitation, we introduce DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, jointly predicting video pixels and features from foundation models to capture temporal dynamics, spatial geometry, and semantic consistency. However, naively optimizing these heterogeneous objectives can lead to visual instability and temporal flickering. To mitigate this issue, we propose Consistent Constraint Annealing (CCA) to progressively regulate world-level constraints during training, and Multi-Source Inner-Guidance to enforce learned world priors at inference. Extensive evaluations show that DreamWorld improves world consistency, outperforming Wan2.1 by 2.26 points on VBench. Code will be made publicly available on GitHub at this https URL.

Title: RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Authors: Liyao Jiang, Ruichen Chen, Chao Gao, Di Niu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00483
Pdf URL: https://arxiv.org/pdf/2603.00483
Copy Paste: [[2603.00483]] RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment(https://arxiv.org/abs/2603.00483)
Keywords: diffusion
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at this https URL.

Title: ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models

Authors: Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang, Yuxuan Zhang, Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Katarina Tothova, Zan Gojcic, Haithem Turki
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00492
Pdf URL: https://arxiv.org/pdf/2603.00492
Copy Paste: [[2603.00492]] ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models(https://arxiv.org/abs/2603.00492)
Keywords: diffusion, generative
Abstract: Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To solve these, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.

Title: COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

Authors: Yuchen Che, Jingtu Wu, Hao Zheng, Asako Kanezaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00493
Pdf URL: https://arxiv.org/pdf/2603.00493
Copy Paste: [[2603.00493]] COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation(https://arxiv.org/abs/2603.00493)
Keywords: foundation model
Abstract: Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, view-point changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse key-points. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as optimal transport marginals, suppressing non-overlapping regions. Semantic priors from vision foundation models further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the correspondence finding and pose estimation pipeline, enabling unsupervised learning. Experiments show unsupervised COG achieves comparable performance to supervised methods, and supervised COG outperforms them.

Title: Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Authors: Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00518
Pdf URL: https://arxiv.org/pdf/2603.00518
Copy Paste: [[2603.00518]] Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training(https://arxiv.org/abs/2603.00518)
Keywords: self-supervised
Abstract: Learning efficient and expressive visual representations has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address this challenge, we introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision and propose Vision-TTT, which compresses the visual token sequence in a novel self-supervised learning manner. By incorporating a bidirectional scan strategy and the Conv2d module, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.3%, 81.2%, and 82.5% Top-1 accuracy on ImageNet classification and also greatly outperform their counterparts on downstream tasks. At 1280x1280 resolution, Vittt-T reduces FLOPs by 79.4% and runs 4.38x faster with 88.9% less memory than DeiT-T. These results demonstrate the expressiveness and efficiency of Vision-TTT as a strong candidate for the next-generation generic visual backbone.

Title: Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness

Authors: Yuyang Chen, Linqian Zeng, Yijin Zhou, Hengjie Li, Jidong Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00519
Pdf URL: https://arxiv.org/pdf/2603.00519
Copy Paste: [[2603.00519]] Jano: Adaptive Diffusion Generation with Early-stage Convergence Awareness(https://arxiv.org/abs/2603.00519)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved remarkable success in generative AI, yet their computational efficiency remains a significant challenge, particularly for Diffusion Transformers (DiTs) requiring intensive full-attention computation. While existing acceleration approaches focus on content-agnostic uniform optimization strategies, we observe that different regions in generated content exhibit heterogeneous convergence patterns during the denoising process. We present Jano, a training-free framework that leverages this insight for efficient region-aware generation. Jano introduces an early-stage complexity recognition algorithm that accurately identifies regional convergence requirements within initial denoising steps, coupled with an adaptive token scheduling runtime that optimizes computational resource allocation. Through comprehensive evaluation on state-of-the-art models, Jano achieves substantial acceleration (average 2.0 times speedup, up to 2.4 times) while preserving generation quality. Our work challenges conventional uniform processing assumptions and provides a practical solution for accelerating large-scale content generation. The source code of our implementation is available at this https URL.

Title: Phys-Diff: A Physics-Inspired Latent Diffusion Model for Tropical Cyclone Forecasting

Authors: Lei Liu, Xiaoning Yu, Kang Chen, Jiahui Huang, Tengyuan Liu, Hongwei Zhao, Bin Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00521
Pdf URL: https://arxiv.org/pdf/2603.00521
Copy Paste: [[2603.00521]] Phys-Diff: A Physics-Inspired Latent Diffusion Model for Tropical Cyclone Forecasting(https://arxiv.org/abs/2603.00521)
Keywords: diffusion
Abstract: Tropical cyclone (TC) forecasting is critical for disaster warning and emergency response. Deep learning methods address computational challenges but often neglect physical relationships between TC attributes, resulting in predictions lacking physical consistency. To address this, we propose Phys-Diff, a physics-inspired latent diffusion model that disentangles latent features into task-specific components (trajectory, pressure, wind speed) and employs cross-task attention to introduce prior physics-inspired inductive biases, thereby embedding physically consistent dependencies among TC attributes. Phys-Diff integrates multimodal data including historical cyclone attributes, ERA5 reanalysis data, and FengWu forecast fields via a Transformer encoder-decoder architecture, further enhancing forecasting performance. Experiments demonstrate state-of-the-art performance on global and regional datasets.

Title: Bridge Matching Sampler: Scalable Sampling via Generalized Fixed-Point Diffusion Matching

Authors: Denis Blessing, Lorenz Richter, Julius Berner, Egor Malitskiy, Gerhard Neumann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00530
Pdf URL: https://arxiv.org/pdf/2603.00530
Copy Paste: [[2603.00530]] Bridge Matching Sampler: Scalable Sampling via Generalized Fixed-Point Diffusion Matching(https://arxiv.org/abs/2603.00530)
Keywords: diffusion
Abstract: Sampling from unnormalized densities using diffusion models has emerged as a powerful paradigm. However, while recent approaches that use least-squares 'matching' objectives have improved scalability, they often necessitate significant trade-offs, such as restricting prior distributions or relying on unstable optimization schemes. By generalizing these methods as special forms of fixed-point iterations rooted in Nelson's relation, we develop a new method that addresses these limitations, called Bridge Matching Sampler (BMS). Our approach enables learning a stochastic transport map between arbitrary prior and target distributions with a single, scalable, and stable objective. Furthermore, we introduce a damped variant of this iteration that incorporates a regularization term to mitigate mode collapse and further stabilize training. Empirically, we demonstrate that our method enables sampling at unprecedented scales while preserving mode diversity, achieving state-of-the-art results on complex synthetic densities and high-dimensional molecular benchmarks.

Title: Spectral Condition for $μ$P under Width-Depth Scaling

Authors: Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.00541
Pdf URL: https://arxiv.org/pdf/2603.00541
Copy Paste: [[2603.00541]] Spectral Condition for $μ$P under Width-Depth Scaling(https://arxiv.org/abs/2603.00541)
Keywords: foundation model, generative
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.

Title: Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning

Authors: Yu Wang, Shengjie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00550
Pdf URL: https://arxiv.org/pdf/2603.00550
Copy Paste: [[2603.00550]] Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning(https://arxiv.org/abs/2603.00550)
Keywords: anomaly
Abstract: Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for Learning Anomaly Semantics for WS-VAD, which integrates an anomaly-connected component mechanism and an intention awareness mechanism. The former is designed to assign video frames to distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (e.g., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.

Title: AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

Authors: Cencen Liu (1), Dongyang Zhang (1 and 2), Wen Yin (1), Jielei Wang (1 and 2), Tianyu Li (1), Ji Guo (1), Wenbo Jiang (1), Guoqing Wang (1), Guoming Lu (1 and 2) ((1) University of Electronic Science and Technology of China, (2) Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00589
Pdf URL: https://arxiv.org/pdf/2603.00589
Copy Paste: [[2603.00589]] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution(https://arxiv.org/abs/2603.00589)
Keywords: diffusion, generative
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.

Title: Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection

Authors: Li Sun, Lanxu Yang, Jiayu Tian, Bowen Fang, Xiaoyan Yu, Junda Ye, Peng Tang, Hao Peng, Philip S. Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00602
Pdf URL: https://arxiv.org/pdf/2603.00602
Copy Paste: [[2603.00602]] Learning to Explore: Policy-Guided Outlier Synthesis for Graph Out-of-Distribution Detection(https://arxiv.org/abs/2603.00602)
Keywords: anomaly
Abstract: Detecting out-of-distribution (OOD) graphs is crucial for ensuring the safety and reliability of Graph Neural Networks. In unsupervised graph-level OOD detection, models are typically trained using only in-distribution (ID) data, resulting in incomplete feature space characterization and weak decision boundaries. Although synthesizing outliers offers a promising solution, existing approaches rely on fixed, non-adaptive sampling heuristics (e.g., distance- or density-based), limiting their ability to explore informative OOD regions. We propose a Policy-Guided Outlier Synthesis (PGOS) framework that replaces static heuristics with a learned exploration strategy. Specifically, PGOS trains a reinforcement learning agent to navigate low-density regions in a structured latent space and sample representations that most effectively refine the OOD decision boundary. These representations are then decoded into high-quality pseudo-OOD graphs to improve detector robustness. Extensive experiments demonstrate that PGOS achieves state-of-the-art performance on multiple graph OOD and anomaly detection benchmarks.

Title: IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

Authors: Honghao Cai, Xiangyuan Wang, Yunhao Bai, Tianze Zhou, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00607
Pdf URL: https://arxiv.org/pdf/2603.00607
Copy Paste: [[2603.00607]] IdGlow: Dynamic Identity Modulation for Multi-Subject Generation(https://arxiv.org/abs/2603.00607)
Keywords: diffusion, generative
Abstract: Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

Title: Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models

Authors: Li Sun, Zhenhao Huang, Silei Chen, Lanxu Yang, Junda Ye, Sen Su, Philip S. Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00618
Pdf URL: https://arxiv.org/pdf/2603.00618
Copy Paste: [[2603.00618]] Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models(https://arxiv.org/abs/2603.00618)
Keywords: foundation model
Abstract: Multi-domain graph pre-training integrates knowledge from diverse domains to enhance performance in the target domains, which is crucial for building graph foundation models. Despite initial success, existing solutions often fall short of answering a fundamental question: how is knowledge integrated or transferred across domains? This theoretical limitation motivates us to rethink the consistency and transferability between model pre-training and domain adaptation. In this paper, we propose a fresh Riemannian geometry perspective, whose core idea is to merge any graph dataset into a unified, smooth Riemannian manifold, enabling a systematic understanding of knowledge integration and transfer. To achieve this, our key contribution is the theoretical establishment of neural manifold gluing, which first characterizes local geometry using an adaptive orthogonal frame and then "glues" the local pieces together into a coherent whole. Building on this theory, we present the GraphGlue framework, which supports batched pre-training with EMA prototyping and provides a transferability measure based on geometric consistency. Extensive experiments demonstrate its superior performance across diverse graph domains. Moreover, we empirically validate GraphGlue's geometric scaling law, showing that larger quantities of datasets improve model transferability by producing a smoother manifold. Code is available at this https URL.

Title: Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models

Authors: Yunzhong Qiu, Zhiyao Cen, Zhongyi Pei, Chen Wang, Jianmin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00629
Pdf URL: https://arxiv.org/pdf/2603.00629
Copy Paste: [[2603.00629]] Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models(https://arxiv.org/abs/2603.00629)
Keywords: foundation model
Abstract: Large time series models (LTMs) have emerged as powerful tools for universal forecasting, yet they often struggle with the inherent diversity and nonstationarity of real-world time series data, leading to an unsatisfactory trade-off between forecasting accuracy and generalization. Rather than continually finetuning new LTM instances for each domain, we propose a data-centric framework, time-series adaptive transformation optimization (TATO), that enables a single frozen pre-trained LTM to adapt to diverse downstream domains through an optimally configured transformation pipeline. Specifically, TATO constructs three representative types of transformations, including context slicing, scale normalization, and outlier correction, to help LTMs better align with target domain characteristics. To ensure robustness, we incorporate carefully selected time series augmentations and a two-stage ranking mechanism that filters out pipelines underperforming on specific metrics. Extensive experiments on state-of-the-art LTMs and widely used datasets demonstrate that TATO consistently and significantly improves domain-adaptive forecasting performance, achieving a maximum reduction in MSE of 65.4% and an average reduction of 13.6%. Moreover, TATO is highly efficient, typically completing optimization in under 2 minutes, making it practical for real-world deployment. The source code is available at this https URL.

Title: Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

Authors: Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00643
Pdf URL: https://arxiv.org/pdf/2603.00643
Copy Paste: [[2603.00643]] Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered(https://arxiv.org/abs/2603.00643)
Keywords: generative
Abstract: This position paper argues that the evaluation of modern visual processing systems should no longer be driven primarily by single-metric image quality assessment benchmarks, particularly in the era of generative and perception-oriented methods. Image restoration exemplifies this divergence: while objective IQA metrics enable reproducible, scalable evaluation, they have increasingly drifted apart from human perception and user preferences. We contend that this mismatch risks constraining innovation and misguiding research progress across visual processing tasks. Rather than rejecting metrics altogether, this paper calls for a rebalancing of evaluation paradigms, advocating a more human-centered, context-aware, and fine-grained approach to assessing the visual models' outcomes.

Title: Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis

Authors: Youngjin Yoo, Han Liu, Bogdan Georgescu, Yanbo Zhang, Sasa Grbic, Michael Baumgartner, Thomas J. Re, Jyotipriya Das, Poikavila Ullaskrishnan, Eva Eibenberger, Andrei Chekkoury, Uttam K. Bodanapally, Savvas Nicolaou, Pina C. Sanelli, Thomas J. Schroeppel, Yvonne W. Lui, Eli Gibson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00675
Pdf URL: https://arxiv.org/pdf/2603.00675
Copy Paste: [[2603.00675]] Specializing Foundation Models via Mixture of Low-Rank Experts for Comprehensive Head CT Analysis(https://arxiv.org/abs/2603.00675)
Keywords: foundation model
Abstract: Foundation models pre-trained on large-scale datasets demonstrate strong transfer learning capabilities; however, their adaptation to complex multi-label diagnostic tasks, such as comprehensive head CT finding detection, remains understudied. Standard parameter-efficient fine-tuning methods such as LoRA apply uniform adaptations across pathology types, which may limit performance for diverse medical findings. We propose a Mixture of Low-Rank Experts (MoLRE) framework that extends LoRA with multiple specialized low-rank adapters and unsupervised soft routing. This approach enables conditional feature adaptation with less than 0.5% additional parameters and without explicit pathology supervision. We present a comprehensive benchmark of MoLRE across six state-of-the-art medical imaging foundation models spanning 2D and 3D architectures, general-domain, medical-domain, and head CT-specific pretraining, and model sizes ranging from 7M to 431M parameters. Using over 70,000 non-contrast head CT scans with 75 annotated findings (including hemorrhage, infarction, trauma, mass lesions, structural abnormalities, and chronic changes), our experiments demonstrate consistent performance improvements across all models. Gains vary substantially: general-purpose and medical-domain models show the largest improvements (DINOv3-Base: +4.6%; MedGemma: +4.3%), whereas 3D CT-specialized or very large models show more modest gains (+0.2-1.3%). The combination of MoLRE and MedGemma achieves the highest average detection AUC of 0.917. These findings highlight the importance of systematic benchmarking on target clinical tasks, as pretraining domain, architecture, and model scale interact in non-obvious ways.

Title: Polynomial Mixing for Efficient Self-supervised Speech Encoders

Authors: Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00683
Pdf URL: https://arxiv.org/pdf/2603.00683
Copy Paste: [[2603.00683]] Polynomial Mixing for Efficient Self-supervised Speech Encoders(https://arxiv.org/abs/2603.00683)
Keywords: self-supervised
Abstract: State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between performance and efficiency in time and memory.

Title: RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang, Lin Fan, Yilin Zhou, Zikang Wang, Wenbo Yu, Lindong Wu, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00686
Pdf URL: https://arxiv.org/pdf/2603.00686
Copy Paste: [[2603.00686]] RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis(https://arxiv.org/abs/2603.00686)
Keywords: generative
Abstract: Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: this https URL.

Title: SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion

Authors: Guoquan Wei, Liu Shi, Shaoyu Wang, Mohan Li, Cunfeng Wei, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00687
Pdf URL: https://arxiv.org/pdf/2603.00687
Copy Paste: [[2603.00687]] SCOUT: Fast Spectral CT Imaging in Ultra LOw-data Regimes via PseUdo-label GeneraTion(https://arxiv.org/abs/2603.00687)
Keywords: self-supervised
Abstract: Noise and artifacts during computed tomography (CT) scans are a fundamental challenge affecting disease diagnosis. However, current methods either involve excessively long reconstruction times or rely on data-driven models for optimization, failing to adequately consider the valuable information inherent in the data itself, especially medical 3D data. This work proposes a reconstruction method under ultra-low raw data conditions, requiring no external data and avoiding lengthy pre-training processes. By leveraging spatial nonlocal similarity and the conjugate properties of the projection domain to generate pseudo-3D data for self-supervised training, high-fidelity results can be achieved in a very short time. Extensive experiments demonstrate that this method not only mitigates detector-induced ring artifacts but also exhibits unprecedented capabilities in detail recovery. This method provides a new paradigm for research using unlabeled raw projection data. Code is available at this https URL.

Title: Diversity over Uniformity: Rethinking Representation in Generated Image Detection

Authors: Qinghui He, Haifeng Zhang, Qiao Qin, Bo Liu, Xiuli Bi, Bin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00717
Pdf URL: https://arxiv.org/pdf/2603.00717
Copy Paste: [[2603.00717]] Diversity over Uniformity: Rethinking Representation in Generated Image Detection(https://arxiv.org/abs/2603.00717)
Keywords: generative
Abstract: With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02% and exhibiting superior generalization and detection reliability. The source code is available at this https URL.

Title: General Proximal Flow Networks

Authors: Alexander Strunk, Roland Assam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00751
Pdf URL: https://arxiv.org/pdf/2603.00751
Copy Paste: [[2603.00751]] General Proximal Flow Networks(https://arxiv.org/abs/2603.00751)
Keywords: generative
Abstract: This paper introduces General Proximal Flow Networks (GPFNs), a generalization of Bayesian Flow Networks that broadens the class of admissible belief-update operators. In Bayesian Flow Networks, each update step is a Bayesian posterior update, which is equivalent to a proximal step with respect to the Kullback-Leibler divergence. GPFNs replace this fixed choice with an arbitrary divergence or distance function, such as the Wasserstein distance, yielding a unified proximal-operator framework for iterative generative modeling. The corresponding training and sampling procedures are derived, establishing a formal link to proximal optimization and recovering the standard BFN update as a special case. Empirical evaluations confirm that adapting the divergence to the underlying data geometry yields measurable improvements in generation quality, highlighting the practical benefits of this broader framework.
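
Illustrative note (a sketch, not taken from the paper): the equivalence between a Bayesian posterior update and a KL-proximal step is a standard variational identity. The second line shows one plausible form of the generalisation the abstract describes; the symbol D and the exact objective are assumptions for illustration.

```latex
% Bayes' rule as a KL-proximal step around the prior p(\theta):
p(\theta \mid y)
  = \arg\min_{q} \Big\{ \mathrm{KL}\big(q \,\|\, p\big)
      - \mathbb{E}_{q}\big[\log p(y \mid \theta)\big] \Big\},
\qquad
p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta).

% Replacing KL with a generic divergence or distance D (e.g. a Wasserstein distance):
q^{+}
  = \arg\min_{q} \Big\{ D\big(q, p\big)
      - \mathbb{E}_{q}\big[\log p(y \mid \theta)\big] \Big\}.
```

Setting D to the KL divergence in the second expression recovers the first, which matches the abstract's statement that the standard BFN update is a special case.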

Title: Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder

Authors: Adam Marcus, Paul Bentley, Daniel Rueckert
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00756
Pdf URL: https://arxiv.org/pdf/2603.00756
Copy Paste: [[2603.00756]] Stroke outcome and evolution prediction from CT brain using a spatiotemporal diffusion autoencoder(https://arxiv.org/abs/2603.00756)
Keywords: diffusion, self-supervised
Abstract: Stroke is a major cause of death and disability worldwide. Accurate outcome and evolution prediction has the potential to revolutionize stroke care by individualizing clinical decision-making leading to better outcomes. However, despite a plethora of attempts and the rich data provided by neuroimaging, modelling the ultimate fate of brain tissue remains a challenging task. In this work, we apply recent ideas in the field of diffusion probabilistic models to generate a self-supervised semantically meaningful stroke representation from Computed Tomography (CT) images. We then improve this representation by extending the method to accommodate longitudinal images and the time from stroke onset. The effectiveness of our approach is evaluated on a dataset consisting of 5,824 CT images from 3,573 patients across two medical centers with minimal labels. Comparative experiments show that our method achieves the best performance for predicting next-day severity and functional outcome at discharge.

Title: Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models

Authors: Zhenyu Zhou, Defang Chen, Siwei Lyu, Chun Chen, Can Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00763
Pdf URL: https://arxiv.org/pdf/2603.00763
Copy Paste: [[2603.00763]] Analyzing and Improving Fast Sampling of Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.00763)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved unprecedented success but still struggle to produce high-quality results under limited sampling budgets. Existing training-free sampling acceleration methods are typically developed independently, leaving the overall performance and compatibility among these methods unexplored. In this paper, we bridge this gap by systematically elucidating the design space, and our comprehensive experiments identify the sampling time schedule as the most pivotal factor. Inspired by the geometric properties of diffusion models revealed through the Frenet-Serret formulas, we propose constant total rotation schedule (TORS), a scheduling strategy that ensures uniform geometric variation along the sampling trajectory. TORS outperforms previous training-free acceleration methods and produces high-quality images with 10 sampling steps on Flux.1-Dev and Stable Diffusion 3.5. Extensive experiments underscore the adaptability of our method to unseen models, hyperparameters, and downstream applications.

Title: Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning

Authors: Karanpartap Singh, Adam Turnbull, Mohammad Abbasi, Kilian Pohl, Feng Vankee Lin, Ehsan Adeli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00786
Pdf URL: https://arxiv.org/pdf/2603.00786
Copy Paste: [[2603.00786]] Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning(https://arxiv.org/abs/2603.00786)
Keywords: self-supervised
Abstract: Understanding how large-scale functional brain networks reorganize during cognitive decline remains a central challenge in neuroimaging. While recent self-supervised models have shown promise for learning representations from resting-state fMRI, their internal mechanisms are difficult to interpret, limiting mechanistic insight. We propose BrainInterNet, a network-aware self-supervised framework based on masked reconstruction with cross-attention that explicitly models inter-network dependencies in rs-fMRI. By selectively masking predefined functional networks and reconstructing them from remaining context, our approach enables direct quantification of network predictability and interpretable analysis of cross-network interactions. We train BrainInterNet on multi-cohort fMRI data (from the ABCD, HCP Development, HCP Young Adults, and HCP Aging datasets) and evaluate on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, in total comprising 5,582 recordings. Our method reveals systematic alterations in the brain's network interactions under AD, including in the default mode, limbic, and attention networks. In parallel, the learned representations support accurate Alzheimer's-spectrum classification and yield a compact summary marker that tracks disease severity longitudinally. Together, these results demonstrate that network-guided masked modeling with cross-attention provides an interpretable and effective framework for characterizing functional reorganization in neurodegeneration.

Title: COMBAT: Conditional World Models for Behavioral Agent Training

Authors: Anmol Agarwal, Pranay Meshram, Sumer Singh, Saurav Suman, Andrew Lapp, Shahbuland Matiana, Louis Castricato, Spencer Frazier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00825
Pdf URL: https://arxiv.org/pdf/2603.00825
Copy Paste: [[2603.00825]] COMBAT: Conditional World Models for Behavioral Agent Training(https://arxiv.org/abs/2603.00825)
Keywords: diffusion
Abstract: Recent advances in video generation have spurred the development of world models capable of simulating 3D-consistent environments and interactions with static objects. However, a significant limitation remains in their ability to model dynamic, reactive agents that can intelligently influence and interact with the world. To address this gap, we introduce COMBAT, a real-time, action-controlled world model trained on the complex 1v1 fighting game Tekken 3. Our work demonstrates that diffusion models can successfully simulate a dynamic opponent that reacts to player actions, learning its behavior implicitly. Our approach utilizes a 1.2 billion parameter Diffusion Transformer, conditioned on latent representations from a deep compression autoencoder. We employ state-of-the-art techniques, including causal distillation and diffusion forcing, to achieve real-time inference. Crucially, we observe the emergence of sophisticated agent behavior by training the model solely on single-player inputs, without any explicit supervision for the opponent's policy. Unlike traditional imitation learning methods, which require complete action labels, COMBAT learns effectively from partially observed data to generate responsive behaviors for a controllable Player 1. We present an extensive study and introduce novel evaluation methods to benchmark this emergent agent behavior, establishing a strong foundation for training interactive agents within diffusion-based world models.

Title: MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

Authors: Idelfonso B. R. Nogueira, Carine M. Rebelloa, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00857
Pdf URL: https://arxiv.org/pdf/2603.00857
Copy Paste: [[2603.00857]] MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules(https://arxiv.org/abs/2603.00857)
Keywords: foundation model
Abstract: Predicting physicochemical properties across chemical space is vital for chemical engineering, drug discovery, and materials science. Current molecular foundation models lack thermodynamic consistency, while domain-informed approaches are limited to single properties and small datasets. We introduce MultiPUFFIN, a domain-constrained multimodal foundation model addressing both limitations simultaneously. MultiPUFFIN features: (i) an encoder fusing SMILES, graphs, and 3D geometries via gated cross-modal attention, alongside experimental condition and descriptor encoders; (ii) prediction heads embedding established correlations (e.g., Wagner, Andrade, van't Hoff, and Shomate equations) as inductive biases to ensure thermodynamic consistency; and (iii) a two-stage multi-task training this http URL prior frameworks, MultiPUFFIN predicts nine thermophysical properties simultaneously. It is trained on a multi-source dataset of 37,968 unique molecules (40,904 rows). With roughly 35 million parameters, MultiPUFFIN achieves a mean $R^2 = 0.716$ on a challenging scaffold-split test set of 8,877 molecules. Compared to ChemBERTa-2 (pre-trained on 77 million molecules), MultiPUFFIN outperforms the fine-tuned baseline across all nine properties despite using 2000x fewer training molecules. Advantages are strikingly apparent for temperature-dependent properties, where ChemBERTa-2 lacks the architectural capacity to incorporate thermodynamic this http URL results demonstrate that multimodal encoding and domain-informed biases substantially reduce data and compute requirements compared to brute-force pre-training. Furthermore, MultiPUFFIN handles missing modalities and recovers meaningful thermodynamic parameters without explicit supervision. Systematic ablation studies confirm the property-specific benefits of these domain-informed prediction heads.

Title: AMDS: Attack-Aware Multi-Stage Defense System for Network Intrusion Detection with Two-Stage Adaptive Weight Learning

Authors: Oluseyi Olukola, Nick Rahimi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00859
Pdf URL: https://arxiv.org/pdf/2603.00859
Copy Paste: [[2603.00859]] AMDS: Attack-Aware Multi-Stage Defense System for Network Intrusion Detection with Two-Stage Adaptive Weight Learning(https://arxiv.org/abs/2603.00859)
Keywords: anomaly
Abstract: Machine learning based network intrusion detection systems are vulnerable to adversarial attacks that degrade classification performance under both gradient-based and distribution shift threat models. Existing defenses typically apply uniform detection strategies, which may not account for heterogeneous attack characteristics. This paper proposes an attack-aware multi-stage defense framework that learns attack-specific detection strategies through a weighted combination of ensemble disagreement, predictive uncertainty, and distributional anomaly signals. Empirical analysis across seven adversarial attack types reveals distinct detection signatures, enabling a two-stage adaptive detection mechanism. Experimental evaluation on a benchmark intrusion detection dataset indicates that the proposed system attains 94.2% area under the receiver operating characteristic curve and improves classification accuracy by 4.5 percentage points and F1-score by 9.0 points over adversarially trained ensembles. Under adaptive white-box attacks with full architectural knowledge, the system appears to maintain 94.4% accuracy with a 4.2% attack success rate, though this evaluation is limited to two adaptive variants and does not constitute a formal robustness guarantee. Cross-dataset validation further suggests that defense effectiveness depends on baseline classifier competence and may vary with feature dimensionality. These results suggest that attack-specific optimization combined with multi-signal integration can provide a practical approach to improving adversarial robustness in machine learning-based intrusion detection systems.

Title: Active Flow Matching

Authors: Yashvir S. Grewal, Daniel M. Steinberg, Thang D. Bui, Cheng Soon Ong, Edwin V. Bonilla
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00877
Pdf URL: https://arxiv.org/pdf/2603.00877
Copy Paste: [[2603.00877]] Active Flow Matching(https://arxiv.org/abs/2603.00877)
Keywords: diffusion, generative
Abstract: Discrete diffusion and flow matching models capture complex, non-additive and non-autoregressive structure in high-dimensional objective landscapes through parallel, iterative refinement. However, their implicit generative nature precludes direct integration with principled variational frameworks for online black-box optimisation, such as variational search distributions (VSD) and conditioning by adaptive sampling (CbAS). We introduce Active Flow Matching (AFM), which reformulates variational objectives to operate on conditional endpoint distributions along the flow, enabling gradient-based steering of flow models toward high-fitness regions while preserving the rigour of VSD and CbAS. We derive forward and reverse Kullback-Leibler (KL) variants using self-normalised importance sampling. Across a suite of online protein and small molecule design tasks, forward-KL AFM consistently performs competitively compared to state-of-the-art baselines, demonstrating effective exploration-exploitation under tight experimental budgets.
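
Illustrative sketch (not the AFM objective itself): the abstract mentions self-normalised importance sampling, a standard estimator for expectations under a target known only up to a constant. The toy example below estimates the mean of an unnormalised Gaussian target using a wider Gaussian proposal. The function name `snis_expectation` and the toy densities are assumptions for illustration.

```python
import numpy as np

def snis_expectation(f, log_p_unnorm, sample_q, log_q, n=200_000, seed=0):
    """Self-normalised importance sampling estimate of E_p[f(x)].

    log_p_unnorm and log_q may each omit additive constants, because the
    weights are normalised to sum to one.
    """
    rng = np.random.default_rng(seed)
    x = sample_q(rng, n)
    log_w = log_p_unnorm(x) - log_q(x)
    log_w = log_w - log_w.max()   # stabilise before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()               # self-normalisation step
    return float(np.sum(w * f(x)))

# Toy check: target N(1, 1) given up to a constant, proposal N(0, 2^2).
est = snis_expectation(
    f=lambda x: x,
    log_p_unnorm=lambda x: -0.5 * (x - 1.0) ** 2,
    sample_q=lambda rng, n: rng.normal(0.0, 2.0, size=n),
    log_q=lambda x: -0.5 * (x / 2.0) ** 2,
)
print(est)  # should be close to the target mean of 1.0
```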

Title: Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

Authors: Michael Hardy, Yunsung Kim
Subjects: cs.LG, cs.AI, cs.CY, stat.AP
Abstract URL: https://arxiv.org/abs/2603.00883
Pdf URL: https://arxiv.org/pdf/2603.00883
Copy Paste: [[2603.00883]] Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact(https://arxiv.org/abs/2603.00883)
Keywords: foundation model, generative
Abstract: LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study evaluates the performance of leading foundation models (FMs, i.e., generative pre-trained base LLMs) on out-of-distribution (OOD) tasks involving the teaching and learning of schoolchildren. Across all FMs, inter-model behaviors on disparate tasks correlate more strongly with each other than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with learning outcomes. Further, we find that multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that 50% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into both educational applications of foundation models and the limitations of these models.

Title: Probabilistic Learning and Generation in Deep Sequence Models

Authors: Wenlong Chen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.00888
Pdf URL: https://arxiv.org/pdf/2603.00888
Copy Paste: [[2603.00888]] Probabilistic Learning and Generation in Deep Sequence Models(https://arxiv.org/abs/2603.00888)
Keywords: diffusion, self-supervised, generative
Abstract: Despite the exceptional predictive performance of deep sequence models (DSMs), the main concern about their deployment centers on the lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables with rules of probability. Notably, Bayesian methods leverage Bayes' rule to express our belief about unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied in deep neural networks, are prior specification and approximation quality. In Chapters 3 and 4, we investigate how the architectures of DSMs themselves can be informative for the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer based on the similarity between attention and sparse Gaussian processes. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian processes, which successfully memorizes the history in online learning. In addition to the progress of DSMs in predictive tasks, sequential generative models consisting of a sequence of latent variables have become popular in the domain of deep generative models. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5, we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desired probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, which bridges the gap between DSMs and probabilistic models, leading to mutually reinforced improvement.

Title: Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents

Authors: Ben Dong, Hui Feng, Qian Wang
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2603.00902
Pdf URL: https://arxiv.org/pdf/2603.00902
Copy Paste: [[2603.00902]] Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents(https://arxiv.org/abs/2603.00902)
Keywords: generative
Abstract: Modern generative agents such as OpenClaw, an open-source, self-hosted personal assistant with a community skill ecosystem, are gaining attention and are used pervasively. However, the openness and rapid growth of these ecosystems often outpace systematic security evaluation. In this paper, we design, implement, and evaluate Clawdrain, a Trojanized skill that induces a multi-turn "Segmented Verification Protocol" via injected this http URL instructions and a companion script that returns PROGRESS/REPAIR/TERMINAL signals. We deploy Clawdrain in a production-like OpenClaw instance with real API billing and a production model (Gemini 2.5 Pro), and we measure 6-7x token amplification over a benign baseline, with a costly failure configuration reaching approximately 9x. We observe a deployment-only phenomenon: the agent autonomously composes general-purpose tools (e.g., shell/Python) to route around brittle protocol steps, reducing amplification and altering attack dynamics. Finally, we identify production vectors enabled by OpenClaw's architecture, including this http URL prompt bloat, persistent tool-output pollution, cron/heartbeat frequency amplification, and behavioral instruction injection. Overall, we demonstrate that token-drain attacks remain feasible in real deployments, but their magnitude and observability are shaped by tool composition, recovery behavior, and interface design.

Title: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Authors: Seungwook Kim, Minsu Cho
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00918
Pdf URL: https://arxiv.org/pdf/2603.00918
Copy Paste: [[2603.00918]] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards(https://arxiv.org/abs/2603.00918)
Keywords: generative
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce ARC (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.
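
Illustrative sketch (not the ARC implementation): the abstract describes a reward derived from how accurately a model recovers injected noise. The toy function below scores a sample by the negative mean squared error between injected and predicted noise, averaged over a few probes. The function name, the `denoiser(x_noisy, noise_level)` interface, the variance-preserving mixing rule, and the choice of negative MSE as the reward are all assumptions for illustration.

```python
import numpy as np

def self_confidence_reward(denoiser, x, noise_level=0.5, num_probes=8, seed=0):
    """Scalar reward: higher when the denoiser recovers injected noise well.

    `denoiser(x_noisy, noise_level)` is a hypothetical callable that returns
    a noise estimate with the same shape as `x`.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(num_probes):
        eps = rng.standard_normal(x.shape)                      # injected noise
        x_noisy = np.sqrt(1.0 - noise_level**2) * x + noise_level * eps
        eps_hat = denoiser(x_noisy, noise_level)                # model's estimate
        errors.append(np.mean((eps_hat - eps) ** 2))
    return -float(np.mean(errors))                              # negative MSE
```

An oracle that knows the clean sample scores close to zero under this definition, while a denoiser that always predicts zero noise scores close to -1.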

Title: Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Authors: Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00947
Pdf URL: https://arxiv.org/pdf/2603.00947
Copy Paste: [[2603.00947]] Mobile-VTON: High-Fidelity On-Device Virtual Try-On(https://arxiv.org/abs/2603.00947)
Keywords: diffusion
Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024×768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.

Title: Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

Authors: Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00975
Pdf URL: https://arxiv.org/pdf/2603.00975
Copy Paste: [[2603.00975]] Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models(https://arxiv.org/abs/2603.00975)
Keywords: diffusion, generative
Abstract: Unlearning in text-to-image diffusion models often leads to uneven concept removal and unintended forgetting of unrelated capabilities. This complicates tasks such as copyright compliance, protected data mitigation, artist opt-outs, and policy-driven content updates. As models grow larger and adopt more diverse architectures, achieving precise and selective unlearning while preserving generative quality becomes increasingly challenging. We introduce SurgUn (pronounced as Surgeon), a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones by competing for shared representational pathways. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept while preserving unrelated capabilities through a novel training paradigm. SurgUn achieves high-precision unlearning across diverse settings. It performs strongly on compact U-Net based models such as Stable Diffusion v1.5, scales effectively to the larger U-Net architecture SDXL, and extends to SANA, representing an underexplored Diffusion Transformer based architecture for unlearning.

Title: EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization

Authors: Zhaoxin Fan, Nanxiang Jiang, Daiheng Gao, Shiji Zhou, Wenjun Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00978
Pdf URL: https://arxiv.org/pdf/2603.00978
Copy Paste: [[2603.00978]] EraseAnything++: Enabling Concept Erasure in Rectified Flow Transformers Leveraging Multi-Object Optimization(https://arxiv.org/abs/2603.00978)
Keywords: diffusion, generative
Abstract: Removing undesired concepts from large-scale text-to-image (T2I) and text-to-video (T2V) diffusion models while preserving overall generative quality remains a major challenge, particularly as modern models such as Stable Diffusion v3, Flux, and OpenSora employ flow-matching and transformer-based architectures and extend to long-horizon video generation. Existing concept erasure methods, designed for earlier T2I/T2V models, often fail to generalize to these paradigms. To address this issue, we propose EraseAnything++, a unified framework for concept erasure in both image and video diffusion models with flow-matching objectives. Central to our approach is formulating concept erasure as a constrained multi-objective optimization problem that explicitly balances concept removal with preservation of generative utility. To solve the resulting conflicting objectives, we introduce an efficient utility-preserving unlearning strategy based on implicit gradient surgery. Furthermore, by integrating LoRA-based parameter tuning with attention-level regularization, our method anchors erasure on key visual representations and propagates it consistently across spatial and temporal dimensions. In the video setting, we further enhance consistency through an anchor-and-propagate mechanism that initializes erasure on reference frames and enforces it throughout subsequent transformer layers, thereby mitigating temporal drift. Extensive experiments on both image and video benchmarks demonstrate that EraseAnything++ substantially outperforms prior methods in erasure effectiveness, generative fidelity, and temporal consistency, establishing a new state of the art for concept erasure in next-generation diffusion models.
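
Illustrative sketch (a swapped-in technique, not the paper's method): the abstract refers to "implicit gradient surgery" without giving details. The snippet below shows the simpler explicit, PCGrad-style projection instead: when the erasure gradient conflicts with the utility-preservation gradient, the conflicting component is removed. The function and argument names are hypothetical.

```python
import numpy as np

def project_conflict(g_erase, g_keep, eps=1e-12):
    """Remove from g_erase the component that opposes g_keep.

    If the two gradients do not conflict (non-negative dot product),
    g_erase is returned unchanged.
    """
    dot = float(np.dot(g_erase, g_keep))
    if dot >= 0.0:
        return g_erase.copy()
    # Subtract the projection of g_erase onto g_keep
    return g_erase - dot / (float(np.dot(g_keep, g_keep)) + eps) * g_keep

g_erase = np.array([1.0, -1.0])   # toy erasure gradient
g_keep = np.array([0.0, 1.0])     # toy utility-preservation gradient
g_fixed = project_conflict(g_erase, g_keep)
print(g_fixed)  # approximately [1, 0]: the opposing component is removed
```

After projection the update no longer has a component pointing against the preservation gradient, which is the balancing behaviour the multi-objective formulation aims for.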

Title: Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation

Authors: Jiaqi Tang, Mengyan Zheng, Shu Zhang, Fandong Zhang, Qingchao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00979
Pdf URL: https://arxiv.org/pdf/2603.00979
Copy Paste: [[2603.00979]] Fake It Right: Injecting Anatomical Logic into Synthetic Supervised Pre-training for Medical Segmentation(https://arxiv.org/abs/2603.00979)
Keywords: self-supervised
Abstract: Vision Transformers (ViTs) excel in 3D medical segmentation but require massive annotated datasets. While Self-Supervised Learning (SSL) mitigates this using unlabeled data, it still faces strict privacy and logistical barriers. Formula-Driven Supervised Learning (FDSL) offers a privacy-preserving alternative by pre-training on synthetic mathematical primitives. However, a critical semantic gap limits its efficacy: generic shapes lack the morphological fidelity, fixed spatial layouts, and inter-organ relationships of real anatomy, preventing models from learning essential global structural priors. To bridge this gap, we propose an Anatomy-Informed Synthetic Supervised Pre-training framework unifying FDSL's infinite scalability with anatomical realism. We replace basic primitives with a lightweight shape bank with de-identified, label-only segmentation masks from 5 subjects. Furthermore, we introduce a structure-aware sequential placement strategy to govern the patch synthesis process. Instead of random placement, we enforce physiological plausibility using spatial anchors for correct localization and a topological graph to manage inter-organ interactions (e.g., preventing impossible overlaps). Extensive experiments on BTCV and MSD datasets demonstrate that our method significantly outperforms state-of-the-art FDSL baselines and SSL methods by 1.74% and up to 1.66%, while exhibiting a robust scaling effect where performance improves with increased synthetic data volume. This provides a data-efficient, privacy-compliant solution for medical segmentation. The code will be made publicly available upon acceptance.

Title: Event-Anchored Frame Selection for Effective Long-Video Understanding

Authors: Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00983
Pdf URL: https://arxiv.org/pdf/2603.00983
Copy Paste: [[2603.00983]] Event-Anchored Frame Selection for Effective Long-Video Understanding(https://arxiv.org/abs/2603.00983)
Keywords: self-supervised
Abstract: Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
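A minimal sketch of greedy Maximal Marginal Relevance (MMR) selection seeded with anchor frames, to illustrate the trade-off between query relevance and visual diversity described above. This is the standard MMR rule with a fixed weight, not the adaptive scheme the abstract refers to. The function names, the `lam` parameter, and the toy 2-D embeddings are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def mmr_select(frames, query, k, lam=0.5, anchors=()):
    """Hypothetical helper: greedy MMR over frame embeddings.

    frames  : list of embedding vectors, one per frame
    query   : embedding vector of the text query
    k       : total number of frames to return (anchors included)
    lam     : weight on relevance; (1 - lam) weights the redundancy penalty
    anchors : frame indices that are pre-selected (e.g. one per event)
    """
    selected = list(anchors)
    candidates = [i for i in range(len(frames)) if i not in selected]
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(frames[i], query)
            # Redundancy is the highest similarity to any frame already kept.
            redundancy = max(
                (cosine(frames[i], frames[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]  # frames 0 and 1 are near-duplicates
    query = [1.0, 0.3]
    print(mmr_select(frames, query, k=2))
```

With `lam=0.5` the second pick skips the near-duplicate frame in favor of the visually distinct one; setting `lam=1.0` reduces the rule to pure relevance ranking.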

Title: Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality

Authors: Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, Jocelyn Chanussot
Subjects: cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2603.00988
Pdf URL: https://arxiv.org/pdf/2603.00988
Copy Paste: [[2603.00988]] Foundation Models in Remote Sensing: Evolving from Unimodality to Multimodality(https://arxiv.org/abs/2603.00988)
Keywords: foundation model
Abstract: Remote sensing (RS) techniques are increasingly crucial for deepening our understanding of the planet. As the volume and diversity of RS data continue to grow exponentially, there is an urgent need for advanced data modeling and understanding capabilities to manage and interpret these vast datasets effectively. Foundation models present significant new growth opportunities and immense potential to revolutionize the RS field. In this paper, we conduct a comprehensive technical survey on foundation models in RS, offering a brand-new perspective by exploring their evolution from unimodality to multimodality. We hope this work serves as a valuable entry point for researchers interested in both foundation models and RS and helps them launch new projects or explore new research topics in this rapidly evolving area. This survey addresses the following three key questions: What are foundation models in RS? Why are foundation models needed in RS? How can we effectively guide junior researchers in gaining a comprehensive and practical understanding of foundation models in RS applications? More specifically, we begin by outlining the background and motivation, emphasizing the importance of foundation models in RS. We then review existing foundation models in RS, systematically categorizing them into unimodal and multimodal approaches. Additionally, we provide a tutorial-like section to guide researchers, especially beginners, on how to train foundation models in RS and apply them to real-world tasks. The survey aims to equip researchers in RS with a deeper and more efficient understanding of foundation models, enabling them to get started easily and effectively apply these models across various RS applications.

Title: MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation

Authors: Yi Zhang, Puxun Tu, Kun Wang, Yulin Yan, Tao Ying, Xiaojun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.00990
Pdf URL: https://arxiv.org/pdf/2603.00990
Copy Paste: [[2603.00990]] MLRecon: Robust Markerless Freehand 3D Ultrasound Reconstruction via Coarse-to-Fine Pose Estimation(https://arxiv.org/abs/2603.00990)
Keywords: foundation model
Abstract: Freehand 3D ultrasound (US) reconstruction promises volumetric imaging with the flexibility of standard 2D probes, yet existing tracking paradigms face a restrictive trilemma: marker-based systems demand prohibitive costs, inside-out methods require intrusive sensor attachment, and sensorless approaches suffer from severe cumulative drift. To overcome these limitations, we present MLRecon, a robust markerless 3D US reconstruction framework delivering drift-resilient 6D probe pose tracking using a single commodity RGB-D camera. Leveraging the generalization power of vision foundation models, our pipeline enables continuous markerless tracking of the probe, augmented by a vision-guided divergence detector that autonomously monitors tracking integrity and triggers failure recovery to ensure uninterrupted scanning. Crucially, we further propose a dual-stage pose refinement network that explicitly disentangles high-frequency jitter from low-frequency bias, effectively denoising the trajectory while maintaining the kinematic fidelity of operator maneuvers. Experiments demonstrate that MLRecon significantly outperforms competing sensorless and sensor-aided methods, achieving average position errors as low as 0.88 mm on complex trajectories and yielding high-quality 3D reconstructions with sub-millimeter mean surface accuracy. This establishes a new benchmark for low-cost, accessible volumetric US imaging in resource-limited clinical settings.

Title: Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information

Authors: Xinwen Cheng, Jingyuan Zhang, Zhehao Huang, Yingwen Wu, Xiaolin Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.00992
Pdf URL: https://arxiv.org/pdf/2603.00992
Copy Paste: [[2603.00992]] Compensation-free Machine Unlearning in Text-to-Image Diffusion Models by Eliminating the Mutual Information(https://arxiv.org/abs/2603.00992)
Keywords: diffusion, generative
Abstract: The powerful generative capabilities of diffusion models have raised growing privacy and safety concerns regarding generating sensitive or undesired content. In response, machine unlearning (MU) -- commonly referred to as concept erasure (CE) in diffusion models -- has been introduced to remove specific knowledge from model parameters while preserving innocent knowledge. Despite recent advancements, existing unlearning methods often suffer from excessive and indiscriminate removal, which leads to substantial degradation in the quality of innocent generations. To preserve model utility, prior works rely on compensation, i.e., re-assimilating a subset of the remaining data or explicitly constraining the divergence from the pre-trained model on remaining concepts. However, we reveal that generations beyond the compensation scope still suffer, suggesting such post-remedial compensations are inherently insufficient for preserving the general utility of large-scale generative models. Therefore, in this paper, we advocate for developing compensation-free concept erasure operations, which precisely identify and eliminate the undesired knowledge such that the impact on other generations is minimal. Technically, we propose MiM-MU, which unlearns a concept by minimizing the mutual information, with a delicate design for computational effectiveness and for maintaining the sampling distribution for other concepts. Extensive evaluations demonstrate that our proposed method achieves effective concept removal while maintaining high-quality generations for other concepts, and remarkably, without relying on any post-remedial compensation for the first time.

Title: Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer

Authors: Yuze Li, Dong Gong, Xiao Cao, Junchao Yuan, Dongsheng Li, Lei Zhou, Yun Sing Koh, Cheng Yan, Xinyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01000
Pdf URL: https://arxiv.org/pdf/2603.01000
Copy Paste: [[2603.01000]] Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer(https://arxiv.org/abs/2603.01000)
Keywords: diffusion
Abstract: Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.

Title: GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

Authors: Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01010
Pdf URL: https://arxiv.org/pdf/2603.01010
Copy Paste: [[2603.01010]] GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis(https://arxiv.org/abs/2603.01010)
Keywords: diffusion, generative
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.

Title: BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models

Authors: Jiayao Wang, Yiping Zhang, Mohammad Maruf Hasan, Xiaoying Lei, Jiale Zhang, Junwu Zhu, Qilin Wu, Dongfang Zhao
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01019
Pdf URL: https://arxiv.org/pdf/2603.01019
Copy Paste: [[2603.01019]] BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models(https://arxiv.org/abs/2603.01019)
Keywords: diffusion, self-supervised, generative
Abstract: Self-supervised diffusion models learn high-quality visual representations via latent space denoising. However, their representation layer poses a distinct threat: unlike traditional attacks targeting generative outputs, its unconstrained latent semantic space allows for stealthy backdoors, permitting malicious control upon triggering. In this paper, we propose BadRSSD, the first backdoor attack targeting the representation layer of self-supervised diffusion models. Specifically, it hijacks the semantic representations of poisoned samples with triggers in Principal Component Analysis (PCA) space toward those of a target image, then controls the denoising trajectory during diffusion by applying coordinated constraints across latent, pixel, and feature distribution spaces to steer the model toward generating the specified target. Additionally, we integrate representation dispersion regularization into the constraint framework to maintain feature space uniformity, significantly enhancing attack stealth. This approach preserves normal model functionality (high utility) while achieving precise target generation upon trigger activation (high specificity). Experiments on multiple benchmark datasets demonstrate that BadRSSD substantially outperforms existing attacks in both FID and MSE metrics, reliably establishing backdoors across different architectures and configurations, and effectively resisting state-of-the-art backdoor defenses.

Title: Vision-Language Feature Alignment for Road Anomaly Segmentation

Authors: Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01029
Pdf URL: https://arxiv.org/pdf/2603.01029
Copy Paste: [[2603.01029]] Vision-Language Feature Alignment for Road Anomaly Segmentation(https://arxiv.org/abs/2603.01029)
Keywords: anomaly
Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and this http URL is released on this https URL.

Title: Evaluating GFlowNet from partial episodes for stable and flexible policy-based training

Authors: Puhua Niu, Shili Wu, Xiaoning Qian
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01047
Pdf URL: https://arxiv.org/pdf/2603.01047
Copy Paste: [[2603.01047]] Evaluating GFlowNet from partial episodes for stable and flexible policy-based training(https://arxiv.org/abs/2603.01047)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.

Title: LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01068
Pdf URL: https://arxiv.org/pdf/2603.01068
Copy Paste: [[2603.01068]] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model(https://arxiv.org/abs/2603.01068)
Keywords: diffusion
Abstract: We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at this https URL.

Title: Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration

Authors: Yunguan Fu, Wenjia Bai, Wen Yan, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01073
Pdf URL: https://arxiv.org/pdf/2603.01073
Copy Paste: [[2603.01073]] Flow Matching-enabled Test-Time Refinement for Unsupervised Cardiac MR Registration(https://arxiv.org/abs/2603.01073)
Keywords: diffusion
Abstract: Diffusion-based unsupervised image registration has been explored for cardiac cine MR, but expensive multi-step inference limits practical use. We propose FlowReg, a flow-matching framework in displacement field space that achieves strong registration in as few as two steps and supports further refinement with more steps. FlowReg uses warmup-reflow training: a single-step network first acts as a teacher, then a student learns to refine from arbitrary intermediate states, removing the need for a pre-trained model as in existing methods. An Initial Guess strategy feeds back the model prediction as the next starting point, improving refinement from step two onward. On ACDC and MM2 across six tasks (including cross-dataset generalization), FlowReg outperforms the state of the art on five tasks (+0.6% mean Dice score on average), with the largest gain in the left ventricle (+1.09%), and reduces LVEF estimation error on all six tasks (-2.58 percentage points), using only 0.7% extra parameters and no segmentation labels. Anonymized code is available at this https URL.

Title: Unified Vision-Language Modeling via Concept Space Alignment

Authors: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01096
Pdf URL: https://arxiv.org/pdf/2603.01096
Copy Paste: [[2603.01096]] Unified Vision-Language Modeling via Concept Space Alignment(https://arxiv.org/abs/2603.01096)
Keywords: diffusion
Abstract: We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al. 2024) operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and -modal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them across 61 rich- to low-resource languages out of all 62 tested languages.

Title: Understanding LoRA as Knowledge Memory: An Empirical Analysis

Authors: Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, S. K. Hong, Youngjune Gwon, Sungjin Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01097
Pdf URL: https://arxiv.org/pdf/2603.01097
Copy Paste: [[2603.01097]] Understanding LoRA as Knowledge Memory: An Empirical Analysis(https://arxiv.org/abs/2603.01097)
Keywords: in-context
Abstract: Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although a few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as a complementary axis of memory alongside RAG and ICL, offering distinct advantages.

Title: Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting

Authors: Dantong Qin, Alessandro Bozzon, Xian Yang, Xun Zhang, Yike Guo, Pan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01103
Pdf URL: https://arxiv.org/pdf/2603.01103
Copy Paste: [[2603.01103]] Data-Efficient Brushstroke Generation with Diffusion Models for Oil Painting(https://arxiv.org/abs/2603.01103)
Keywords: diffusion, generative
Abstract: Many creative multimedia systems are built upon visual primitives such as strokes or textures, which are difficult to collect at scale and fundamentally different from natural image data. This data scarcity makes it challenging for modern generative models to learn expressive and controllable primitives, limiting their use in process-aware content creation. In this work, we study the problem of learning human-like brushstroke generation from a small set of hand-drawn samples (n=470) and propose StrokeDiff, a diffusion-based framework with Smooth Regularization (SmR). SmR injects stochastic visual priors during training, providing a simple mechanism to stabilize diffusion models under sparse supervision without altering the inference process. We further show how the learned primitives can be made controllable through a Bézier-based conditioning module and integrated into a complete stroke-based painting pipeline, including prediction, generation, ordering, and compositing. This demonstrates how data-efficient primitive modeling can support expressive and structured multimedia content creation. Experiments indicate that the proposed approach produces diverse and structurally coherent brushstrokes and enables paintings with richer texture and layering, validated by both automatic metrics and human evaluation.

Title: GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation

Authors: Zhuonan Liang, Wei Guo, Jie Gan, Yaxuan Song, Runnan Chen, Hang Chang, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01115
Pdf URL: https://arxiv.org/pdf/2603.01115
Copy Paste: [[2603.01115]] GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation(https://arxiv.org/abs/2603.01115)
Keywords: foundation model
Abstract: Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions the native foundation model to act as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representations from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of medical dedicated architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at this https URL

Title: Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations

Authors: Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Heng Yu, Xudong Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01125
Pdf URL: https://arxiv.org/pdf/2603.01125
Copy Paste: [[2603.01125]] Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations(https://arxiv.org/abs/2603.01125)
Keywords: anomaly
Abstract: While visual reasoning for simple analogies has received significant attention, compositional visual relations (CVR) remain relatively unexplored due to their greater complexity. To solve CVR tasks, we propose Predictive Reasoning with Augmented Anomaly Contrastive Learning (PR-A$^2$CL), i.e., to identify an outlier image given three other images that follow the same compositional rules. To address the challenge of modelling abundant compositional rules, an Augmented Anomaly Contrastive Learning is designed to distil discriminative and generalizable features by maximizing similarity among normal instances while minimizing similarity between normal and anomalous outliers. More importantly, a predict-and-verify paradigm is introduced for rule-based reasoning, in which a series of Predictive Anomaly Reasoning Blocks (PARBs) iteratively leverage features from three out of the four images to predict those of the remaining one. Throughout the subsequent verification stage, the PARBs progressively pinpoint the specific discrepancies attributable to the underlying rules. Experimental results on SVRT, CVR and MC$^2$R datasets show that PR-A$^2$CL significantly outperforms state-of-the-art reasoning models.

Title: A Deep Learning Framework for Heat Demand Forecasting using Time-Frequency Representations of Decomposed Features

Authors: Adithya Ramachandran, Satyaki Chatterjee, Thorkil Flensmark B. Neergaard, Maximilian Oberndoerfer, Andreas Maier, Siming Bayer
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01137
Pdf URL: https://arxiv.org/pdf/2603.01137
Copy Paste: [[2603.01137]] A Deep Learning Framework for Heat Demand Forecasting using Time-Frequency Representations of Decomposed Features(https://arxiv.org/abs/2603.01137)
Keywords: foundation model
Abstract: District Heating Systems are essential infrastructure for delivering heat to consumers across a geographic region sustainably, yet efficient management relies on optimizing diverse energy sources, such as wood, gas, electricity, and solar, in response to fluctuating demand. Aligning supply with demand is critical not only for ensuring reliable heat distribution but also for minimizing carbon emissions and extending infrastructure lifespan through lower operating temperatures. However, accurate multi-step forecasting to support these goals remains challenging due to complex, non-linear usage patterns and external dependencies. In this work, we propose a novel deep learning framework for day-ahead heat demand prediction that leverages time-frequency representations of historical data. By applying Continuous Wavelet Transform to decomposed demand and external meteorological factors, our approach enables Convolutional Neural Networks to learn hierarchical temporal features that are often inaccessible to standard time domain models. We systematically evaluate this method against statistical baselines, state-of-the-art Transformers, and emerging foundation models using multi-year data from three distinct Danish districts, a Danish city, and a German city. The results show a significant advancement, reducing the Mean Absolute Error by 36% to 43% compared to the strongest baselines, achieving forecasting accuracy of up to 95% across annual test datasets. Qualitative and statistical analyses further confirm the accuracy and robustness by reliably tracking volatile demand peaks where others fail. This work contributes both a high-performance forecasting architecture and critical insights into optimal feature composition, offering a validated solution for modern energy applications.

Title: Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers

Authors: Kuai Jiang, Zhaoyan Ding, Guijuan Zhang, Dianjie Lu, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01140
Pdf URL: https://arxiv.org/pdf/2603.01140
Copy Paste: [[2603.01140]] Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers(https://arxiv.org/abs/2603.01140)
Keywords: generative
Abstract: Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
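
The orthogonality constraint in component (2) can be pictured with a small sketch. The abstract does not give the loss, so the squared-cosine penalty below is an assumption, and the names `cosine` and `orthogonality_penalty` are hypothetical. It only shows one common way to push paired content and noise feature vectors toward orthogonality.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero

def orthogonality_penalty(content_feats, noise_feats):
    """Mean squared cosine similarity over paired feature vectors.

    Assumed form of the constraint (not taken from the paper): the value is
    0 when every content/noise pair is orthogonal and approaches 1 when
    the pairs are parallel.
    """
    sims = [cosine(c, n) ** 2 for c, n in zip(content_feats, noise_feats)]
    return sum(sims) / len(sims)

# Orthogonal pair: no penalty. Parallel pair: penalty close to 1.
orthogonal = orthogonality_penalty([[1.0, 0.0]], [[0.0, 2.0]])
aligned = orthogonality_penalty([[1.0, 1.0]], [[2.0, 2.0]])
print(orthogonal, aligned)
```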

Title: ArtLLM: Generating Articulated Assets via 3D LLM

Authors: Penghao Wang, Siyuan Xie, Hongyu Yan, Xianghui Yang, Jingwei Huang, Chunchao Guo, Jiayuan Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01142
Pdf URL: https://arxiv.org/pdf/2603.01142
Copy Paste: [[2603.01142]] ArtLLM: Generating Articulated Assets via 3D LLM(https://arxiv.org/abs/2603.01142)
Keywords: generative
Abstract: Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object's point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.

Title: Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification

Authors: Jacob Devasier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01190
Pdf URL: https://arxiv.org/pdf/2603.01190
Copy Paste: [[2603.01190]] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification(https://arxiv.org/abs/2603.01190)
Keywords: diffusion
Abstract: Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.

Title: Generative AI & Fictionality: How Novels Power Large Language Models

Authors: Edwin Roland, Richard Jean So
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01220
Pdf URL: https://arxiv.org/pdf/2603.01220
Copy Paste: [[2603.01220]] Generative AI & Fictionality: How Novels Power Large Language Models(https://arxiv.org/abs/2603.01220)
Keywords: generative
Abstract: Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels' effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's various modes of cultural production must account for a relatively novel dimension: computational training data.

Title: Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation

Authors: Liwen Sun, Xiang Yu, Ming Tan, Zhuohao Chen, Anqi Cheng, Ashutosh Joshi, Chenyan Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01252
Pdf URL: https://arxiv.org/pdf/2603.01252
Copy Paste: [[2603.01252]] Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation(https://arxiv.org/abs/2603.01252)
Keywords: in-context
Abstract: Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce KG-Followup, a Knowledge Graph-augmented LLM with active in-context learning that generates relevant and important follow-up questions and serves as a critical module for pre-diagnostic assessment. The structured medical domain knowledge graph supplies the professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% to 8% in recall on relevant benchmarks.

Title: Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography

Authors: Timofey Efimov, Singanallur Venkatakrishnan, Maliha Hossain, Haley Duba-Sullivan, Amirkoushyar Ziabari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01253
Pdf URL: https://arxiv.org/pdf/2603.01253
Copy Paste: [[2603.01253]] Cross-Modal Guidance for Fast Diffusion-Based Computed Tomography(https://arxiv.org/abs/2603.01253)
Keywords: diffusion
Abstract: Diffusion models have emerged as powerful priors for solving inverse problems in computed tomography (CT). In certain applications, such as neutron CT, it can be expensive to collect large amounts of measurements even for a single scan, leading to sparse data sets from which it is challenging to obtain high quality reconstructions even with diffusion models. One strategy to mitigate this challenge is to leverage a complementary, easily available imaging modality; however, such approaches typically require retraining the diffusion model with large datasets. In this work, we propose incorporating an additional modality without retraining the diffusion prior, enabling accelerated imaging of costly modalities. We further examine the impact of imperfect side modalities on cross-modal guidance. Our method is evaluated on sparse-view neutron computed tomography, where reconstruction quality is substantially improved by incorporating X-ray computed tomography of the same samples.

Title: Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

Authors: Adel Javanmard, Baharan Mirzasoleiman, Vahab Mirrokni
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01293
Pdf URL: https://arxiv.org/pdf/2603.01293
Copy Paste: [[2603.01293]] Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models(https://arxiv.org/abs/2603.01293)
Keywords: in-context
Abstract: Large Language Models (LLMs) are pretrained on massive datasets and later instruction-tuned via supervised fine-tuning (SFT) or reinforcement learning (RL). Best practices emphasize large, diverse pretraining data, whereas post-training operates differently: SFT relies on smaller, high-quality datasets, while RL benefits more from scale, with larger amounts of feedback often outweighing label quality. Yet it remains unclear why pretraining and RL require large datasets, why SFT excels on smaller ones, and what defines high-quality SFT data. In this work, we theoretically analyze transformers trained on an in-context weight prediction task for linear regression. Our analysis reveals several key findings: (i) balanced pretraining data can induce latent capabilities later activated during post-training, and (ii) SFT learns best from a small set of examples challenging for the pretrained model, while excessively large SFT datasets may dilute informative pretraining signals. In contrast, RL is most effective on large-scale data that is not overly difficult for the pretrained model. We validate these theoretical insights with experiments on large nonlinear transformer architectures.

Title: AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

Authors: Zhen Qu, Xian Tao, Xiaoyi Bao, Dingrong Wang, ShiChen Qu, Zhengtao Zhang, Xingang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01305
Pdf URL: https://arxiv.org/pdf/2603.01305
Copy Paste: [[2603.01305]] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models(https://arxiv.org/abs/2603.01305)
Keywords: anomaly
Abstract: Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens, [SEG], [NOR], and [ANO], establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.

Title: You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image

Authors: Taoyue Wang, Xiang Zhang, Xiaotian Li, Huiyuan Yang, Lijun Yin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01328
Pdf URL: https://arxiv.org/pdf/2603.01328
Copy Paste: [[2603.01328]] You Only Need One Stage: Novel-View Synthesis From A Single Blind Face Image(https://arxiv.org/abs/2603.01328)
Keywords: diffusion, generative
Abstract: We propose a novel one-stage method, NVB-Face, for generating consistent Novel-View images directly from a single Blind Face image. Existing approaches to novel-view synthesis for objects or faces typically require a high-resolution RGB image as input. When dealing with degraded images, the conventional pipeline follows a two-stage process: first restoring the image to high resolution, then synthesizing novel views from the restored result. However, this approach is highly dependent on the quality of the restored image, often leading to inaccuracies and inconsistencies in the final output. To address this limitation, we extract single-view features directly from the blind face image and introduce a feature manipulator that transforms these features into 3D-aware, multi-view latent representations. Leveraging the powerful generative capacity of a diffusion model, our framework synthesizes high-quality, consistent novel-view face images. Experimental results show that our method significantly outperforms traditional two-stage approaches in both consistency and fidelity.

Title: MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Authors: Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, Wenke Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01331
Pdf URL: https://arxiv.org/pdf/2603.01331
Copy Paste: [[2603.01331]] MetaState: Persistent Working Memory for Discrete Diffusion Language Models(https://arxiv.org/abs/2603.01331)
Keywords: diffusion
Abstract: Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.

Title: Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth

Authors: Andrew Wang, Mike Davies
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01332
Pdf URL: https://arxiv.org/pdf/2603.01332
Copy Paste: [[2603.01332]] Perspective-Equivariant Fine-tuning for Multispectral Demosaicing without Ground Truth(https://arxiv.org/abs/2603.01332)
Keywords: foundation model
Abstract: Multispectral demosaicing is crucial to reconstruct full-resolution spectral images from snapshot mosaiced measurements, enabling real-time imaging from neurosurgery to autonomous driving. Classical methods are blurry, while supervised learning requires costly ground truth (GT) obtained from slow line-scanning systems. We propose Perspective-Equivariant Fine-tuning for Demosaicing (PEFD), a framework that learns multispectral demosaicing from mosaiced measurements alone. PEFD a) exploits the projective geometry of camera-based imaging systems to leverage a richer group structure than previous demosaicing methods to recover more null-space information, and b) learns efficiently without GT by adapting pretrained foundation models designed for 1-3 channel imaging. On intraoperative and automotive datasets, PEFD recovers fine details such as blood vessels and preserves spectral fidelity, substantially outperforming recent approaches, nearing supervised performance.

Title: Provable and Practical In-Context Policy Optimization for Self-Improvement

Authors: Tianrun Yu, Yuxiao Yang, Zhaoyang Wang, Kaixiang Zhao, Porter Jenkins, Xuchao Zhang, Chetan Bansal, Huaxiu Yao, Weitong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01335
Pdf URL: https://arxiv.org/pdf/2603.01335
Copy Paste: [[2603.01335]] Provable and Practical In-Context Policy Optimization for Self-Improvement(https://arxiv.org/abs/2603.01335)
Keywords: in-context
Abstract: We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in-context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling for mathematical reasoning.

Title: UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification

Authors: Yessin Moakher, Youssef Attia El Hili, Vasilii Feofanov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01348
Pdf URL: https://arxiv.org/pdf/2603.01348
Copy Paste: [[2603.01348]] UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification(https://arxiv.org/abs/2603.01348)
Keywords: self-supervised, foundation model
Abstract: Self-supervised foundation models have achieved remarkable success across domains, including time series. However, the potential of non-contrastive methods, a paradigm that has driven significant advances in computer vision, remains underexplored for time series. In this work, we adapt DINOv2-style self-distillation to pretrain a time series foundation model, building on the Mantis tokenizer and transformer encoder architecture as our backbone. Through a student-teacher framework, our method Utica learns representations that capture both temporal invariance via augmented crops and fine-grained local structure via patch masking. Our approach achieves state-of-the-art classification performance on both UCR and UEA benchmarks. These results suggest that non-contrastive methods are a promising and complementary pretraining strategy for time series foundation models.

Title: DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking

Authors: Gilad Turok, Chris De Sa, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01367
Pdf URL: https://arxiv.org/pdf/2603.01367
Copy Paste: [[2603.01367]] DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking(https://arxiv.org/abs/2603.01367)
Keywords: diffusion, generative
Abstract: Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper perplexity evaluation: the ELBO is a loose bound on likelihood under the training distribution, not the test-time distribution, while generative perplexity requires a biased external model and ignores diversity. To address this, we introduce the DUEL framework, which formalizes deterministic position selection, unifying leading MDM sampling strategies. We prove DUEL admits exact likelihood computation via a simple algorithm, evaluated under the same position selection used at test time. This gives MDMs proper perplexity for the first time -- the natural analogue of autoregressive perplexity. With proper perplexity in hand, we revisit key questions about MDMs. MDMs are substantially better than previously thought: the MDM-autoregressive perplexity gap shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets -- an analysis impossible with the ELBO and unreliable with generative perplexity -- identifying probability margin [kim2025train] as a strong default. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models -- achieving 36.47 vs. 52.11 perplexity on AG News -- demonstrating the ceiling of MDM performance has not yet been reached.
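
One way to see why deterministic position selection can make the likelihood exact is a toy sketch. It is an illustration under stated assumptions, not the authors' algorithm: `toy_denoiser`, `select_position`, and the confidence-based selection rule are all hypothetical stand-ins. Because the next position is a deterministic function of the current partially revealed sequence, scoring a sequence reduces to replaying that same rule and multiplying one conditional per step.

```python
import itertools
import math

def toy_denoiser(state):
    """Hypothetical stand-in for an MDM: one distribution over {0, 1} per position.

    `state` is a list whose entries are a token (0 or 1) or None for a mask.
    """
    revealed_sum = sum(t for t in state if t is not None)
    dists = []
    for i in range(len(state)):
        logit = 0.5 * (i + 1) - 0.7 * revealed_sum
        p1 = 1.0 / (1.0 + math.exp(-logit))
        dists.append([1.0 - p1, p1])
    return dists

def select_position(state, dists):
    """Deterministic rule (assumed): most confident masked position, lowest index on ties."""
    masked = [i for i, t in enumerate(state) if t is None]
    return max(masked, key=lambda i: (max(dists[i]), -i))

def log_likelihood(x):
    """Log-probability of x under the sampler that uses the same selection rule."""
    state = [None] * len(x)
    total = 0.0
    while any(t is None for t in state):
        dists = toy_denoiser(state)
        pos = select_position(state, dists)
        total += math.log(dists[pos][x[pos]])
        state[pos] = x[pos]  # reveal the true token, then continue
    return total

# Sanity check: the probabilities of all length-3 sequences should sum to 1.
total_mass = sum(math.exp(log_likelihood(x))
                 for x in itertools.product([0, 1], repeat=3))
print(total_mass)
```

The sum-to-one check is what distinguishes an exact likelihood from a bound: every sequence is reached by exactly one unmasking path, so no marginalization over orderings is needed in this toy setting.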

Title: Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning

Authors: Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01385
Pdf URL: https://arxiv.org/pdf/2603.01385
Copy Paste: [[2603.01385]] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning(https://arxiv.org/abs/2603.01385)
Keywords: foundation model
Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.

Title: One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Authors: Lennon J. Shikhman
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2603.01406
Pdf URL: https://arxiv.org/pdf/2603.01406
Copy Paste: [[2603.01406]] One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers(https://arxiv.org/abs/2603.01406)
Keywords: foundation model
Abstract: Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.

Title: UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Authors: Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.01418
Pdf URL: https://arxiv.org/pdf/2603.01418
Copy Paste: [[2603.01418]] UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation(https://arxiv.org/abs/2603.01418)
Keywords: diffusion
Abstract: While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.

Title: LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Authors: Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.01425
Pdf URL: https://arxiv.org/pdf/2603.01425
Copy Paste: [[2603.01425]] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval(https://arxiv.org/abs/2603.01425)
Keywords: generative
Abstract: LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.

Title: DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

Authors: Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01433
Pdf URL: https://arxiv.org/pdf/2603.01433
Copy Paste: [[2603.01433]] DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis(https://arxiv.org/abs/2603.01433)
Keywords: diffusion, generative
Abstract: We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
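
The calibration finding can be illustrated with a toy experiment. The sketch below is not the benchmark's protocol: the score distributions, the 2% tampered fraction, the threshold grid, and the helper names (`f1_at`, `best_threshold`, `synthetic_pixels`) are all assumptions chosen for illustration. It shows how scores that separate the classes well can still give near-zero F1 at tau=0.5, and how a threshold picked on a small calibration sample recovers most of the oracle F1.

```python
import numpy as np

def f1_at(scores, labels, tau):
    """Pixel-level F1 when pixels with score >= tau are predicted tampered."""
    pred = scores >= tau
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, labels, grid):
    """Threshold on the grid that maximizes F1 for the given pixels."""
    return max(grid, key=lambda tau: f1_at(scores, labels, tau))

def synthetic_pixels(rng, n, tampered_frac=0.02):
    """Toy scores (assumed): tampered pixels score higher, but well below 0.5."""
    labels = rng.random(n) < tampered_frac
    scores = np.where(labels, rng.normal(0.30, 0.05, n), rng.normal(0.10, 0.05, n))
    return np.clip(scores, 0.0, 1.0), labels

rng = np.random.default_rng(0)
grid = np.linspace(0.01, 0.99, 99)
cal_scores, cal_labels = synthetic_pixels(rng, 10_000)     # small calibration sample
test_scores, test_labels = synthetic_pixels(rng, 200_000)  # held-out pixels

tau_cal = best_threshold(cal_scores, cal_labels, grid)
f1_fixed = f1_at(test_scores, test_labels, 0.5)        # standard fixed threshold
f1_adapted = f1_at(test_scores, test_labels, tau_cal)  # threshold adapted on calibration data
f1_oracle = f1_at(test_scores, test_labels,
                  best_threshold(test_scores, test_labels, grid))
print(f"tau_cal={tau_cal:.2f} fixed={f1_fixed:.3f} "
      f"adapted={f1_adapted:.3f} oracle={f1_oracle:.3f}")
```

In this toy setting the model is never retrained; only the decision threshold changes between the fixed and adapted cases.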

Title: Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

Authors: Thomas Rückstieß, Robin Vujanic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01444
Pdf URL: https://arxiv.org/pdf/2603.01444
Copy Paste: [[2603.01444]] Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data(https://arxiv.org/abs/2603.01444)
Keywords: diffusion
Abstract: Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems - from document databases, REST APIs to data lakes - which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening of nested data into wide, sparse tables which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
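
As a rough illustration of tokenizing a nested record into key, value and structural tokens, the sketch below walks a JSON-like Python object depth-first. The token names (`OBJ_START`, `ARR_END`, `KEY`, `VAL`) and the `tokenize` helper are hypothetical; the abstract does not specify Origami's actual vocabulary or encoding.

```python
def tokenize(value, tokens=None):
    """Depth-first walk that emits key, value and structural tokens."""
    if tokens is None:
        tokens = []
    if isinstance(value, dict):
        tokens.append("OBJ_START")
        for key, child in value.items():
            tokens.append(("KEY", key))
            tokenize(child, tokens)
        tokens.append("OBJ_END")
    elif isinstance(value, list):
        tokens.append("ARR_START")
        for child in value:
            tokenize(child, tokens)
        tokens.append("ARR_END")
    else:
        tokens.append(("VAL", value))
    return tokens

# A record with a variable-length array and a nested object.
record = {"name": "Ada", "tags": ["x", "y"], "address": {"city": "Oslo"}}
tokens = tokenize(record)

# A sparse record simply produces a shorter sequence; no padding or imputation.
sparse_tokens = tokenize({"name": "Bo"})
```

Because absent fields emit no tokens at all, sparsity costs nothing in sequence length, whereas flattening to a wide table needs a column for every key seen anywhere in the data.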

Title: Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection

Authors: Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01450
Pdf URL: https://arxiv.org/pdf/2603.01450
Copy Paste: [[2603.01450]] Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection(https://arxiv.org/abs/2603.01450)
Keywords: foundation model, anomaly
Abstract: The rapid advancement of deepfake generation techniques poses significant threats to public safety and causes societal harm through the creation of highly realistic synthetic facial media. While existing detection methods demonstrate limitations in generalizing to emerging forgery patterns, this paper presents Deepfake Forensics Adapter (DFA), a novel dual-stream framework that synergizes vision-language foundation models with targeted forensics analysis. Our approach integrates a pre-trained CLIP model with three core components to achieve specialized deepfake detection by leveraging the powerful general capabilities of CLIP without changing CLIP parameters: 1) A Global Feature Adapter is used to identify global inconsistencies in image content that may indicate forgery, 2) A Local Anomaly Stream enhances the model's ability to perceive local facial forgery cues by explicitly leveraging facial structure priors, and 3) An Interactive Fusion Classifier promotes deep interaction and fusion between global and local features using a transformer encoder. Extensive evaluations of frame-level and video-level benchmarks demonstrate the superior generalization capabilities of DFA, particularly achieving state-of-the-art performance in the challenging DFDC dataset with frame-level AUC/EER of 0.816/0.256 and video-level AUC/EER of 0.836/0.251, representing a 4.8% video AUC improvement over previous methods. Our framework not only demonstrates state-of-the-art performance, but also points out a feasible and effective direction for developing a robust deepfake detection system with enhanced generalization capabilities against the evolving deepfake threats. Our code is available at this https URL

Title: Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection

Authors: Kai Zheng, Hang-Cheng Dong, Zhenkai Wu, Fupeng Wei, Wei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01498
Pdf URL: https://arxiv.org/pdf/2603.01498
Copy Paste: [[2603.01498]] Tri-path DINO: Feature Complementary Learning for Remote Sensing Multi-Class Change Detection(https://arxiv.org/abs/2603.01498)
Keywords: foundation model
Abstract: In remote sensing imagery, multi-class change detection (MCD) is crucial for fine-grained monitoring, yet it has long been constrained by complex scene variations and the scarcity of detailed annotations. To address this, we propose the Tri-path DINO architecture, which adopts a three-path complementary feature learning strategy to facilitate the rapid adaptation of pre-trained foundation models to complex vertical domains. Specifically, we employ the DINOv3 pre-trained model as the backbone feature extraction network to learn coarse-grained features. An auxiliary path also adopts a siamese structure, progressively aggregating intermediate features from the siamese encoder to enhance the learning of fine-grained features. Finally, a multi-scale attention mechanism is introduced to augment the decoder network, where parallel convolutions adaptively capture and enhance contextual information under different receptive fields. The proposed method achieves optimal performance on the MCD task on both the Gaza facility damage assessment dataset (Gaza change) and the classic SECOND dataset. GradCAM visualizations further confirm that the main and auxiliary paths naturally focus on coarse-grained semantic changes and fine-grained structural details, respectively. This synergistic complementarity provides a robust and interpretable solution for advanced change detection tasks, offering a basis for rapid and accurate damage assessment.

Title: Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling

Authors: Zillur Rahman, Alex Sheng, Cristian Meo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01509
Pdf URL: https://arxiv.org/pdf/2603.01509
Copy Paste: [[2603.01509]] Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling(https://arxiv.org/abs/2603.01509)
Keywords: diffusion, generative
Abstract: While large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models, these models remain highly sensitive to input prompts, demonstrating that prompt design is critical to generation quality. Current methods for improving video output often fall short: they either depend on complex post-editing models, risking the introduction of artifacts, or require expensive fine-tuning of the core generator, which severely limits both scalability and accessibility. In this work, we introduce 3R, a novel RAG-based prompt optimization framework. 3R utilizes the power of a current state-of-the-art T2V diffusion model and a vision-language model. It can be used with any T2V model without any kind of model training. The framework leverages three key strategies: RAG-based modifier extraction for enriched contextual grounding, diffusion-based Preference Optimization for aligning outputs with human preferences, and temporal frame interpolation for producing temporally consistent visual content. Together, these components enable more accurate, efficient, and contextually aligned text-to-video generation. Experimental results demonstrate the efficacy of 3R in enhancing the static fidelity and dynamic coherence of generated videos, underscoring the importance of optimizing user prompts.

Title: FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Authors: Hanxiao Wang, Yuan-Chen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01515
Pdf URL: https://arxiv.org/pdf/2603.01515
Copy Paste: [[2603.01515]] FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation(https://arxiv.org/abs/2603.01515)
Keywords: diffusion
Abstract: Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
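
The factor-of-nine reduction follows from simple counting, sketched below. The assumption here (not stated in the abstract) is that the baseline sequence spends one token per vertex coordinate and that the compression ratio is face tokens divided by coordinate tokens; the function names are illustrative.

```python
def coordinate_tokens(num_faces):
    """Baseline: 3 vertices per triangle, 3 coordinates per vertex, one token each."""
    return num_faces * 3 * 3

def face_tokens(num_faces):
    """One-face-one-token: a single token per triangle."""
    return num_faces

n = 1000  # arbitrary mesh size; the ratio does not depend on it
ratio = face_tokens(n) / coordinate_tokens(n)
print(round(ratio, 2))  # 1/9 rounded to two decimals
```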

Title: Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing

Authors: Zijin Yin, Bing Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01535
Pdf URL: https://arxiv.org/pdf/2603.01535
Copy Paste: [[2603.01535]] Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing(https://arxiv.org/abs/2603.01535)
Keywords: diffusion, generative
Abstract: Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

Title: RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry

Authors: Xinchang Wang, Yunhao Chen, Yuechen Zhang, Congcong Bian, Zihao Guo, Xingjun Ma, Hui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01544
Pdf URL: https://arxiv.org/pdf/2603.01544
Copy Paste: [[2603.01544]] RA-Det: Towards Universal Detection of AI-Generated Images via Robustness Asymmetry(https://arxiv.org/abs/2603.01544)
Keywords: generative
Abstract: Recent image generators produce photo-realistic content that undermines the reliability of downstream recognition systems. As visual appearance cues become less pronounced, appearance-driven detectors that rely on forensic cues or high-level representations lose stability. This motivates a shift from appearance to behavior, focusing on how images respond to controlled perturbations rather than how they look. In this work, we identify a simple and universal behavioral signal. Natural images preserve stable semantic representations under small, structured perturbations, whereas generated images exhibit markedly larger feature drift. We refer to this phenomenon as robustness asymmetry and provide a theoretical analysis that establishes a lower bound connecting this asymmetry to memorization tendencies in generative models, explaining its prevalence across architectures. Building on this insight, we introduce Robustness Asymmetry Detection (RA-Det), a behavior-driven detection framework that converts robustness asymmetry into a reliable decision signal. Evaluated across 14 diverse generative models and against more than 10 strong detectors, RA-Det achieves superior performance, improving the average performance by 7.81 percent. The method is data- and model-agnostic, requires no generator fingerprints, and transfers across unseen generators. Together, these results indicate that robustness asymmetry is a stable, general cue for synthetic-image detection and that carefully designed probing can turn this cue into a practical, universal detector. The source code is publicly available on GitHub.

Title: PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification

Authors: Jian Yu, Joakim Nguyen, Jinrui Fang, Awais Naeem, Zeyuan Cao, Sanjay Krishnan, Nicholas Konz, Tianlong Chen, Chandra Krishnan, Hairong Wang, Edward Castillo, Ying Ding, Ankita Shukla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01547
Pdf URL: https://arxiv.org/pdf/2603.01547
Copy Paste: [[2603.01547]] PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification(https://arxiv.org/abs/2603.01547)
Keywords: foundation model
Abstract: Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.

Title: Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder

Authors: Ayantika Das, Keerthi Ram, Mohanasankar Sivaprakasam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01552
Pdf URL: https://arxiv.org/pdf/2603.01552
Copy Paste: [[2603.01552]] Align-cDAE: Alzheimer's Disease Progression Modeling with Attention-Aligned Conditional Diffusion Auto-Encoder(https://arxiv.org/abs/2603.01552)
Keywords: diffusion, generative
Abstract: Generative AI framework-based modeling and prediction of longitudinal human brain images offer an efficient mechanism to track neurodegenerative progression, essential for the assessment of diseases like Alzheimer's. Among the existing generative approaches, recent diffusion-based models have emerged as an effective alternative to generate disease progression images. Incorporating multi-modal and non-imaging attributes as conditional information into diffusion frameworks has been shown to improve controllability during such generations. However, existing methods do not explicitly ensure that information from non-imaging conditioning modalities is meaningfully aligned with image features to introduce desirable changes in the generated images, such as modulation of progression-specific regions. Further, more precise control over the generation process can be achieved by introducing progression-relevant structure into the internal representations of the model, lacking in the existing approaches. To address these limitations, we propose a diffusion autoencoder-based framework for disease progression modeling that explicitly enforces alignment between different modalities. The alignment is enforced by introducing an explicit objective function that enables the model to focus on the regions exhibiting progression-related changes. Further, we devise a mechanism to better structure the latent representational space of the diffusion auto-encoding framework. Specifically, we assign separate latent subspaces for integrating progression-related conditions and retaining subject-specific identity information, allowing better-controlled image generation. These results demonstrate that enforcing alignment and better structuring of the latent representational space of diffusion auto-encoding framework leads to more anatomically precise modeling of Alzheimer's disease progression.

Title: LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Authors: Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01563
Pdf URL: https://arxiv.org/pdf/2603.01563
Copy Paste: [[2603.01563]] LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models(https://arxiv.org/abs/2603.01563)
Keywords: diffusion
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design effectively bypasses the errors inherent in likelihood approximation, yielding precise gradient estimation. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.

Title: Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Authors: Saurabh Kaushik, Lalit Maurya, Beth Tellman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01576
Pdf URL: https://arxiv.org/pdf/2603.01576
Copy Paste: [[2603.01576]] Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications(https://arxiv.org/abs/2603.01576)
Keywords: foundation model
Abstract: Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks, including multiple domains, and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully finetuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with finetuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (GitHub: this https URL).

Title: SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis

Authors: Chuqiao Wu, Jin Song, Yiyun Fei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01579
Pdf URL: https://arxiv.org/pdf/2603.01579
Copy Paste: [[2603.01579]] SkeleGuide: Explicit Skeleton Reasoning for Context-Aware Human-in-Place Image Synthesis(https://arxiv.org/abs/2603.01579)
Keywords: generative
Abstract: Generating realistic and structurally plausible human images into existing scenes remains a significant challenge for current generative models, which often produce artifacts like distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding the synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually-aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.

Title: Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

Authors: Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01580
Pdf URL: https://arxiv.org/pdf/2603.01580
Copy Paste: [[2603.01580]] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models(https://arxiv.org/abs/2603.01580)
Keywords: generative
Abstract: Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.

Title: FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems

Authors: Minwoo Kim, Seunghyeok Shin, Hongki Lim
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.01591
Pdf URL: https://arxiv.org/pdf/2603.01591
Copy Paste: [[2603.01591]] FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems(https://arxiv.org/abs/2603.01591)
Keywords: diffusion
Abstract: Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel→latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5x speedup, without hand-coded adjoints or inner MCMC.

Title: Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference

Authors: Jiaqi Leng, Shuyuan Tu, Haidong Cao, Sicheng Xie, Daoguo Dong, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01594
Pdf URL: https://arxiv.org/pdf/2603.01594
Copy Paste: [[2603.01594]] Preference Score Distillation: Leveraging 2D Rewards to Align Text-to-3D Generation with Human Preference(https://arxiv.org/abs/2603.01594)
Keywords: diffusion
Abstract: Human preference alignment presents a critical yet underexplored challenge for diffusion models in text-to-3D generation. Existing solutions typically require task-specific fine-tuning, posing significant hurdles in data-scarce 3D domains. To address this, we propose Preference Score Distillation (PSD), an optimization-based framework that leverages pretrained 2D reward models for human-aligned text-to-3D synthesis without 3D training data. Our key insight stems from the incompatibility of pixel-level gradients: due to the absence of noisy samples during reward model training, direct application of 2D reward gradients disturbs the denoising process. Noticing that a similar issue occurs in naive classifier guidance for conditioned diffusion models, we fundamentally rethink preference alignment as a classifier-free guidance (CFG)-style mechanism through our implicit reward model. Furthermore, recognizing that frozen pretrained diffusion models constrain performance, we introduce an adaptive strategy to co-optimize preference scores and negative text embeddings. By incorporating CFG during optimization, online refinement of negative text embeddings dynamically enhances alignment. To our knowledge, we are the first to bridge human preference alignment with CFG theory under the score distillation framework. Experiments demonstrate the superiority of PSD in aesthetic metrics, seamless integration with diverse pipelines, and strong extensibility.

Title: Sparse View Distractor-Free Gaussian Splatting

Authors: Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01603
Pdf URL: https://arxiv.org/pdf/2603.01603
Copy Paste: [[2603.01603]] Sparse View Distractor-Free Gaussian Splatting(https://arxiv.org/abs/2603.01603)
Keywords: foundation model
Abstract: 3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from their reliance on color residual heuristics to guide training, which become unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.

Title: Information-Theoretic Digital Twins for Stealthy Attack Detection in Industrial Control Systems: A Closed-Form KL Divergence Approach

Authors: Inda Kreso, Mehran Tarif, Fatemeh Moradi, Iman Khazrak, Mostafa M Rezaee, Mohammadhossein Homaei
Subjects: cs.CR, math.OC
Abstract URL: https://arxiv.org/abs/2603.01621
Pdf URL: https://arxiv.org/pdf/2603.01621
Copy Paste: [[2603.01621]] Information-Theoretic Digital Twins for Stealthy Attack Detection in Industrial Control Systems: A Closed-Form KL Divergence Approach(https://arxiv.org/abs/2603.01621)
Keywords: anomaly
Abstract: Digital twins (DTs) are increasingly used to monitor and secure Industrial Control Systems (ICS), yet detecting stealthy False Data Injection Attacks (FDIAs) that manipulate system states within normal physical bounds remains challenging. Deep learning anomaly detectors often over-generalize such subtle manipulations, while classical fault detection methods do not scale well in highly correlated multivariate systems. We propose a closed-loop Information-Theoretic Digital Twin (IT-DT) framework for real-time anomaly detection. N4SID identification is combined with steady-state Kalman filtering to quantify residual distribution shifts via closed-form KL divergence, capturing both mean deviations and malicious cross-covariance shifts. Evaluations on the SWaT and WADI datasets show that IT-DT achieves F1-scores of 0.832 and 0.615, respectively, with better precision than deep learning baselines such as TranAD. Computational profiling indicates that the analytical approach requires minimal memory and provides approximately a 600x inference speedup over transformer-based methods on CPU hardware. This makes the framework suitable for resource-constrained industrial edge controllers without GPU acceleration.
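The closed-form KL divergence mentioned in this abstract can be illustrated with the standard formula for two multivariate Gaussians. The sketch below is not the authors' implementation: the function name `gaussian_kl`, the use of NumPy, and the toy covariances are assumptions made for illustration. It shows that a shift in cross-covariance alone yields a positive divergence even when means and per-channel variances are unchanged.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in nats, standard closed form."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.shape[0]
    diff = mu1 - mu0
    # tr(cov1^{-1} cov0), computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    # Mahalanobis term (mu1 - mu0)^T cov1^{-1} (mu1 - mu0)
    maha_term = diff @ np.linalg.solve(cov1, diff)
    # log-determinants via slogdet for numerical stability
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - k + logdet1 - logdet0)

# Hypothetical residual statistics: nominal residuals are uncorrelated,
# observed residuals keep zero mean and unit variance but become correlated.
nominal_cov = np.eye(2)
observed_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
print(gaussian_kl([0.0, 0.0], observed_cov, [0.0, 0.0], nominal_cov))
```

A detector built on this quantity would compare it against a threshold; how the threshold is chosen is not stated in the abstract.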

Title: Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Authors: Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01623
Pdf URL: https://arxiv.org/pdf/2603.01623
Copy Paste: [[2603.01623]] Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration(https://arxiv.org/abs/2603.01623)
Keywords: diffusion
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79x speedup on FLUX.1 and 4.67x speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
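The idea of fitting Chebyshev coefficients by ridge regression and then forecasting features at later steps can be sketched as follows. This is an illustrative toy, not the Spectrum implementation: the function names, the mapping of diffusion time onto [-1, 1], the polynomial degree, and the regularization strength are all assumptions.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

def fit_chebyshev_ridge(times, feats, degree=3, lam=1e-3):
    """Fit per-dimension Chebyshev coefficients with ridge regression.

    times: shape (n,), assumed already rescaled to [-1, 1]
    feats: shape (n, d), cached features at those times
    returns: coefficients of shape (degree + 1, d)
    """
    phi = chebvander(times, degree)                  # (n, degree + 1) basis matrix
    gram = phi.T @ phi + lam * np.eye(degree + 1)    # ridge-regularized normal equations
    return np.linalg.solve(gram, phi.T @ feats)

def forecast(coeffs, times):
    """Evaluate the fitted expansion at one or more (possibly future) times."""
    phi = chebvander(np.atleast_1d(times), coeffs.shape[0] - 1)
    return phi @ coeffs

# Toy 2-dimensional "features" that vary smoothly with time.
observed_t = np.linspace(-1.0, 0.2, 6)
observed_f = np.stack([observed_t**2, 1.0 - observed_t], axis=1)

coeffs = fit_chebyshev_ridge(observed_t, observed_f, degree=2, lam=1e-8)
print(forecast(coeffs, [0.6, 0.9]))  # predictions at steps that were never evaluated
```

Because one fit covers the whole time interval, a single set of coefficients can serve several skipped steps, which matches the abstract's contrast with purely local approximation.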

Title: PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts

Authors: Xianqi Wang, Hao Yang, Hangtian Wang, Junda Cheng, Gangwei Xu, Min Lin, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01650
Pdf URL: https://arxiv.org/pdf/2603.01650
Copy Paste: [[2603.01650]] PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts(https://arxiv.org/abs/2603.01650)
Keywords: foundation model
Abstract: Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.

Title: Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

Authors: Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2603.01655
Pdf URL: https://arxiv.org/pdf/2603.01655
Copy Paste: [[2603.01655]] Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling(https://arxiv.org/abs/2603.01655)
Keywords: generative
Abstract: Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the power of the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics to reduce the number of path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a comprehensive machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying such generative models to this domain presents significant challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key architectural components. First, we implement an experience replay buffer to capture and retain rare valid paths. Second, we adopt a uniform exploratory policy to improve generalization and prevent the model from overfitting to simple geometries. Third, we apply a physics-based action masking strategy that filters out physically impossible paths before the model even considers them. As demonstrated in our experimental validation, the proposed model achieves substantial speedups over exhaustive search -- up to 10x faster on GPU and 1000x faster on CPU -- while maintaining high coverage accuracy and successfully uncovering complex propagation paths. The complete source code, tests, and tutorial are available at this https URL.
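The scaling stated at the start of this abstract (candidate paths grow as the number of objects raised to the interaction order) can be made concrete with a small counting sketch. The helper names are hypothetical, and the enumeration ignores any geometric or physical filtering, so it only illustrates the size of the unpruned search space.

```python
from itertools import product

def count_candidate_paths(num_objects, order):
    """Unpruned number of candidate paths: one object choice per interaction."""
    return num_objects ** order

def enumerate_candidate_paths(num_objects, order):
    """Exhaustively list candidate paths as tuples of object indices."""
    return list(product(range(num_objects), repeat=order))

# The enumeration agrees with the closed-form count on a tiny scene.
tiny_scene = enumerate_candidate_paths(num_objects=4, order=3)
print(len(tiny_scene), count_candidate_paths(4, 3))

# Growth with interaction order for a scene of 100 objects.
for order in (1, 2, 3, 4):
    print(order, count_candidate_paths(100, order))
```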

Title: A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs

Authors: Aryan Goyal, Shreshtha Singh, Ashish Mittal, Manoj Tadepalli, Piyush Kumar, Preetham Putha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01659
Pdf URL: https://arxiv.org/pdf/2603.01659
Copy Paste: [[2603.01659]] A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs(https://arxiv.org/abs/2603.01659)
Keywords: diffusion
Abstract: Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging due to their subtle appearance and variability in radiological characteristics like size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep-learning-based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic-controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule-mask-conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues: overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.

Title: FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters

Authors: Shao Shitong, Gu Yufei, Xie Zeke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01685
Pdf URL: https://arxiv.org/pdf/2603.01685
Copy Paste: [[2603.01685]] FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters(https://arxiv.org/abs/2603.01685)
Keywords: generative
Abstract: The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose FastLightGen, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.

Title: DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs

Authors: Aryan Goyal, Ashish Mittal, Pranav Rao, Manoj Tadepalli, Preetham Putha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01686
Pdf URL: https://arxiv.org/pdf/2603.01686
Copy Paste: [[2603.01686]] DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs(https://arxiv.org/abs/2603.01686)
Keywords: diffusion, generative
Abstract: Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRRs), which use CT scans to generate synthetic frontal chest X-rays with artificially inserted lung nodules, offer one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for chest X-ray images that synergistically leverages denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a unique two-stage training process. First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity by posing it as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.

Title: CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions

Authors: Gong Chen, Chaokun Zhang, Pengcheng Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01688
Pdf URL: https://arxiv.org/pdf/2603.01688
Copy Paste: [[2603.01688]] CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions(https://arxiv.org/abs/2603.01688)
Keywords: diffusion
Abstract: Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. Then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.

Title: Building a Strong Instruction Language Model for a Less-Resourced Language

Authors: Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01691
Pdf URL: https://arxiv.org/pdf/2603.01691
Copy Paste: [[2603.01691]] Building a Strong Instruction Language Model for a Less-Resourced Language(https://arxiv.org/abs/2603.01691)
Keywords: generative
Abstract: Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.

Title: Dual Distillation for Few-Shot Anomaly Detection

Authors: Le Dong, Qinzhong Tan, Chunlei Li, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01713
Pdf URL: https://arxiv.org/pdf/2603.01713
Copy Paste: [[2603.01713]] Dual Distillation for Few-Shot Anomaly Detection(https://arxiv.org/abs/2603.01713)
Keywords: anomaly
Abstract: Anomaly detection is a critical task in computer vision with profound implications for medical imaging, where identifying pathologies early can directly impact patient outcomes. While recent unsupervised anomaly detection approaches show promise, they require substantial normal training data and struggle to generalize across anatomical contexts. We introduce D²4FAD, a novel dual distillation framework for few-shot anomaly detection that identifies anomalies in previously unseen tasks using only a small number of normal reference images. Our approach leverages a pre-trained encoder as a teacher network to extract multi-scale features from both support and query images, while a student decoder learns to distill knowledge from the teacher on query images and self-distill on support images. We further propose a learn-to-weight mechanism that dynamically assesses the reference value of each support image conditioned on the query, optimizing anomaly detection performance. To evaluate our method, we curate a comprehensive benchmark dataset comprising 13,084 images across four organs, four imaging modalities, and five disease categories. Extensive experiments demonstrate that D²4FAD significantly outperforms existing approaches, establishing a new state-of-the-art in few-shot medical anomaly detection. Code is available at this https URL.

Title: Bootstrapping Embeddings for Low Resource Languages

Authors: Merve Basoz, Andrew Horne, Mattia Opper
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01732
Pdf URL: https://arxiv.org/pdf/2603.01732
Copy Paste: [[2603.01732]] Bootstrapping Embeddings for Low Resource Languages(https://arxiv.org/abs/2603.01732)
Keywords: in-context
Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.

Title: Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence

Authors: Ihor Kendiukhov
Subjects: cs.LG, q-bio.CB, q-bio.GN
Abstract URL: https://arxiv.org/abs/2603.01752
Pdf URL: https://arxiv.org/pdf/2603.01752
Copy Paste: [[2603.01752]] Causal Circuit Tracing Reveals Distinct Computational Architectures in Single-Cell Foundation Models: Inhibitory Dominance, Biological Coherence, and Cross-Model Convergence(https://arxiv.org/abs/2603.01752)
Keywords: foundation model
Abstract: Motivation: Sparse autoencoders (SAEs) decompose foundation model activations into interpretable features, but causal feature-to-feature interactions across network depth remain unknown for biological foundation models. Results: We introduce causal circuit tracing by ablating SAE features and measuring downstream responses, and apply it to Geneformer V2-316M and scGPT whole-human across four conditions (96,892 edges, 80,191 forward passes). Both models show approximately 53 percent biological coherence and 65 to 89 percent inhibitory dominance, invariant to architecture and cell type. scGPT produces stronger effects (mean absolute d = 1.40 vs. 1.05) with more balanced dynamics. Cross-model consensus yields 1,142 conserved domain pairs (10.6x enrichment, p < 0.001). Disease-associated domains are 3.59x more likely to be consensus. Gene-level CRISPRi validation shows 56.4 percent directional accuracy, confirming co-expression rather than causal encoding.

Title: Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning

Authors: Zichen Tian, Yaoyao Liu, Qianru Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.01759
Pdf URL: https://arxiv.org/pdf/2603.01759
Copy Paste: [[2603.01759]] Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning(https://arxiv.org/abs/2603.01759)
Keywords: foundation model
Abstract: Training large foundation models from scratch for domain-specific applications is almost impossible due to data limits and long-tailed distributions -- taking remote sensing (RS) as an example. Fine-tuning natural image pre-trained models on RS images is a straightforward solution. To reduce computational costs and improve performance on tail classes, existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer. However, we observe that fixed hyperparameters -- such as intra-layer positions, layer depth, and scaling factors -- can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. MetaPEFT dynamically adjusts three key factors of PEFT on RS images: module insertion, layer selection, and module-wise learning rates, which collectively control the influence of PEFT modules across the network. We conduct extensive experiments on three transfer-learning scenarios and five datasets in both RS and natural image domains. The results show that MetaPEFT achieves state-of-the-art performance in cross-spectral adaptation, requiring only a small number of trainable parameters and improving tail-class accuracy significantly.

Title: Modular Memory is the Key to Continual Learning Agents

Authors: Vaggelis Dorovatas, Malte Schwerin, Andrew D. Bagdanov, Lucas Caccia, Antonio Carta, Laurent Charlin, Barbara Hammer, Tyler L. Hayes, Timm Hess, Christopher Kanan, Dhireesha Kudithipudi, Xialei Liu, Vincenzo Lomonaco, Jorge Mendez-Mendez, Darshan Patil, Ameya Prabhu, Elisa Ricci, Tinne Tuytelaars, Gido M. van de Ven, Liyuan Wang, Joost van de Weijer, Jonghyun Choi, Martin Mundt, Rahaf Aljundi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01761
Pdf URL: https://arxiv.org/pdf/2603.01761
Copy Paste: [[2603.01761]] Modular Memory is the Key to Continual Learning Agents(https://arxiv.org/abs/2603.01761)
Keywords: foundation model, in-context
Abstract: Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

Title: Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation

Authors: Minseok Seo, Wonjun Lee, Jaehyuk Jang, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01765
Pdf URL: https://arxiv.org/pdf/2603.01765
Copy Paste: [[2603.01765]] Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation(https://arxiv.org/abs/2603.01765)
Keywords: diffusion, foundation model
Abstract: Zero-shot depth completion has gained attention for its ability to generalize across environments without sensor-specific datasets or retraining. However, most existing approaches rely on diffusion-based test-time optimization, which is computationally expensive due to iterative denoising. Recent visual-prompt-based methods reduce training cost but still require repeated forward-backward passes through the full frozen network to optimize input-level prompts, resulting in slow inference. In this work, we show that adapting only the decoder is sufficient for effective test-time optimization, as depth foundation models concentrate depth-relevant information within a low-dimensional decoder subspace. Based on this insight, we propose a lightweight test-time adaptation method that updates only this low-dimensional subspace using sparse depth supervision. Our approach achieves state-of-the-art performance, establishing a new Pareto frontier between accuracy and efficiency for test-time adaptation. Extensive experiments on five indoor and outdoor datasets demonstrate consistent improvements over prior methods, highlighting the practicality of fast zero-shot depth completion.

Title: CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning

Authors: Pratik Jawahar, Maurizio Pierini
Subjects: cs.LG, cs.AI, physics.app-ph
Abstract URL: https://arxiv.org/abs/2603.01768
Pdf URL: https://arxiv.org/pdf/2603.01768
Copy Paste: [[2603.01768]] CHLU: The Causal Hamiltonian Learning Unit as a Symplectic Primitive for Deep Learning(https://arxiv.org/abs/2603.01768)
Keywords: generative
Abstract: Current deep learning primitives dealing with temporal dynamics suffer from a fundamental dichotomy: they are either discrete and unstable (LSTMs) (Pascanu et al., 2013), leading to exploding or vanishing gradients; or they are continuous and dissipative (Neural ODEs) (Dupont et al., 2019), which destroy information over time to ensure stability. We propose the Causal Hamiltonian Learning Unit (pronounced: clue), a novel physics-grounded computational learning primitive. By enforcing a Relativistic Hamiltonian structure and utilizing symplectic integration, a CHLU strictly conserves phase-space volume, as an attempt to solve the memory-stability trade-off. We show that the CHLU is designed for infinite-horizon stability, as well as controllable noise filtering. We then demonstrate a CHLU's generative ability using the MNIST dataset as a proof-of-principle.

Title: FreeAct: Freeing Activations for LLM Quantization

Authors: Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu, Fei Shen, Xiu Su, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.01776
Pdf URL: https://arxiv.org/pdf/2603.01776
Copy Paste: [[2603.01776]] FreeAct: Freeing Activations for LLM Quantization(https://arxiv.org/abs/2603.01776)
Keywords: diffusion
Abstract: Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To address this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs demonstrate that FreeAct significantly outperforms baselines, with up to 5.3% performance improvement, supported by in-depth analyses. Our code will be publicly released.
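The one-to-one constraint that this abstract sets out to relax can be illustrated with the standard identity behind transformation-based quantization: rotating activations by an orthogonal matrix and weights by its transpose leaves the layer output unchanged. The sketch below shows only this coupled baseline, not FreeAct itself; the function name `rotate_pair`, the shapes, and the random data are assumptions.

```python
import numpy as np

def rotate_pair(x, w, q):
    """Apply an orthogonal transform q to activations and its inverse (q.T) to weights."""
    return x @ q, q.T @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # activations: tokens x hidden
w = rng.normal(size=(8, 3))   # weights: hidden x output

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

x_rot, w_rot = rotate_pair(x, w, q)

# The product is preserved because q @ q.T is the identity, so the activation-side
# transform fully determines the weight-side transform in this baseline.
print(np.allclose(x @ w, x_rot @ w_rot))
```

In this baseline a different activation transform per token type would require a different weight transform per token type, which is the coupling the abstract describes as rigid.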

Title: LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction

Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01778
Pdf URL: https://arxiv.org/pdf/2603.01778
Copy Paste: [[2603.01778]] LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction(https://arxiv.org/abs/2603.01778)
Keywords: in-context
Abstract: Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.

Title: D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Authors: Zhao Yang, Hengchang Liu, Chuan Cao, Bing Su
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2603.01780
Pdf URL: https://arxiv.org/pdf/2603.01780
Copy Paste: [[2603.01780]] D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation(https://arxiv.org/abs/2603.01780)
Keywords: diffusion, foundation model, generative
Abstract: Early DNA foundation models adopted BERT-style training, achieving good performance on DNA understanding tasks but lacking generative capabilities. Recent autoregressive models enable DNA generation, but employ left-to-right causal modeling that is suboptimal for DNA where regulatory relationships are inherently bidirectional. We present D3LM (Discrete DNA Diffusion Language Model), which unifies bidirectional representation learning and DNA generation through masked diffusion. D3LM directly adopts the Nucleotide Transformer (NT) v2 architecture but reformulates the training objective as masked diffusion in discrete DNA space, enabling both bidirectional understanding and generation capabilities within a single model. Compared to NT v2 of the same size, D3LM achieves improved performance on understanding tasks. Notably, on regulatory element generation, D3LM achieves an SFID of 10.92, closely approaching real DNA sequences (7.85) and substantially outperforming the previous best result of 29.16 from autoregressive models. Our work suggests diffusion language models as a promising paradigm for unified DNA foundation models. We further present the first systematic study of masked diffusion models in the DNA domain, investigating practical design choices such as tokenization schemes and sampling strategies, thereby providing empirical insights and a solid foundation for future research. D3LM has been released at this https URL.

Title: Learning Shortest Paths with Generative Flow Networks

Authors: Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01786
Pdf URL: https://arxiv.org/pdf/2603.01786
Copy Paste: [[2603.01786]] Learning Shortest Paths with Generative Flow Networks(https://arxiv.org/abs/2603.01786)
Keywords: generative
Abstract: In this paper, we present a novel learning framework for finding shortest paths in graphs utilizing Generative Flow Networks (GFlowNets). First, we examine theoretical properties of GFlowNets in non-acyclic environments in relation to shortest paths. We prove that, if the total flow is minimized, forward and backward policies traverse the environment graph exclusively along shortest paths between the initial and terminal states. Building on this result, we show that the pathfinding problem in an arbitrary graph can be solved by training a non-acyclic GFlowNet with flow regularization. We experimentally demonstrate the performance of our method in pathfinding in permutation environments and in solving Rubik's Cubes. For the latter problem, our approach shows competitive results with state-of-the-art machine learning approaches designed specifically for this task in terms of the solution length, while requiring a smaller search budget at test time.

Title: Phase-Type Variational Autoencoders for Heavy-Tailed Data

Authors: Abdelhakim Ziani, András Horváth, Paolo Ballarini
Subjects: cs.LG, cs.AI, stat.ML, stat.OT
Abstract URL: https://arxiv.org/abs/2603.01800
Pdf URL: https://arxiv.org/pdf/2603.01800
Copy Paste: [[2603.01800]] Phase-Type Variational Autoencoders for Heavy-Tailed Data(https://arxiv.org/abs/2603.01800)
Keywords: generative
Abstract: Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions (e.g., Gaussian) that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately recovers diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
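
Note on Phase-Type distributions: the abstract defines a PH distribution as the absorption time of a CTMC. As a minimal sketch of that definition only (not of the PH-VAE decoder, whose parameterization the abstract does not give), the following simulates absorption times for a hypothetical two-phase chain. The names alpha (initial distribution over transient states) and S (sub-generator matrix) and the example rates are assumptions made for illustration.

```python
import random

def sample_phase_type(alpha, S, rng):
    """Draw one absorption time of a CTMC with initial distribution `alpha`
    over transient states and sub-generator `S` (row sums <= 0)."""
    n = len(alpha)
    state = rng.choices(range(n), weights=alpha)[0]
    t = 0.0
    while True:
        rate = -S[state][state]                # total exit rate of this state
        t += rng.expovariate(rate)             # exponential holding time
        absorb_rate = max(0.0, -sum(S[state])) # rate of jumping to absorption
        # Candidate targets: the other transient states, then absorption (index n).
        weights = [S[state][j] if j != state else 0.0 for j in range(n)]
        weights.append(absorb_rate)
        nxt = rng.choices(range(n + 1), weights=weights)[0]
        if nxt == n:
            return t
        state = nxt

# Illustrative chain: phase 0 (rate 2) always moves to phase 1 (rate 1),
# which then absorbs. The exact mean absorption time is 1/2 + 1 = 1.5.
alpha = [1.0, 0.0]
S = [[-2.0, 2.0],
     [0.0, -1.0]]
rng = random.Random(0)
samples = [sample_phase_type(alpha, S, rng) for _ in range(20_000)]
mean_time = sum(samples) / len(samples)
print(f"empirical mean absorption time: {mean_time:.3f}")
```

Adding phases with widely separated rates is what lets such a chain mix several exponential time scales, as the abstract describes.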

Title: Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments

Authors: Dragos Costea, Alina Marcu, Cristina Lazar, Marius Leordeanu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01804
Pdf URL: https://arxiv.org/pdf/2603.01804
Copy Paste: [[2603.01804]] Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments(https://arxiv.org/abs/2603.01804)
Keywords: generative
Abstract: We study the ongoing debate regarding the statistical fidelity of AI-generated data compared to human-generated data in the context of non-verbal communication using full body motion. Concretely, we ask if contemporary generative models move beyond surface mimicry to participate in the silent, but expressive dialogue of body language. We tackle this question by introducing the first framework that generates a natural non-verbal interaction between Human and AI in real-time from 2D body keypoints. Our experiments utilize four lightweight architectures which run at up to 100 FPS on an NVIDIA Orin Nano, effectively closing the perception-action loop needed for natural Human-AI interaction. We trained on 437 human video clips and demonstrated that pretraining on synthetically-generated sequences reduces motion errors significantly, without sacrificing speed. Yet, a measurable reality gap persists. When the best model is evaluated on keypoints extracted from cutting-edge text-to-video systems, such as SORA and VEO, we observe that performance drops on SORA-generated clips. However, it degrades far less on VEO, suggesting that temporal coherence, not image fidelity, drives real-world performance. Our results demonstrate that statistically distinguishable differences persist between Human and AI motion.

Title: Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes

Authors: Hongkun Dou, Zike Chen, Zeyu Li, Hongjue Li, Lijun Yang, Yue Deng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.01837
Pdf URL: https://arxiv.org/pdf/2603.01837
Copy Paste: [[2603.01837]] Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes(https://arxiv.org/abs/2603.01837)
Keywords: diffusion, generative
Abstract: Diffusion models have gained prominence as powerful generative tools for solving inverse problems due to their ability to model complex data distributions. However, existing methods typically rely on complete knowledge of the forward observation process to compute gradients for guided sampling, limiting their applicability in scenarios where such information is unavailable. In this work, we introduce Constrained Particle Seeking (CPS), a novel gradient-free approach that leverages all candidate particle information to actively search for the optimal particle while incorporating constraints aligned with high-density regions of the unconditional prior. Unlike previous methods that passively select promising candidates, CPS reformulates the inverse problem as a constrained optimization task, enabling more flexible and efficient particle seeking. We demonstrate that CPS can effectively solve both image and scientific inverse problems, achieving results comparable to gradient-based methods while significantly outperforming gradient-free alternatives. Code is available at this https URL.

Title: Phishing the Phishers with SpecularNet: Hierarchical Graph Autoencoding for Reference-Free Web Phishing Detection

Authors: Tailai Song, Pedro Casas, Michela Meo
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01874
Pdf URL: https://arxiv.org/pdf/2603.01874
Copy Paste: [[2603.01874]] Phishing the Phishers with SpecularNet: Hierarchical Graph Autoencoding for Reference-Free Web Phishing Detection(https://arxiv.org/abs/2603.01874)
Keywords: generative
Abstract: Phishing remains the most pervasive threat to the Web, enabling large-scale credential theft and financial fraud through deceptive webpages. While recent reference-based and generative-AI-driven phishing detectors achieve strong accuracy, their reliance on external knowledge bases, cloud services, and complex multimodal pipelines fundamentally limits practicality, scalability, and reproducibility. In contrast, conventional deep learning approaches often fail to generalize to evolving phishing campaigns. We introduce SpecularNet, a novel lightweight framework for reference-free web phishing detection that demonstrates how carefully designed compact architectures can rival heavyweight systems. SpecularNet operates solely on the domain name and HTML structure, modeling the Document Object Model (DOM) as a tree and leveraging a hierarchical graph autoencoding architecture with directional, level-wise message passing. This design captures higher-order structural invariants of phishing webpages while enabling fast, end-to-end inference on standard CPUs. Extensive evaluation against 13 state-of-the-art phishing detectors, including leading reference-based systems, shows that SpecularNet achieves competitive detection performance with dramatically lower computational cost. On benchmark datasets, it reaches an F1 score of 93.9%, slightly trailing the best reference-based method while reducing inference time from several seconds to approximately 20 milliseconds per webpage. Field and robustness evaluations further validate SpecularNet in real-world deployments, on a newly collected 2026 open-world dataset, and against adversarial attacks.

Title: CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection

Authors: Yiheng Li, Zichang Tan, Guoqing Xu, Yijun Ye, Yang Yang, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01878
Pdf URL: https://arxiv.org/pdf/2603.01878
Copy Paste: [[2603.01878]] CTForensics: A Comprehensive Dataset and Method for AI-Generated CT Image Detection(https://arxiv.org/abs/2603.01878)
Keywords: generative
Abstract: With the rapid development of generative AI in medical imaging, synthetic Computed Tomography (CT) images have demonstrated great potential in applications such as data augmentation and clinical diagnosis, but they also introduce serious security risks. Despite the increasing security concerns, existing studies on CT forgery detection are still limited and fail to adequately address real-world challenges. These limitations are mainly reflected in two aspects: the absence of datasets that can effectively evaluate model generalization to reflect the real-world application requirements, and the reliance on detection methods designed for natural images that are insensitive to CT-specific forgery artifacts. In this view, we propose CTForensics, a comprehensive dataset designed to systematically evaluate the generalization capability of CT forgery detection methods, which includes ten diverse CT generative methods. Moreover, we introduce the Enhanced Spatial-Frequency CT Forgery Detector (ESF-CTFD), an efficient CNN-based neural network that captures forgery cues across the wavelet, spatial, and frequency domains. First, it transforms the input CT image into three scales and extracts features at each scale via the Wavelet-Enhanced Central Stem. Then, starting from the largest-scale features, the Spatial Process Block gradually performs feature fusion with the smaller-scale ones. Finally, the Frequency Process Block learns frequency-domain information for predicting the final results. Experiments demonstrate that ESF-CTFD consistently outperforms existing methods and exhibits superior generalization across different CT generative models.

Title: Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling

Authors: Muyu Liu, Xuanyu Tian, Chenhe Du, Qing Wu, Hongjiang Wei, Yuyao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01890
Pdf URL: https://arxiv.org/pdf/2603.01890
Copy Paste: [[2603.01890]] Resolving Blind Inverse Problems under Dynamic Range Compression via Structured Forward Operator Modeling(https://arxiv.org/abs/2603.01890)
Keywords: diffusion
Abstract: Recovering radiometric fidelity from unknown dynamic range compression (UDRC), such as low-light enhancement and HDR reconstruction, is a challenging blind inverse problem, due to the unknown forward model and irreversible information loss introduced by compression. To address this challenge, we first identify monotonicity as the fundamental physical invariant shared across UDRC tasks. Leveraging this insight, we introduce the cascaded monotonic Bernstein (CaMB) operator to parameterize the unknown forward model. CaMB enforces monotonicity as a hard architectural inductive bias, constraining optimization to physically consistent mappings and enabling robust and stable operator estimation. We further integrate CaMB with a plug-and-play diffusion framework, proposing CaMB-Diff. Within this framework, the diffusion model serves as a powerful geometric prior for structural and semantic recovery, while CaMB explicitly models and corrects radiometric distortions through a physically grounded forward operator. Extensive experiments on a variety of zero-shot UDRC tasks, including low-light enhancement, low-field MRI enhancement, and HDR reconstruction, demonstrate that CaMB-Diff significantly outperforms state-of-the-art zero-shot baselines in terms of both signal fidelity and physical consistency. Moreover, we empirically validate the effectiveness of the proposed CaMB parameterization in accurately modeling the unknown forward operator.
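
Note on monotone Bernstein maps: the abstract does not spell out how CaMB is parameterized. One standard way to make monotonicity a hard constraint, sketched here as an assumption rather than as the authors' construction, is to build the Bernstein coefficients as a cumulative sum of positive increments: a Bernstein polynomial with non-decreasing coefficients is non-decreasing on [0, 1], and a composition ("cascade") of such maps stays monotone. The function names and raw parameter values below are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)); always strictly positive.
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def monotone_coeffs(raw):
    # Positive increments, normalized and accumulated, give coefficients
    # rising from 0 to 1 regardless of the unconstrained values in `raw`.
    incs = [softplus(r) for r in raw]
    total = sum(incs)
    coeffs, acc = [0.0], 0.0
    for inc in incs:
        acc += inc / total
        coeffs.append(acc)
    return coeffs

def bernstein(coeffs, x):
    # Evaluate sum_k c_k * C(n, k) * x^k * (1 - x)^(n - k) for x in [0, 1].
    n = len(coeffs) - 1
    return sum(c * math.comb(n, k) * x**k * (1.0 - x) ** (n - k)
               for k, c in enumerate(coeffs))

def cascade(raw_layers, x):
    # Compose several monotone Bernstein maps of [0, 1] onto [0, 1].
    for raw in raw_layers:
        x = bernstein(monotone_coeffs(raw), x)
    return x

raw_layers = [[-1.0, 2.0, 0.5, -3.0], [0.3, -0.7, 1.2]]  # arbitrary values
grid = [i / 50 for i in range(51)]
values = [cascade(raw_layers, x) for x in grid]
print(values[0], values[25], values[-1])
```

Because the constraint lives in the parameterization, any optimizer acting on the raw values can only ever produce a monotone curve.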

Title: Generative Visual Chain-of-Thought for Image Editing

Authors: Zijin Yin, Tiankai Hang, Yiji Cheng, Shiyi Zhang, Runze He, Yu Xu, Chunyu Wang, Bing Li, Zheng Chang, Kongming Liang, Qinglin Lu, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01893
Pdf URL: https://arxiv.org/pdf/2603.01893
Copy Paste: [[2603.01893]] Generative Visual Chain-of-Thought for Image Editing(https://arxiv.org/abs/2603.01893)
Keywords: generative
Abstract: Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT), a unified framework that performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit. Unlike prior text-only CoT or tool-dependent visual CoT paradigms, GVCoT jointly optimizes visual tokens generated during the reasoning and editing phases in an end-to-end manner. This approach fosters the emergence of innate spatial reasoning ability and enables more effective utilization of visual-domain cues. The main challenge of training GVCoT lies in the scarcity of large-scale editing data with precise edit region annotations; to this end, we construct GVCoT-Edit-Instruct, a dataset of 1.8M high-quality samples spanning 19 tasks. We adopt a progressive training strategy: supervised fine-tuning to build foundational localization ability in the reasoning trace before final editing, followed by reinforcement learning to further improve reasoning and editing quality. Finally, we introduce SREdit-Bench, a new benchmark designed to comprehensively stress-test models under sophisticated scenes and fine-grained referring expressions. Experiments demonstrate that GVCoT consistently outperforms state-of-the-art models on SREdit-Bench and ImgEdit. We hope our GVCoT will inspire future research toward interpretable and precise image editing.

Title: Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport

Authors: Muyu Liu, Chenhe Du, Xuanyu Tian, Qing Wu, Xiao Wang, Haonan Zhang, Hongjiang Wei, Yuyao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01913
Pdf URL: https://arxiv.org/pdf/2603.01913
Copy Paste: [[2603.01913]] Zero-shot Low-Field MRI Enhancement via Diffusion-Based Adaptive Contrast Transport(https://arxiv.org/abs/2603.01913)
Keywords: diffusion
Abstract: Low-field (LF) magnetic resonance imaging (MRI) democratizes access to diagnostic imaging but is fundamentally limited by low signal-to-noise ratio and significant tissue contrast distortion due to field-dependent relaxation dynamics. Reconstructing high-field (HF) quality images from LF data is a blind inverse problem, severely challenged by the scarcity of paired training data and the unknown, non-linear contrast transformation operator. Existing zero-shot methods, which assume simplified linear degradation, often fail to recover authentic tissue contrast. In this paper, we propose DACT (Diffusion-Based Adaptive Contrast Transport), a novel zero-shot framework that restores HF-quality images without paired supervision. DACT synergizes a pre-trained HF diffusion prior to ensure anatomical fidelity with a physically-informed adaptive forward model. Specifically, we introduce a differentiable Sinkhorn optimal transport module that explicitly models and corrects the intensity distribution shift between LF and HF domains during the reverse diffusion process. This allows the framework to dynamically learn the intractable contrast mapping while preserving topological consistency. Extensive experiments on simulated and real clinical LF datasets demonstrate that DACT achieves state-of-the-art performance, yielding reconstructions with superior structural detail and correct tissue contrast.

Title: AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

Authors: Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01914
Pdf URL: https://arxiv.org/pdf/2603.01914
Copy Paste: [[2603.01914]] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth(https://arxiv.org/abs/2603.01914)
Keywords: self-supervised
Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.

Title: LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Authors: Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01928
Pdf URL: https://arxiv.org/pdf/2603.01928
Copy Paste: [[2603.01928]] LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving(https://arxiv.org/abs/2603.01928)
Keywords: foundation model
Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The framework is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and is refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks.

Title: Dream2Learn: Structured Generative Dreaming for Continual Learning

Authors: Salvatore Calcagno, Matteo Pennisi, Federica Proietto Salanitri, Amelia Sorrenti, Simone Palazzo, Concetto Spampinato, Giovanni Bellitto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01935
Pdf URL: https://arxiv.org/pdf/2603.01935
Copy Paste: [[2603.01935]] Dream2Learn: Structured Generative Dreaming for Continual Learning(https://arxiv.org/abs/2603.01935)
Keywords: diffusion, generative
Abstract: Continual learning requires balancing plasticity and stability while mitigating catastrophic forgetting. Inspired by human dreaming as a mechanism for internal simulation and knowledge restructuring, we introduce Dream2Learn (D2L), a framework in which a model autonomously generates structured synthetic experiences from its own internal representations and uses them for self-improvement. Rather than reconstructing past data as in generative replay, D2L enables a classifier to create novel, semantically distinct dreamed classes that are coherent with its learned knowledge yet do not correspond to previously observed data. These dreamed samples are produced by conditioning a frozen diffusion model through soft prompt optimization driven by the classifier itself. The generated data are not used to replace memory, but to expand and reorganize the representation space, effectively allowing the network to self-train on internally synthesized concepts. By integrating dreamed classes into continual training, D2L proactively structures latent features to support forward knowledge transfer and adaptation to future tasks. This prospective self-training mechanism mirrors the role of sleep in consolidating and reorganizing memory, turning internal simulations into a tool for improved generalization. Experiments on Mini-ImageNet, FG-ImageNet, and ImageNet-R demonstrate that D2L consistently outperforms strong rehearsal-based baselines and achieves positive forward transfer, confirming its ability to enhance adaptability through internally generated training signals.

Title: Probabilistic Retrofitting of Learned Simulators

Authors: Cristiana Diaconu, Miles Cranmer, Richard E. Turner, Tanya Marwah, Payel Mukhopadhyay
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2603.01949
Pdf URL: https://arxiv.org/pdf/2603.01949
Copy Paste: [[2603.01949]] Probabilistic Retrofitting of Learned Simulators(https://arxiv.org/abs/2603.01949)
Keywords: foundation model
Abstract: Dominant approaches for modelling Partial Differential Equations (PDEs) rely on deterministic predictions, yet many physical systems of interest are inherently chaotic and uncertain. While training probabilistic models from scratch is possible, it is computationally expensive and fails to leverage the significant resources already invested in high-performing deterministic backbones. In this work, we adopt a training-efficient strategy to transform pre-trained deterministic models into probabilistic ones via retrofitting with a proper scoring rule: the Continuous Ranked Probability Score (CRPS). Crucially, this approach is architecture-agnostic: it applies the same adaptation mechanism across distinct model backbones with minimal code modifications. The method proves highly effective across different scales of pre-training: for models trained on single dynamical systems, we achieve 20-54% reductions in rollout CRPS and up to 30% improvements in variance-normalised RMSE (VRMSE) relative to compute-matched deterministic fine-tuning. We further validate our approach on a PDE foundation model, trained on multiple systems and retrofitted on the dataset of interest, to show that our probabilistic adaptation yields an improvement of up to 40% in CRPS and up to 15% in VRMSE compared to deterministic fine-tuning. Validated across diverse architectures and dynamics, our results show that probabilistic PDE modelling need not require retraining from scratch, but can be unlocked from existing deterministic backbones with modest additional training cost.
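
Note on CRPS: for a scalar target y and a forecast distribution F, the score can be written as CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X and X' independent draws from F. As a minimal sketch, assuming the forecast is represented by a finite ensemble of samples (the abstract does not say how the paper estimates or differentiates the score), the function below computes the plain ensemble estimate; the function name is hypothetical.

```python
def crps_ensemble(samples, y):
    """Ensemble estimate of CRPS for one scalar observation `y`:
    mean |x_i - y| minus half the mean pairwise |x_i - x_j|."""
    m = len(samples)
    accuracy = sum(abs(x - y) for x in samples) / m
    spread = sum(abs(a - b) for a in samples for b in samples) / (m * m)
    return accuracy - 0.5 * spread

# A one-member ensemble has zero spread, so the score is the absolute error.
single = crps_ensemble([2.0], 3.5)
# A two-member ensemble bracketing the observation.
bracketing = crps_ensemble([0.0, 2.0], 1.0)
# An ensemble concentrated exactly on the observation scores zero.
perfect = crps_ensemble([1.0, 1.0], 1.0)
print(single, bracketing, perfect)
```

Lower is better: the first term rewards samples close to the observation, and the second term keeps the ensemble from being penalized for spread that honestly reflects uncertainty.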

Title: Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

Authors: Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.01950
Pdf URL: https://arxiv.org/pdf/2603.01950
Copy Paste: [[2603.01950]] Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment(https://arxiv.org/abs/2603.01950)
Keywords: generative
Abstract: A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Title: CoVAE: correlated multimodal generative modeling

Authors: Federico Caretti, Guido Sanguinetti
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.01965
Pdf URL: https://arxiv.org/pdf/2603.01965
Copy Paste: [[2603.01965]] CoVAE: correlated multimodal generative modeling(https://arxiv.org/abs/2603.01965)
Keywords: generative
Abstract: Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.

Title: Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Authors: Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.01993
Pdf URL: https://arxiv.org/pdf/2603.01993
Copy Paste: [[2603.01993]] Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection(https://arxiv.org/abs/2603.01993)
Keywords: generative
Abstract: Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.

Title: Mitigating topology biases in Graph Diffusion via Counterfactual Intervention

Authors: Wendi Wang, Jiaxi Yang, Yongkang Du, Lu Lin
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2603.02005
Pdf URL: https://arxiv.org/pdf/2603.02005
Copy Paste: [[2603.02005]] Mitigating topology biases in Graph Diffusion via Counterfactual Intervention(https://arxiv.org/abs/2603.02005)
Keywords: diffusion
Abstract: Graph diffusion models have gained significant attention in graph generation tasks, but they often inherit and amplify topology biases from sensitive attributes (e.g. gender, age, region), leading to unfair synthetic graphs. Existing fair graph generation methods using diffusion models are limited to specific graph-based applications with complete labels or require simultaneous updates for graph structure and node attributes, making them unsuitable for general usage. To relax these limitations by applying the debiasing method directly on graph topology, we propose Fair Graph Diffusion Model (FairGDiff), a counterfactual-based one-step solution that mitigates topology biases while balancing fairness and utility. In detail, we construct a causal model to capture the relationship between sensitive attributes, biased link formation, and the generated graph structure. By answering the counterfactual question "Would the graph structure change if the sensitive attribute were different?", we estimate an unbiased treatment and incorporate it into the diffusion process. FairGDiff integrates counterfactual learning into both forward diffusion and backward denoising, ensuring that the generated graphs are independent of sensitive attributes while preserving structural integrity. Extensive experiments on real-world datasets demonstrate that FairGDiff achieves a superior trade-off between fairness and utility, outperforming existing fair graph generation methods while maintaining scalability.

Title: MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising

Authors: Peiyuan Jing, Chun-Wun Cheng, Liutao Yang, Zhenxuan Zhang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier A. Montoya-Zegarra
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02012
Pdf URL: https://arxiv.org/pdf/2603.02012
Copy Paste: [[2603.02012]] MAP-Diff: Multi-Anchor Guided Diffusion for Progressive 3D Whole-Body Low-Dose PET Denoising(https://arxiv.org/abs/2603.02012)
Keywords: diffusion
Abstract: Low-dose Positron Emission Tomography (PET) reduces radiation exposure but suffers from severe noise and quantitative degradation. Diffusion-based denoising models achieve strong final reconstructions, yet their reverse trajectories are typically unconstrained and not aligned with the progressive nature of PET dose formation. We propose MAP-Diff, a multi-anchor guided diffusion framework for progressive 3D whole-body PET denoising. MAP-Diff introduces clinically observed intermediate-dose scans as trajectory anchors and enforces timestep-dependent supervision to regularize the reverse process toward dose-aligned intermediate states. Anchor timesteps are calibrated via degradation matching between simulated diffusion corruption and real multi-dose PET pairs, and a timestep-weighted anchor loss stabilizes stage-wise learning. At inference, the model requires only ultra-low-dose input while enabling progressive, dose-consistent intermediate restoration. Experiments on internal (Siemens Biograph Vision Quadra) and cross-scanner (United Imaging uEXPLORER) datasets show consistent improvements over strong CNN-, Transformer-, GAN-, and diffusion-based baselines. On the internal dataset, MAP-Diff improves PSNR from 42.48 dB to 43.71 dB (+1.23 dB), increases SSIM to 0.986, and reduces NMAE from 0.115 to 0.103 (-0.012) compared to 3D DDPM. Performance gains generalize across scanners, achieving 34.42 dB PSNR and 0.141 NMAE on the external cohort, outperforming all competing methods.

Title: CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data

Authors: Amir Asiaee, Zhuohui J. Liang, Chao Yan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02015
Pdf URL: https://arxiv.org/pdf/2603.02015
Copy Paste: [[2603.02015]] CausalWrap: Model-Agnostic Causal Constraint Wrappers for Tabular Synthetic Data(https://arxiv.org/abs/2603.02015)
Keywords: diffusion
Abstract: Tabular synthetic data generators are typically trained to match observational distributions, which can yield high conventional utility (e.g., column correlations, predictive accuracy) yet poor preservation of structural relations relevant to causal analysis and out-of-distribution (OOD) reasoning. When the downstream use of synthetic data involves causal reasoning -- estimating treatment effects, evaluating policies, or testing mediation pathways -- merely matching the observational distribution is insufficient: structural fidelity and treatment-mechanism preservation become essential. We propose CausalWrap (CW), a model-agnostic wrapper that injects partial causal knowledge (PCK) -- trusted edges, forbidden edges, and qualitative/monotonic constraints -- into any pretrained base generator (GAN, VAE, or diffusion model), without requiring access to its internals. CW learns a lightweight, differentiable post-hoc correction map applied to samples from the base generator, optimized with causal penalty terms under an augmented-Lagrangian schedule. We provide theoretical results connecting penalty-based optimization to constraint satisfaction and relating approximate factorization to joint distributional control. We validate CW on simulated structural causal models (SCMs) with known ground-truth interventions, semi-synthetic causal benchmarks (IHDP and an ACIC-style suite), and a real-world ICU cohort (MIMIC-IV) with expert-elicited partial graphs. CW improves causal fidelity across diverse base generators -- e.g., reducing average treatment effect (ATE) error by up to 63% on ACIC and lifting ATE agreement from 0.00 to 0.38 on the intensive care unit (ICU) cohort -- while largely retaining conventional utility.

Title: PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking

Authors: He Li, Feichen Song, Boyi Zeng, Shixiang Song, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02023
Pdf URL: https://arxiv.org/pdf/2603.02023
Copy Paste: [[2603.02023]] PonderLM-3: Adaptive Token-Wise Pondering with Differentiable Masking(https://arxiv.org/abs/2603.02023)
Keywords: self-supervised
Abstract: Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework for token-wise adaptive pondering that learns to selectively allocate additional computation under purely self-supervised objectives, built on top of the PonderLM-2 backbone. This makes additional inference computation an allocatable per-token resource, so tokens receive more computation only when it is beneficial, rather than paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 defines a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains comparable performance to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.

Title: WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories

Authors: Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02049
Pdf URL: https://arxiv.org/pdf/2603.02049
Copy Paste: [[2603.02049]] WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories(https://arxiv.org/abs/2603.02049)
Keywords: diffusion
Abstract: Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction. Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training. Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.

Title: ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks

Authors: Joël Küchler, Ellen van Maren, Vaiva Vasiliauskaitė, Katarina Vulić, Reza Abbasi-Asl, Stephan J. Ihle
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02063
Pdf URL: https://arxiv.org/pdf/2603.02063
Copy Paste: [[2603.02063]] ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks(https://arxiv.org/abs/2603.02063)
Keywords: generative
Abstract: Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.

Title: From Pixels to Patches: Pooling Strategies for Earth Embeddings

Authors: Isaac Corley, Caleb Robinson, Inbal Becker-Reshef, Juan M. Lavista Ferres
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02080
Pdf URL: https://arxiv.org/pdf/2603.02080
Copy Paste: [[2603.02080]] From Pixels to Patches: Pooling Strategies for Earth Embeddings(https://arxiv.org/abs/2603.02080)
Keywords: foundation model
Abstract: As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
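
Editor's sketch (not from the paper): the two pooling schemes named in the abstract can be illustrated in a few lines of standard-library Python. The function names `gem_pool` and `stats_pool`, the exponent `p=3.0`, the clamp `eps`, and the use of population standard deviation are assumptions made for illustration; the abstract does not state the authors' settings. GeM is written in its common form, (mean(x^p))^(1/p), with values clamped to a small positive floor so the power is defined, which may differ from how the paper handles negative embedding values.

```python
from statistics import fmean, pstdev


def gem_pool(pixels, p=3.0, eps=1e-6):
    """Generalized mean pooling over a patch.

    pixels: list of per-pixel embedding vectors (equal-length lists of floats).
    Returns one vector with the same dimensionality as a single pixel.
    Values are clamped to eps (an assumption) so x**p is well defined.
    """
    columns = list(zip(*pixels))  # one tuple of values per embedding dimension
    return [fmean(max(v, eps) ** p for v in col) ** (1.0 / p) for col in columns]


def stats_pool(pixels):
    """Concatenate per-dimension min, max, mean, and std (4x the input size)."""
    columns = list(zip(*pixels))
    return (
        [min(col) for col in columns]
        + [max(col) for col in columns]
        + [fmean(col) for col in columns]
        + [pstdev(col) for col in columns]  # population std, an assumption
    )


if __name__ == "__main__":
    patch = [[1.0, 2.0], [3.0, 2.0]]  # two pixels, two embedding dimensions
    print(gem_pool(patch, p=1.0))     # p = 1 reduces to mean pooling
    print(gem_pool(patch, p=3.0))     # larger p weights large values more
    print(stats_pool(patch))          # length 8 = 4 x embedding size
```

With `p = 1` the GeM output coincides with mean pooling, and as `p` grows it approaches max pooling, which is why it can replace mean pooling without changing the embedding dimensionality.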

Title: LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

Authors: Hualiang Wei, Shunran Jia, Jialun Liu, Wenhui Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02129
Pdf URL: https://arxiv.org/pdf/2603.02129
Copy Paste: [[2603.02129]] LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation(https://arxiv.org/abs/2603.02129)
Keywords: diffusion, generative
Abstract: We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.

Title: GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

Authors: Srikumar Sastry, Dan Cher, Brian Wei, Aayush Dhakal, Subash Khanal, Dev Gupta, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.02172
Pdf URL: https://arxiv.org/pdf/2603.02172
Copy Paste: [[2603.02172]] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis(https://arxiv.org/abs/2603.02172)
Keywords: diffusion, generative
Abstract: We introduce GeoDiT, a diffusion transformer designed for text-to-satellite image generation with point-based control. Existing controlled satellite image generative models often require pixel-level maps that are time-consuming to acquire, yet semantically limited. To address this limitation, we introduce a novel point-based conditioning framework that controls the generation process through the spatial location of the points and the textual description associated with each point, providing semantically rich control signals. This approach enables flexible, annotation-friendly, and computationally simple inference for satellite image generation. To this end, we introduce an adaptive local attention mechanism that effectively regularizes the attention scores based on the input point queries. We systematically evaluate various domain-specific design choices for training GeoDiT, including the selection of satellite image representation for alignment and geolocation representation for conditioning. Our experiments demonstrate that GeoDiT achieves impressive generation performance, surpassing the state-of-the-art remote sensing generative models.

Title: Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02175
Pdf URL: https://arxiv.org/pdf/2603.02175
Copy Paste: [[2603.02175]] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance(https://arxiv.org/abs/2603.02175)
Keywords: generative
Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at this https URL.

Title: Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

Authors: Divyanshu Daiya, Aniket Bera
Subjects: cs.CV, cs.AI, cs.GR, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2603.02190
Pdf URL: https://arxiv.org/pdf/2603.02190
Copy Paste: [[2603.02190]] Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation(https://arxiv.org/abs/2603.02190)
Keywords: diffusion
Abstract: We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.

Title: Frontier Models Can Take Actions at Low Probabilities

Authors: Alex Serrano, Wen Xing, David Lindner, Erik Jenner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.02202
Pdf URL: https://arxiv.org/pdf/2603.02202
Copy Paste: [[2603.02202]] Frontier Models Can Take Actions at Low Probabilities(https://arxiv.org/abs/2603.02202)
Keywords: in-context
Abstract: Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or a hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
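
Editor's sketch (not from the paper): one way to see why an in-context entropy source such as a UUID makes low-rate calibration feasible is to map the identifier deterministically to a number in [0, 1) and act only when it falls below the target rate. The helper names `should_act` and `empirical_rate` and the SHA-256 hashing step are hypothetical illustrations; the abstract does not describe the strategies the models actually used.

```python
import hashlib
import uuid


def should_act(entropy: str, rate: float) -> bool:
    """Map an entropy string to a value in [0, 1) and compare it to the rate.

    The same entropy string always yields the same decision, so no hidden
    random state is needed beyond the identifier itself.
    """
    digest = hashlib.sha256(entropy.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # approximately uniform in [0, 1)
    return u < rate


def empirical_rate(rate: float, trials: int) -> float:
    """Estimate how often the action is taken across fresh random UUIDs."""
    hits = sum(should_act(str(uuid.uuid4()), rate) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # The target rate of 1 in 10,000 matches the example in the abstract.
    print(empirical_rate(rate=1e-4, trials=200_000))
```

Checking calibration at a rate of 1 in 10,000 needs many more than 10,000 resamples to give a stable estimate, which is the same sampling limitation that the abstract attributes to pre-deployment evaluations.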