2025-11-24

Title: Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model

Authors: Guanlue Li, Xufeng Zhao, Fang Wu, Sören Laue
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16675
Pdf URL: https://arxiv.org/pdf/2511.16675
Copy Paste: [[2511.16675]] Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model(https://arxiv.org/abs/2511.16675)
Keywords: diffusion
Abstract: Protein-protein interactions (PPIs) are governed by surface complementarity and hydrophobic interactions at protein interfaces. However, designing diverse and physically realistic protein structure and surfaces that precisely complement target receptors remains a significant challenge in computational protein design. In this work, we introduce PepBridge, a novel framework for the joint design of protein surface and structure that seamlessly integrates receptor surface geometry and biochemical properties. Starting with a receptor surface represented as a 3D point cloud, PepBridge generates complete protein structures through a multi-step process. First, it employs denoising diffusion bridge models (DDBMs) to map receptor surfaces to ligand surfaces. Next, a multi-model diffusion model predicts the corresponding structure, while Shape-Frame Matching Networks ensure alignment between surface geometry and backbone architecture. This integrated approach facilitates surface complementarity, conformational stability, and chemical feasibility. Extensive validation across diverse protein design scenarios demonstrates PepBridge's efficacy in generating structurally viable proteins, representing a significant advancement in the joint design of top-down protein structure.

Title: Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions

Authors: Takuya Igaue, Catia Correia-Caeiro, Akito Yoshida, Takako Miyabe-Nishiwaki, Ryusuke Hayashi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2511.16711
Pdf URL: https://arxiv.org/pdf/2511.16711
Copy Paste: [[2511.16711]] Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions(https://arxiv.org/abs/2511.16711)
Keywords: generative
Abstract: Generating animal faces using generative AI techniques is challenging because the available training images are limited both in quantity and variation, particularly for facial expressions across individuals. In this study, we focus on macaque monkeys, widely studied in systems neuroscience and evolutionary research, and propose a method to generate their facial expressions using a style-based generative image model (i.e., StyleGAN2). To address data limitations, we implemented: 1) data augmentation by synthesizing new facial expression images using a motion transfer to animate still images with computer graphics, 2) sample selection based on the latent representation of macaque faces from an initially trained StyleGAN2 model to ensure the variation and uniform sampling in training dataset, and 3) loss function refinement to ensure the accurate reproduction of subtle movements, such as eye movements. Our results demonstrate that the proposed method enables the generation of diverse facial expressions for multiple macaque individuals, outperforming models trained solely on original still images. Additionally, we show that our model is effective for style-based image editing, where specific style parameters correspond to distinct facial movements. These findings underscore the model's potential for disentangling motion components as style parameters, providing a valuable tool for research on macaque facial expressions.

Title: Password Strength Analysis Through Social Network Data Exposure: A Combined Approach Relying on Data Reconstruction and Generative Models

Authors: Maurizio Atzori, Eleonora Calò, Loredana Caruccio, Stefano Cirillo, Giuseppe Polese, Giandomenico Solimando
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16716
Pdf URL: https://arxiv.org/pdf/2511.16716
Copy Paste: [[2511.16716]] Password Strength Analysis Through Social Network Data Exposure: A Combined Approach Relying on Data Reconstruction and Generative Models(https://arxiv.org/abs/2511.16716)
Keywords: generative
Abstract: Although passwords remain the primary defense against unauthorized access, users often tend to use passwords that are easy to remember. This behavior significantly increases security risks, also due to the fact that traditional password strength evaluation methods are often inadequate. In this discussion paper, we present SODA ADVANCE, a data reconstruction tool also designed to enhance evaluation processes related to the password strength. In particular, SODA ADVANCE integrates a specialized module aimed at evaluating password strength by leveraging publicly available data from multiple sources, including social media platforms. Moreover, we investigate the capabilities and risks associated with emerging Large Language Models (LLMs) in evaluating and generating passwords, respectively. Experimental assessments conducted with 100 real users demonstrate that LLMs can generate strong and personalized passwords possibly defined according to user profiles. Additionally, LLMs were shown to be effective in evaluating passwords, especially when they can take into account user profile data.

Title: SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG

Authors: Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16766
Pdf URL: https://arxiv.org/pdf/2511.16766
Copy Paste: [[2511.16766]] SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG(https://arxiv.org/abs/2511.16766)
Keywords: generative
Abstract: Scalable Vector Graphics (SVGs) are central to modern design workflows, offering scaling without distortion and precise editability. However, for single object SVGs, generating multi-view consistent SVGs from a single-view input remains underexplored. We present a three stage framework that produces multi-view SVGs with geometric and color consistency from a single SVG input. First, the rasterized input is lifted to a 3D representation and rendered under target camera poses, producing multi-view images of the object. Next, we extend the temporal memory mechanism of Segment Anything 2 (SAM2) to the spatial domain, constructing a spatial memory bank that establishes part level correspondences across neighboring views, yielding cleaner and more consistent vector paths and color assignments without retraining. Finally, during the raster to vector conversion, we perform path consolidation and structural optimization to reduce redundancy while preserving boundaries and semantics. The resulting SVGs exhibit strong geometric and color consistency across views, significantly reduce redundant paths, and retain fine structural details. This work bridges generative modeling and structured vector representation, providing a scalable route to single input, object level multi-view SVG generation and supporting applications such as asset creation and semantic vector editing.

Title: WorldGen: From Text to Traversable and Interactive 3D Worlds

Authors: Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, Minghao Chen, Geon Yeong Park, Mahima Gupta, Yassir Azziz, Rakesh Ranjan, Andrea Vedaldi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16825
Pdf URL: https://arxiv.org/pdf/2511.16825
Copy Paste: [[2511.16825]] WorldGen: From Text to Traversable and Interactive 3D Worlds(https://arxiv.org/abs/2511.16825)
Keywords: diffusion, generative
Abstract: We introduce WorldGen, a system that enables the automatic creation of large-scale, interactive 3D worlds directly from text prompts. Our approach transforms natural language descriptions into traversable, fully textured environments that can be immediately explored or edited within standard game engines. By combining LLM-driven scene layout reasoning, procedural generation, diffusion-based 3D generation, and object-aware scene decomposition, WorldGen bridges the gap between creative intent and functional virtual spaces, allowing creators to design coherent, navigable worlds without manual modeling or specialized 3D expertise. The system is fully modular and supports fine-grained control over layout, scale, and style, producing worlds that are geometrically consistent, visually rich, and efficient to render in real time. This work represents a step towards accessible, generative world-building at scale, advancing the frontier of 3D generative AI for applications in gaming, simulation, and immersive social environments.

Title: ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds

Authors: Yihang Fu, Lifang He, Qingyu Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16828
Pdf URL: https://arxiv.org/pdf/2511.16828
Copy Paste: [[2511.16828]] ManifoldFormer: Geometric Deep Learning for Neural Dynamics on Riemannian Manifolds(https://arxiv.org/abs/2511.16828)
Keywords: foundation model
Abstract: Existing EEG foundation models mainly treat neural signals as generic time series in Euclidean space, ignoring the intrinsic geometric structure of neural dynamics that constrains brain activity to low-dimensional manifolds. This fundamental mismatch between model assumptions and neural geometry limits representation quality and cross-subject generalization. ManifoldFormer addresses this limitation through a novel geometric deep learning framework that explicitly learns neural manifold representations. The architecture integrates three key innovations: a Riemannian VAE for manifold embedding that preserves geometric structure, a geometric Transformer with geodesic-aware attention mechanisms operating directly on neural manifolds, and a dynamics predictor leveraging neural ODEs for manifold-constrained temporal evolution. Extensive evaluation across four public datasets demonstrates substantial improvements over state-of-the-art methods, with 4.6-4.8% higher accuracy and 6.2-10.2% higher Cohen's Kappa, while maintaining robust cross-subject generalization. The geometric approach reveals meaningful neural patterns consistent with neurophysiological principles, establishing geometric constraints as essential for effective EEG foundation models.

Title: PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models

Authors: Oscar Chew, Po-Yi Lu, Jayden Lin, Kuan-Hao Huang, Hsuan-Tien Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16830
Pdf URL: https://arxiv.org/pdf/2511.16830
Copy Paste: [[2511.16830]] PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models(https://arxiv.org/abs/2511.16830)
Keywords: diffusion
Abstract: Recent studies show that text to image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobstructive elements. With this rewriting strategy, PEPPER disrupt the trigger embedded in the input prompt, dilute the influence of trigger tokens and thereby achieve enhanced robustness. Experiments show that PEPPER is particularly effective against text encoder based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with any existing defenses yielding consistently stronger and generalizable robustness than any standalone method. Our code will be released on Github.

Title: Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks

Authors: Leonardo Pepino, Pablo Riera, Juan Kamienkowski, Luciana Ferrer
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2511.16849
Pdf URL: https://arxiv.org/pdf/2511.16849
Copy Paste: [[2511.16849]] Better audio representations are more brain-like: linking model-brain alignment with performance in downstream auditory tasks(https://arxiv.org/abs/2511.16849)
Keywords: self-supervised
Abstract: Artificial neural networks (ANNs) are increasingly powerful models of brain computation, yet it remains unclear whether improving their task performance also makes their internal representations more similar to brain signals. To address this question in the auditory domain, we quantified the alignment between the internal representations of 36 different audio models and brain activity from two independent fMRI datasets. Using voxel-wise and component-wise regression, and representation similarity analysis (RSA), we found that recent self-supervised audio models with strong performance in diverse downstream tasks are better predictors of auditory cortex activity than older and more specialized models. To assess the quality of the audio representations, we evaluated these models in 6 auditory tasks from the HEAREval benchmark, spanning music, speech, and environmental sounds. This revealed strong positive Pearson correlations ($r>0.7$) between a model's overall task performance and its alignment with brain representations. Finally, we analyzed the evolution of the similarity between audio and brain representations during the pretraining of EnCodecMAE. We discovered that brain similarity increases progressively and emerges early during pretraining, despite the model not being explicitly optimized for this objective. This suggests that brain-like representations can be an emergent byproduct of learning to reconstruct missing information from naturalistic audio data.

Title: The use of vocal biomarkers in the detection of Parkinson's disease: a robust statistical performance comparison of classic machine learning models

Authors: Katia Pires Nascimento do Sacramento, Elliot Q. C. Garcia, Nicéias Silva Vilela, Vinicius P. Sacramento, Tiago A. E. Ferreira
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16856
Pdf URL: https://arxiv.org/pdf/2511.16856
Copy Paste: [[2511.16856]] The use of vocal biomarkers in the detection of Parkinson's disease: a robust statistical performance comparison of classic machine learning models(https://arxiv.org/abs/2511.16856)
Keywords: generative
Abstract: Parkinson's disease (PD) is a progressive neurodegenerative disorder that, in addition to directly impairing functional mobility, is frequently associated with vocal impairments such as hypophonia and dysarthria, which typically manifest in the early stages. The use of vocal biomarkers to support the early diagnosis of PD presents a non-invasive, low-cost, and accessible alternative in clinical settings. Thus, the objective of this cross-sectional study was to consistently evaluate the effectiveness of a Deep Neural Network (DNN) in distinguishing individuals with Parkinson's disease from healthy controls, in comparison with traditional Machine Learning (ML) methods, using vocal biomarkers. Two publicly available voice datasets were used. Mel-frequency cepstral coefficients (MFCCs) were extracted from the samples, and model robustness was assessed using a validation strategy with 1000 independent random executions. Performance was evaluated using classification statistics. Since normality assumptions were not satisfied, non-parametric tests (Kruskal-Wallis and Bonferroni post-hoc tests) were applied to verify whether the tested classification models were similar or different in the classification of PD. With an average accuracy of $98.65\%$ and $92.11\%$ on the Italian Voice dataset and Parkinson's Telemonitoring dataset, respectively, the DNN demonstrated superior performance and efficiency compared to traditional ML models, while also achieving competitive results when benchmarked against relevant studies. Overall, this study confirms the efficiency of DNNs and emphasizes their potential to provide greater accuracy and reliability for the early detection of neurodegenerative diseases using voice-based biomarkers.

Title: Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Authors: Loukas Sfountouris, Giannis Daras, Paris Giampouras
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16870
Pdf URL: https://arxiv.org/pdf/2511.16870
Copy Paste: [[2511.16870]] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment(https://arxiv.org/abs/2511.16870)
Keywords: diffusion, self-supervised, generative
Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model's internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.

Title: Predicting the Formation of Induction Heads

Authors: Tatsuya Aoyama, Ethan Gotlieb Wilcox, Nathan Schneider
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16893
Pdf URL: https://arxiv.org/pdf/2511.16893
Copy Paste: [[2511.16893]] Predicting the Formation of Induction Heads(https://arxiv.org/abs/2511.16893)
Keywords: in-context
Abstract: Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.

Title: Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models

Authors: Hao-Chien Hsueh, Chi-En Yen, Wen-Hsiao Peng, Ching-Chun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16904
Pdf URL: https://arxiv.org/pdf/2511.16904
Copy Paste: [[2511.16904]] Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models(https://arxiv.org/abs/2511.16904)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models have achieved remarkable success in generative tasks across diverse data types. While recent studies have explored alternative degradation processes beyond Gaussian noise, this paper bridges two key diffusion paradigms: hot diffusion, which relies entirely on noise, and cold diffusion, which uses only blurring without noise. We argue that hot diffusion fails to exploit the strong correlation between high-frequency image detail and low-frequency structures, leading to random behaviors in the early steps of generation. Conversely, while cold diffusion leverages image correlations for prediction, it neglects the role of noise (randomness) in shaping the data manifold, resulting in out-of-manifold issues and partially explaining its performance drop. To integrate both strengths, we propose Warm Diffusion, a unified Blur-Noise Mixture Diffusion Model (BNMD), to control blurring and noise jointly. Our divide-and-conquer strategy exploits the spectral dependency in images, simplifying score model estimation by disentangling the denoising and deblurring processes. We further analyze the Blur-to-Noise Ratio (BNR) using spectral analysis to investigate the trade-off between model learning dynamics and changes in the data manifold. Extensive experiments across benchmarks validate the effectiveness of our approach for image generation.

Title: Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content

Authors: Shushi Wang, Zicheng Zhang, Chunyi Li, Wei Wang, Liya Ma, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16908
Pdf URL: https://arxiv.org/pdf/2511.16908
Copy Paste: [[2511.16908]] Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content(https://arxiv.org/abs/2511.16908)
Keywords: generative
Abstract: Quality assessment of AI-generated content is crucial for evaluating model capability and guiding model optimization. However, most existing quality assessment datasets and models provide only a single quality score, which is too coarse to offer targeted guidance for improving generative models. In current applications of AI-generated images, realism and plausibility are two critical dimensions, and with the emergence of unified generation-understanding models, fine-grained evaluation along these dimensions becomes especially effective for improving generative performance. Therefore, we introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. For each image, we annotate the locations of major entities and provide a set of judgment questions and attribution descriptions for these along the dimensions of realism and plausibility. Considering that recent advances in multi-modal large language models (MLLMs) enable fine-grained evaluation of AI-generated images, we construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning. Finally, to enhance MLLM capabilities, we design a fine-tuning framework and conduct experiments on multiple MLLMs using our dataset. Experimental results demonstrate the high quality and significance of our dataset and the comprehensiveness of the benchmark. Dataset and code will be released upon publication.

Title: PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage

Authors: Trieu Nguyen, Hao-Wei Pang, Shasha Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16912
Pdf URL: https://arxiv.org/pdf/2511.16912
Copy Paste: [[2511.16912]] PepEVOLVE: Position-Aware Dynamic Peptide Optimization via Group-Relative Advantage(https://arxiv.org/abs/2511.16912)
Keywords: generative
Abstract: Macrocyclic peptides are an emerging modality that combines biologics-like affinity with small-molecule-like developability, but their vast combinatorial space and multi-parameter objectives make lead optimization slow and challenging. Prior generative approaches such as PepINVENT require chemists to pre-specify mutable positions for optimization, choices that are not always known a priori, and rely on static pretraining and optimization algorithms that limit the model's ability to generalize and effectively optimize peptide sequences. We introduce PepEVOLVE, a position-aware, dynamic framework that learns both where to edit and how to dynamically optimize peptides for multi-objective improvement. PepEVOLVE (i) augments pretraining with dynamic masking and CHUCKLES shifting to improve generalization, (ii) uses a context-free multi-armed bandit router that discovers high-reward residues, and (iii) couples a novel evolving optimization algorithm with group-relative advantage to stabilize reinforcement updates. During in silico evaluations, the router policy reliably learns and concentrates probability on chemically meaningful sites that influence the peptide's properties. On a therapeutically motivated Rev-binding macrocycle benchmark, PepEVOLVE outperformed PepINVENT by reaching higher mean scores (approximately 0.8 vs. 0.6), achieving best candidates with a score of 0.95 (vs. 0.87), and converging in fewer steps under the task of optimizing permeability and lipophilicity with structural constraints. Overall, PepEVOLVE offers a practical, reproducible path to peptide lead optimization when optimal edit sites are unknown, enabling more efficient exploration and improving design quality across multiple objectives.

Title: UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

Authors: Chi Zhang, Jiepeng Wang, Youming Wang, Yuanzhi Liang, Xiaoyan Yang, Zuoxin Li, Haibin Huang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16917
Pdf URL: https://arxiv.org/pdf/2511.16917
Copy Paste: [[2511.16917]] UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2511.16917)
Keywords: diffusion, generative
Abstract: We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.

Title: DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution

Authors: Chaoran Xu, Chengkan Lv, Qiyu Chen, Yunkang Cao, Feng Zhang, Zhengtao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16920
Pdf URL: https://arxiv.org/pdf/2511.16920
Copy Paste: [[2511.16920]] DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution(https://arxiv.org/abs/2511.16920)
Keywords: diffusion, anomaly
Abstract: Anomaly generation is often framed as few-shot fine-tuning with anomalous samples, which contradicts the scarcity that motivates generation and tends to overfit category priors. We tackle the setting where no real anomaly samples or training are available. We propose Delta-Denoising (DeltaDeno), a training-free zero-shot anomaly generation method that localizes and edits defects by contrasting two diffusion branches driven by a minimal prompt pair under a shared schedule. By accumulating per-step denoising deltas into an image-specific localization map, we obtain a mask to guide the latent inpainting during later diffusion steps and preserve the surrounding context while generating realistic local defects. To improve stability and control, DeltaDeno performs token-level prompt refinement that aligns shared content and strengthens anomaly tokens, and applies a spatial attention bias restricted to anomaly tokens in the predicted region. Experiments on public datasets show that DeltaDeno achieves great generation, realism and consistent gains in downstream detection performance. Code will be made publicly available.

Title: Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features

Authors: Jingyi Xu, Meisong Zheng, Ying Chen, Minglang Qiao, Xin Deng, Mai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16928
Pdf URL: https://arxiv.org/pdf/2511.16928
Copy Paste: [[2511.16928]] Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features(https://arxiv.org/abs/2511.16928)
Keywords: diffusion
Abstract: Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82\% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37\% tLPIPS reduction).

Title: CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection

Authors: Rui Xue, Dan He, Fengmei Jin, Chen Zhang, Xiaofang Zhou
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2511.16929
Pdf URL: https://arxiv.org/pdf/2511.16929
Copy Paste: [[2511.16929]] CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection(https://arxiv.org/abs/2511.16929)
Keywords: anomaly
Abstract: Detecting trajectory anomalies is a vital task in modern Intelligent Transportation Systems (ITS), enabling the identification of unsafe, inefficient, or irregular travel behaviours. While deep learning has emerged as the dominant approach, several key challenges remain unresolved. First, sub-trajectory anomaly detection, capable of pinpointing the precise segments where anomalies occur, remains underexplored compared to whole-trajectory analysis. Second, many existing methods depend on carefully tuned thresholds, limiting their adaptability in real-world applications. Moreover, the irregular sampling of trajectory data and the presence of noise in training sets further degrade model performance, making it difficult to learn reliable representations of normal routes. To address these challenges, we propose a contrastive reinforcement learning framework for online trajectory anomaly detection, CroTad. Our method is threshold-free and robust to noisy, irregularly sampled data. By incorporating contrastive learning, CroTad learns to extract diverse normal travel patterns for different itineraries and effectively distinguish anomalous behaviours at both sub-trajectory and point levels. The detection module leverages deep reinforcement learning to perform online, real-time anomaly scoring, enabling timely and fine-grained identification of abnormal segments. Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of our framework across various evaluation scenarios.

Title: Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yazhe Niu, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2511.16955
Pdf URL: https://arxiv.org/pdf/2511.16955
Copy Paste: [[2511.16955]] Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models(https://arxiv.org/abs/2511.16955)
Keywords: generative
Abstract: Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.

Title: MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis

Authors: Di Luo, Shuhui Yang, Mingxin Yang, Jiawei Lu, Yixuan Tang, Xintong Han, Zhuo Chen, Beibei Wang, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16957
Pdf URL: https://arxiv.org/pdf/2511.16957
Copy Paste: [[2511.16957]] MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis(https://arxiv.org/abs/2511.16957)
Keywords: diffusion, foundation model, generative
Abstract: Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks--text-to-material generation, image-to-material generation, and intrinsic decomposition--within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native $1024\times1024$ synthesis that substantially surpasses existing approaches in both quality and diversity.

Title: Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices

Authors: Jigyasa Gupta, Soumya Goyal, Anil Kumar, Ishan Jindal
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16965
Pdf URL: https://arxiv.org/pdf/2511.16965
Copy Paste: [[2511.16965]] Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices(https://arxiv.org/abs/2511.16965)
Keywords: generative
Abstract: Synthesizing realistic cooked food images from raw inputs on edge devices is a challenging generative task, requiring models to capture complex changes in texture, color and structure during cooking. Existing image-to-image generation methods often produce unrealistic results or are too resource-intensive for edge deployment. We introduce the first oven-based cooking-progression dataset with chef-annotated doneness levels and propose an edge-efficient recipe and cooking state guided generator that synthesizes realistic food images conditioned on raw food image. This formulation enables user-preferred visual targets rather than fixed presets. To ensure temporal consistency and culinary plausibility, we introduce a domain-specific \textit{Culinary Image Similarity (CIS)} metric, which serves both as a training loss and a progress-monitoring signal. Our model outperforms existing baselines with significant reductions in FID scores (30\% improvement on our dataset; 60\% on public datasets)

Title: The Finer the Better: Towards Granular-aware Open-set Domain Generalization

Authors: Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16979
Pdf URL: https://arxiv.org/pdf/2511.16979
Copy Paste: [[2511.16979]] The Finer the Better: Towards Granular-aware Open-set Domain Generalization(https://arxiv.org/abs/2511.16979)
Keywords: diffusion
Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between structural risk of known-classes and open-space risk from unknown-classes, and easily suffers from over-confidence, especially when distinguishing ``hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.

Title: DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

Authors: Jonathan Skaza, Parsa Madinei, Ziqi Wen, Miguel Eckstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.16991
Pdf URL: https://arxiv.org/pdf/2511.16991
Copy Paste: [[2511.16991]] DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction(https://arxiv.org/abs/2511.16991)
Keywords: self-supervised
Abstract: Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods--including those trained on multimodal image-text data--while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.

Title: FLUID: Training-Free Face De-identification via Latent Identity Substitution

Authors: Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee, Jong Taek Lee, Soon Ki Jung
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17005
Pdf URL: https://arxiv.org/pdf/2511.17005
Copy Paste: [[2511.17005]] FLUID: Training-Free Face De-identification via Latent Identity Substitution(https://arxiv.org/abs/2511.17005)
Keywords: diffusion
Abstract: We present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a training-free framework that directly substitutes identity in the latent space of pretrained diffusion models. Inspired by substitution mechanisms in chemistry, we reinterpret identity editing as semantic displacement in the latent h-space of a pretrained unconditional diffusion model. Our framework discovers identity-editing directions through optimization guided by novel reagent losses, which supervise for attribute preservation and identity suppression. We further propose both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experimental results on CelebA-HQ and FFHQ demonstrate that FLUID achieves a superior trade-off between identity suppression and attribute preservation, outperforming state-of-the-art de-identification methods in both qualitative and quantitative metrics.

Title: Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

Authors: Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2511.17031
Pdf URL: https://arxiv.org/pdf/2511.17031
Copy Paste: [[2511.17031]] Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation(https://arxiv.org/abs/2511.17031)
Keywords: diffusion
Abstract: The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.

Title: OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Authors: Teng Fu, Mengyang Zhao, Ke Niu, Kaixin Peng, Bin Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17053
Pdf URL: https://arxiv.org/pdf/2511.17053
Copy Paste: [[2511.17053]] OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding(https://arxiv.org/abs/2511.17053)
Keywords: foundation model
Abstract: LVLMs have been shown to perform excellently in image-level tasks such as VQA and caption. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, there have been a number of new topics in combining object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified Pedestrian Tracking framework, namely OmniPT, which can track, track based on reference and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task into a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a training phase consisting of RL-Mid Training-SFT-RL. Based on the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output fixed and supervisable bounding box format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model's tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks and the experimental results demonstrate that the proposed method can perform better than the previous methods.

Title: ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Authors: Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17068
Pdf URL: https://arxiv.org/pdf/2511.17068
Copy Paste: [[2511.17068]] ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion(https://arxiv.org/abs/2511.17068)
Keywords: diffusion
Abstract: Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.

Title: Diversity Has Always Been There in Your Visual Autoregressive Models

Authors: Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17074
Pdf URL: https://arxiv.org/pdf/2511.17074
Copy Paste: [[2511.17074]] Diversity Has Always Been There in Your Visual Autoregressive Models(https://arxiv.org/abs/2511.17074)
Keywords: diffusion, generative
Abstract: Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from the diversity collapse i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only neglectable performance influences. Our code will be publicly released at this https URL.

Title: SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

Authors: Di Wu, Liu Liu, Xueyu Yuan, Qiaoyu Jun, Wenxiao Chen, Ruilong Yan, Yiming Tang, Liangtu Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17092
Pdf URL: https://arxiv.org/pdf/2511.17092
Copy Paste: [[2511.17092]] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting(https://arxiv.org/abs/2511.17092)
Keywords: diffusion
Abstract: Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. Then we compress 3D Gaussians into planar Gaussians to facilitate accurate estimation of normal and depth. The planar Gaussians are optimized in a coarse-to-fine manner through depth smooth regularization and few-shot diffusion. Moreover, we introduce a part segmentation probability for each Gaussian primitive and update them by back-projecting part segmentation masks of renderings. Extensive experimental results demonstrate that our method achieves higher-fidelity part-level surface reconstruction on both synthetic and real-world data than existing methods. Codes will be made publicly available.

Title: Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models

Authors: He Huang, Zixuan Hu, Dongxiao Li, Yao Xiao, Ling-Yu Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17094
Pdf URL: https://arxiv.org/pdf/2511.17094
Copy Paste: [[2511.17094]] Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models(https://arxiv.org/abs/2511.17094)
Keywords: anomaly
Abstract: Video anomaly detection (VAD) plays a vital role in real-world applications such as security surveillance, autonomous driving, and industrial monitoring. Recent advances in large pre-trained models have opened new opportunities for training-free VAD by leveraging rich prior knowledge and general reasoning capabilities. However, existing studies typically rely on dense frame-level inference, incurring high computational costs and latency. This raises a fundamental question: Is dense reasoning truly necessary when using powerful pre-trained models in VAD systems? To answer this, we propose ReCoVAD, a novel framework inspired by the dual reflex and conscious pathways of the human nervous system, enabling selective frame processing to reduce redundant computation. ReCoVAD consists of two core pathways: (i) a Reflex pathway that uses a lightweight CLIP-based module to fuse visual features with prototype prompts and produce decision vectors, which query a dynamic memory of past frames and anomaly scores for fast response; and (ii) a Conscious pathway that employs a medium-scale vision-language model to generate textual event descriptions and refined anomaly scores for novel frames. It continuously updates the memory and prototype prompts, while an integrated large language model periodically reviews accumulated descriptions to identify unseen anomalies, correct errors, and refine prototypes. Extensive experiments show that ReCoVAD achieves state-of-the-art training-free performance while processing only 28.55\% and 16.04\% of the frames used by previous methods on the UCF-Crime and XD-Violence datasets, demonstrating that sparse reasoning is sufficient for effective large-model-based VAD.

Title: AutoGraphAD: A novel approach using Variational Graph Autoencoders for anomalous network flow detection

Authors: Georgios Anyfantis, Pere Barlet-Ros
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17113
Pdf URL: https://arxiv.org/pdf/2511.17113
Copy Paste: [[2511.17113]] AutoGraphAD: A novel approach using Variational Graph Autoencoders for anomalous network flow detection(https://arxiv.org/abs/2511.17113)
Keywords: anomaly
Abstract: Network Intrusion Detection Systems (NIDS) are essential tools for detecting network attacks and intrusions. While extensive research has explored the use of supervised Machine Learning for attack detection and characterisation, these methods require accurately labelled datasets, which are very costly to obtain. Moreover, existing public datasets have limited and/or outdated attacks, and many of them suffer from mislabelled data. To reduce the reliance on labelled data, we propose AutoGraphAD, a novel unsupervised anomaly detection approach based on a Heterogeneous Variational Graph Autoencoder. AutoGraphAD operates on heterogeneous graphs, made from connection and IP nodes that capture network activity within a time window. The model is trained using unsupervised and contrastive learning, without relying on any labelled data. The reconstruction, structural loss, and KL divergence are then weighted and combined in an anomaly score that is then used for anomaly detection. Overall, AutoGraphAD yields the same, and in some cases better, results than previous unsupervised approaches, such as Anomal-E, but without requiring costly downstream anomaly detectors. As a result, AutoGraphAD achieves around 1.18 orders of magnitude faster training and 1.03 orders of magnitude faster inference, which represents a significant advantage for operational deployment.

Title: Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

Authors: Quentin Anthony, Yury Tokpanov, Skyler Szot, Srivatsan Rajagopal, Praneeth Medepalli, Rishi Iyer, Vasu Shyam, Anna Golubeva, Ansh Chaurasia, Xiao Yang, Tomas Figliolia, Robert Washbourne, Drew Thorstensen, Amartey Pearson, Zack Grossbart, Jason van Patten, Emad Barsoum, Zhenyu Gu, Yao Fu, Beren Millidge
Subjects: cs.CL, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2511.17127
Pdf URL: https://arxiv.org/pdf/2511.17127
Copy Paste: [[2511.17127]] Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design(https://arxiv.org/abs/2511.17127)
Keywords: foundation model
Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

Title: Four decades of circumpolar super-resolved satellite land surface temperature data

Authors: Sonia Dupuis, Nando Metzger, Konrad Schindler, Frank Göttsche, Stefan Wunderle
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17134
Pdf URL: https://arxiv.org/pdf/2511.17134
Copy Paste: [[2511.17134]] Four decades of circumpolar super-resolved satellite land surface temperature data(https://arxiv.org/abs/2511.17134)
Keywords: diffusion
Abstract: Land surface temperature (LST) is an essential climate variable (ECV) crucial for understanding land-atmosphere energy exchange and monitoring climate change, especially in the rapidly warming Arctic. Long-term satellite-based LST records, such as those derived from the Advanced Very High Resolution Radiometer (AVHRR), are essential for detecting climate trends. However, the coarse spatial resolution of AVHRR's global area coverage (GAC) data limit their utility for analyzing fine-scale permafrost dynamics and other surface processes in the Arctic. This paper presents a new 42 years pan-Arctic LST dataset, downscaled from AVHRR GAC to 1 km with a super-resolution algorithm based on a deep anisotropic diffusion model. The model is trained on MODIS LST data, using coarsened inputs and native-resolution outputs, guided by high-resolution land cover, digital elevation, and vegetation height maps. The resulting dataset provides twice-daily, 1 km LST observations for the entire pan-Arctic region over four decades. This enhanced dataset enables improved modelling of permafrost, reconstruction of near-surface air temperature, and assessment of surface mass balance of the Greenland Ice Sheet. Additionally, it supports climate monitoring efforts in the pre-MODIS era and offers a framework adaptable to future satellite missions for thermal infrared observation and climate data record continuity.

Title: One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Authors: Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17138
Pdf URL: https://arxiv.org/pdf/2511.17138
Copy Paste: [[2511.17138]] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution(https://arxiv.org/abs/2511.17138)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.

Title: DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving

Authors: Liuhan Yin, Runkun Ju, Guodong Guo, Erkang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17150
Pdf URL: https://arxiv.org/pdf/2511.17150
Copy Paste: [[2511.17150]] DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving(https://arxiv.org/abs/2511.17150)
Keywords: diffusion, generative
Abstract: Unlike discriminative approaches in autonomous driving that predict a fixed set of candidate trajectories of the ego vehicle, generative methods, such as diffusion models, learn the underlying distribution of future motion, enabling more flexible trajectory prediction. However, since these methods typically rely on denoising human-crafted trajectory anchors or random noise, there remains significant room for improvement. In this paper, we propose DiffRefiner, a novel two-stage trajectory prediction framework. The first stage uses a transformer-based Proposal Decoder to generate coarse trajectory predictions by regressing from sensor inputs using predefined trajectory anchors. The second stage applies a Diffusion Refiner that iteratively denoises and refines these initial predictions. In this way, we enhance the performance of diffusion-based planning by incorporating a discriminative trajectory proposal module, which provides strong guidance for the generative refinement process. Furthermore, we design a fine-grained denoising decoder to enhance scene compliance, enabling more accurate trajectory prediction through enhanced alignment with the surrounding environment. Experimental results demonstrate that DiffRefiner achieves state-of-the-art performance, attaining 87.4 EPDMS on NAVSIM v2, and 87.1 DS along with 71.4 SR on Bench2Drive, thereby setting new records on both public benchmarks. The effectiveness of each component is validated via ablation studies as well.

Title: Investigating self-supervised representations for audio-visual deepfake detection

Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Subjects: cs.CV, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2511.17181
Pdf URL: https://arxiv.org/pdf/2511.17181
Copy Paste: [[2511.17181]] Investigating self-supervised representations for audio-visual deepfake detection(https://arxiv.org/abs/2511.17181)
Keywords: self-supervised
Abstract: Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.

Title: Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning

Authors: Jiayi Wang, Wei Dai, Haoyu Wang, Sihan Yang, Haixia Bi, Jian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17201
Pdf URL: https://arxiv.org/pdf/2511.17201
Copy Paste: [[2511.17201]] Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning(https://arxiv.org/abs/2511.17201)
Keywords: foundation model
Abstract: In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation-learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM's zero-shot priors to preserve strong performance on unseen medical datasets. Experimented across nine medical segmentation datasets under continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released on \mbox{this https URL.}

Title: Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

Authors: Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17209
Pdf URL: https://arxiv.org/pdf/2511.17209
Copy Paste: [[2511.17209]] Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers(https://arxiv.org/abs/2511.17209)
Keywords: self-supervised, foundation model
Abstract: We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

Title: DelTriC: A Novel Clustering Method with Accurate Outlier

Authors: Tomas Javurek, Michal Gregor, Sebastian Kula, Marian Simko
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17219
Pdf URL: https://arxiv.org/pdf/2511.17219
Copy Paste: [[2511.17219]] DelTriC: A Novel Clustering Method with Accurate Outlier(https://arxiv.org/abs/2511.17219)
Keywords: anomaly
Abstract: The paper introduces DelTriC (Delaunay Triangulation Clustering), a clustering algorithm which integrates PCA/UMAP-based projection, Delaunay triangulation, and a novel back-projection mechanism to form clusters in the original high-dimensional space. DelTriC decouples neighborhood construction from decision-making by first triangulating in a low-dimensional proxy to index local adjacency, and then back-projecting to the original space to perform robust edge pruning, merging, and anomaly detection. DelTriC can outperform traditional methods such as k-means, DBSCAN, and HDBSCAN in many scenarios; it is both scalable and accurate, and it also significantly improves outlier detection.

Title: QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Authors: Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.17221
Pdf URL: https://arxiv.org/pdf/2511.17221
Copy Paste: [[2511.17221]] QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy(https://arxiv.org/abs/2511.17221)
Keywords: self-supervised, foundation model
Abstract: Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. this https URL

Title: FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble

Authors: Riccardo Tedoldi, Ola Engkvist, Patrick Bryant, Hossein Azizpour, Jon Paul Janet, Alessandro Tibo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17249
Pdf URL: https://arxiv.org/pdf/2511.17249
Copy Paste: [[2511.17249]] FlexiFlow: decomposable flow matching for generation of flexible molecular ensemble(https://arxiv.org/abs/2511.17249)
Keywords: diffusion
Abstract: Sampling useful three-dimensional molecular structures along with their most favorable conformations is a key challenge in drug discovery. Current state-of-the-art 3D de-novo design flow matching or diffusion-based models are limited to generating a single conformation. However, the conformational landscape of a molecule determines its observable properties and how tightly it is able to bind to a given protein target. By generating a representative set of low-energy conformers, we can more directly assess these properties and potentially improve the ability to generate molecules with desired thermodynamic observables. Towards this aim, we propose FlexiFlow, a novel architecture that extends flow-matching models, allowing for the joint sampling of molecules along with multiple conformations while preserving both equivariance and permutation invariance. We demonstrate the effectiveness of our approach on the QM9 and GEOM Drugs datasets, achieving state-of-the-art results in molecular generation tasks. Our results show that FlexiFlow can generate valid, unstrained, unique, and novel molecules with high fidelity to the training data distribution, while also capturing the conformational diversity of molecules. Moreover, we show that our model can generate conformational ensembles that provide similar coverage to state-of-the-art physics-based methods at a fraction of the inference time. Finally, FlexiFlow can be successfully transferred to the protein-conditioned ligand generation task, even when the dataset contains only static pockets without accompanying conformations.

Title: Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17254
Pdf URL: https://arxiv.org/pdf/2511.17254
Copy Paste: [[2511.17254]] Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats(https://arxiv.org/abs/2511.17254)
Keywords: generative
Abstract: Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

Title: A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Authors: Bulat Khaertdinov, Mirela Popa, Nava Tintarev
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2511.17255
Pdf URL: https://arxiv.org/pdf/2511.17255
Copy Paste: [[2511.17255]] A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback(https://arxiv.org/abs/2511.17255)
Keywords: generative
Abstract: Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.

Title: Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing

Authors: Suchetan G. Uppur, Hemant Kumar, Vaibhav Kumar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17269
Pdf URL: https://arxiv.org/pdf/2511.17269
Copy Paste: [[2511.17269]] Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing(https://arxiv.org/abs/2511.17269)
Keywords: diffusion
Abstract: Training autonomous driving and navigation systems requires large and diverse point cloud datasets that capture complex edge case scenarios from various dynamic urban settings. Acquiring such diverse scenarios from real-world point cloud data, especially for critical edge cases, is challenging, which restricts system generalization and robustness. Current methods rely on simulating point cloud data within handcrafted 3D virtual environments, which is time-consuming, computationally expensive, and often fails to fully capture the complexity of real-world scenes. To address some of these issues, this research proposes a novel approach that addresses the problem discussed by editing real-world LiDAR scans using semantic mask-based guidance to generate novel synthetic LiDAR point clouds. We incorporate range image projection and semantic mask conditioning to achieve diffusion-based generation. Point clouds are transformed to 2D range view images, which are used as an intermediate representation to enable semantic editing using convex hull-based semantic masks. These masks guide the generation process by providing information on the dimensions, orientations, and locations of objects in the real environment, ensuring geometric consistency and realism. This approach demonstrates high-quality LiDAR point cloud generation, capable of producing complex edge cases and dynamic scenes, as validated on the KITTI-360 dataset. This offers a cost-effective and scalable solution for generating diverse LiDAR data, a step toward improving the robustness of autonomous driving systems.

Title: SpatialGeo:Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

Authors: Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17308
Pdf URL: https://arxiv.org/pdf/2511.17308
Copy Paste: [[2511.17308]] SpatialGeo:Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion(https://arxiv.org/abs/2511.17308)
Keywords: self-supervised
Abstract: Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via this https URL.

Title: MuM: Multi-View Masked Image Modeling for 3D Vision

Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17309
Pdf URL: https://arxiv.org/pdf/2511.17309
Copy Paste: [[2511.17309]] MuM: Multi-View Masked Image Modeling for 3D Vision(https://arxiv.org/abs/2511.17309)
Keywords: self-supervised
Abstract: Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.

Title: Self-supervised denoising of raw tomography detector data for improved image reconstruction

Authors: Israt Jahan Tulin, Sebastian Starke, Dominic Windisch, André Bieberle, Peter Steinbach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17312
Pdf URL: https://arxiv.org/pdf/2511.17312
Copy Paste: [[2511.17312]] Self-supervised denoising of raw tomography detector data for improved image reconstruction(https://arxiv.org/abs/2511.17312)
Keywords: self-supervised
Abstract: Ultrafast electron beam X-ray computed tomography produces noisy data due to short measurement times, causing reconstruction artifacts and limiting overall image quality. To counteract these issues, two self-supervised deep learning methods for denoising of raw detector data were investigated and compared against a non-learning based denoising method. We found that the application of the deep-learning-based methods was able to enhance signal-to-noise ratios in the detector data and also led to consistent improvements of the reconstructed images, outperforming the non-learning based method.

Title: ReBaPL: Repulsive Bayesian Prompt Learning

Authors: Yassir Bendou, Omar Ezzahir, Eduardo Fernandes Montesuma, Gabriel Mahuas, Victoria Shevchenko, Mike Gartrell
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.17339
Pdf URL: https://arxiv.org/pdf/2511.17339
Copy Paste: [[2511.17339]] ReBaPL: Repulsive Bayesian Prompt Learning(https://arxiv.org/abs/2511.17339)
Keywords: foundation model
Abstract: Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt tuning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art methods for prompt learning.

Title: Refracting Reality: Generating Images with Realistic Transparent Objects

Authors: Yue Yin, Enze Tao, Dylan Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17340
Pdf URL: https://arxiv.org/pdf/2511.17340
Copy Paste: [[2511.17340]] Refracting Reality: Generating Images with Realistic Transparent Objects(https://arxiv.org/abs/2511.17340)
Keywords: generative
Abstract: Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image -- a panorama centered at the object -- using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.

Title: Loomis Painter: Reconstructing the Painting Process

Authors: Markus Pobitzer, Chang Liu, Chenyi Zhuang, Teng Long, Bin Ren, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17344
Pdf URL: https://arxiv.org/pdf/2511.17344
Copy Paste: [[2511.17344]] Loomis Painter: Reconstructing the Painting Process(https://arxiv.org/abs/2511.17344)
Keywords: diffusion, generative
Abstract: Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often show temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion models conditional space and uses cross-medium style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.

Title: DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Authors: Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17354
Pdf URL: https://arxiv.org/pdf/2511.17354
Copy Paste: [[2511.17354]] DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture(https://arxiv.org/abs/2511.17354)
Keywords: self-supervised
Abstract: Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues -- a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: this https URL.

Title: UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification

Authors: Taixi Chen, Jingyun Chen, Nancy Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17355
Pdf URL: https://arxiv.org/pdf/2511.17355
Copy Paste: [[2511.17355]] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification(https://arxiv.org/abs/2511.17355)
Keywords: foundation model
Abstract: Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encode capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.

Title: SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation

Authors: Seamie Hayes, Reenu Mohandas, Tim Brophy, Alexandre Boulch, Ganesh Sistu, Ciaran Eising
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17361
Pdf URL: https://arxiv.org/pdf/2511.17361
Copy Paste: [[2511.17361]] SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation(https://arxiv.org/abs/2511.17361)
Keywords: self-supervised
Abstract: Semantic occupancy estimation enables comprehensive scene understanding for automated driving, providing dense spatial and semantic information essential for perception and planning. While Gaussian representations have been widely adopted in self-supervised occupancy estimation, the deployment of a large number of Gaussian primitives drastically increases memory requirements and is not suitable for real-time inference. In contrast, superquadrics permit reduced primitive count and lower memory requirements due to their diverse shape set. However, implementation into a self-supervised occupancy model is nontrivial due to the absence of a superquadric rasterizer to enable model supervision. Our proposed method, SuperQuadricOcc, employs a superquadric-based scene representation. By leveraging a multi-layer icosphere-tessellated Gaussian approximation of superquadrics, we enable Gaussian rasterization for supervision during training. On the Occ3D dataset, SuperQuadricOcc achieves a 75\% reduction in memory footprint, 124\% faster inference, and a 5.9\% improvement in mIoU compared to previous Gaussian-based methods, without the use of temporal labels. To our knowledge, this is the first occupancy model to enable real-time inference while maintaining competitive performance. The use of superquadrics reduces the number of primitives required for scene modeling by 84\% relative to Gaussian-based approaches. Finally, evaluation against prior methods is facilitated by our fast superquadric voxelization module. The code will be released as open source.

Title: Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks

Authors: Georgia Baltsou, Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17393
Pdf URL: https://arxiv.org/pdf/2511.17393
Copy Paste: [[2511.17393]] Designing and Generating Diverse, Equitable Face Image Datasets for Face Verification Tasks(https://arxiv.org/abs/2511.17393)
Keywords: generative
Abstract: Face verification is a significant component of identity authentication in various applications including online banking and secure access to personal devices. The majority of the existing face image datasets often suffer from notable biases related to race, gender, and other demographic characteristics, limiting the effectiveness and fairness of face verification systems. In response to these challenges, we propose a comprehensive methodology that integrates advanced generative models to create varied and diverse high-quality synthetic face images. This methodology emphasizes the representation of a diverse range of facial traits, ensuring adherence to characteristics permissible in identity card photographs. Furthermore, we introduce the Diverse and Inclusive Faces for Verification (DIF-V) dataset, comprising 27,780 images of 926 unique identities, designed as a benchmark for future research in face verification. Our analysis reveals that existing verification models exhibit biases toward certain genders and races, and notably, applying identity style modifications negatively impacts model performance. By tackling the inherent inequities in existing datasets, this work not only enriches the discussion on diversity and ethics in artificial intelligence but also lays the foundation for developing more inclusive and reliable face verification technologies

Title: Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?

Authors: Sukwon Yun, Heming Yao, Burkhard Hoeckendorf, David Richmond, Aviv Regev, Russell Littman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17400
Pdf URL: https://arxiv.org/pdf/2511.17400
Copy Paste: [[2511.17400]] Sparse Mixture-of-Experts for Multi-Channel Imaging: Are All Channel Interactions Required?(https://arxiv.org/abs/2511.17400)
Keywords: foundation model
Abstract: Vision Transformers ($\text{ViTs}$) have become the backbone of vision foundation models, yet their optimization for multi-channel domains - such as cell painting or satellite imagery - remains underexplored. A key challenge in these domains is capturing interactions between channels, as each channel carries different information. While existing works have shown efficacy by treating each channel independently during tokenization, this approach naturally introduces a major computational bottleneck in the attention block - channel-wise comparisons leads to a quadratic growth in attention, resulting in excessive $\text{FLOPs}$ and high training cost. In this work, we shift focus from efficacy to the overlooked efficiency challenge in cross-channel attention and ask: "Is it necessary to model all channel interactions?". Inspired by the philosophy of Sparse Mixture-of-Experts ($\text{MoE}$), we propose MoE-ViT, a Mixture-of-Experts architecture for multi-channel images in $\text{ViTs}$, which treats each channel as an expert and employs a lightweight router to select only the most relevant experts per patch for attention. Proof-of-concept experiments on real-world datasets - JUMP-CP and So2Sat - demonstrate that $\text{MoE-ViT}$ achieves substantial efficiency gains without sacrificing, and in some cases enhancing, performance, making it a practical and attractive backbone for multi-channel imaging.

Title: Self-Supervised Learning by Curvature Alignment

Authors: Benyamin Ghojogh, M.Hadi Sepanj, Paul Fieguth
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2511.17426
Pdf URL: https://arxiv.org/pdf/2511.17426
Copy Paste: [[2511.17426]] Self-Supervised Learning by Curvature Alignment(https://arxiv.org/abs/2511.17426)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose $k$ nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.

Title: REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

Authors: Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.17442
Pdf URL: https://arxiv.org/pdf/2511.17442
Copy Paste: [[2511.17442]] REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing(https://arxiv.org/abs/2511.17442)
Keywords: foundation model, in-context
Abstract: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

Title: Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.17450
Pdf URL: https://arxiv.org/pdf/2511.17450
Copy Paste: [[2511.17450]] Planning with Sketch-Guided Verification for Physics-Aware Video Generation(https://arxiv.org/abs/2511.17450)
Keywords: diffusion
Abstract: Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incuring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.

Title: Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift

Authors: Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17455
Pdf URL: https://arxiv.org/pdf/2511.17455
Copy Paste: [[2511.17455]] Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift(https://arxiv.org/abs/2511.17455)
Keywords: foundation model
Abstract: Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: this https URL.

Title: Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Authors: Zhen Wang, Zhifeng Gao, Guolin Ke
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.17473
Pdf URL: https://arxiv.org/pdf/2511.17473
Copy Paste: [[2511.17473]] Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards(https://arxiv.org/abs/2511.17473)
Keywords: self-supervised
Abstract: Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in only outcome-verifiable settings.

Title: Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Authors: Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17481
Pdf URL: https://arxiv.org/pdf/2511.17481
Copy Paste: [[2511.17481]] Counterfactual World Models via Digital Twin-conditioned Video Diffusion(https://arxiv.org/abs/2511.17481)
Keywords: diffusion
Abstract: World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.

Title: Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions

Authors: Neel Sortur, Justin Goodwin, Purvik Patel, Luis Enrique Martinez Jr, Tzofi Klinghoffer, Rajmonda S. Caceres, Robin Walters
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17484
Pdf URL: https://arxiv.org/pdf/2511.17484
Copy Paste: [[2511.17484]] Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions(https://arxiv.org/abs/2511.17484)
Keywords: diffusion
Abstract: Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.

Title: An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Authors: Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17485
Pdf URL: https://arxiv.org/pdf/2511.17485
Copy Paste: [[2511.17485]] An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI(https://arxiv.org/abs/2511.17485)
Keywords: generative
Abstract: The human spine is a complex structure composed of 33 vertebrae. It holds the body and is important for leading a healthy life. The spine is vulnerable to age-related degenerations that can be identified through magnetic resonance imaging (MRI). In this paper we propose a novel computer-vison-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.

Title: EvDiff: High Quality Video with an Event Camera

Authors: Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.17492
Pdf URL: https://arxiv.org/pdf/2511.17492
Copy Paste: [[2511.17492]] EvDiff: High Quality Video with an Event Camera(https://arxiv.org/abs/2511.17492)
Keywords: diffusion
Abstract: As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.