2025-12-17

Title: Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution

Authors: Jacob Schnell, Aditya Makkar, Gunadi Gani, Aniket Srinivasan Ashok, Darren Lo, Mike Optis, Alexander Wong, Yuhao Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13729
Pdf URL: https://arxiv.org/pdf/2512.13729
Copy Paste: [[2512.13729]] Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution(https://arxiv.org/abs/2512.13729)
Keywords: diffusion
Abstract: Various weather modelling problems (e.g., weather forecasting, optimizing turbine placements, etc.) require ample access to high-resolution, highly accurate wind data. Acquiring such high-resolution wind data, however, remains a challenging and expensive endeavour. Traditional reconstruction approaches are typically either cost-effective or accurate, but not both. Deep learning methods, including diffusion models, have been proposed to resolve this trade-off by leveraging advances in natural image super-resolution. Wind data, however, is distinct from natural images, and wind super-resolvers often use upwards of 10 input channels, significantly more than the usual 3-channel RGB inputs in natural images. To better leverage a large number of conditioning variables in diffusion models, we present a generalization of classifier-free guidance (CFG) to multiple conditioning inputs. Our novel composite classifier-free guidance (CCFG) can be dropped into any pre-trained diffusion model trained with standard CFG dropout. We demonstrate that CCFG outputs are higher-fidelity than those from CFG on wind super-resolution tasks. We present WindDM, a diffusion model trained for industrial-scale wind dynamics reconstruction and leveraging CCFG. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to $1000\times$ less than classical methods.

Title: PIS: A Generalized Physical Inversion Solver for Arbitrary Sparse Observations via Set-Conditioned Diffusion

Authors: Weijie Yang, Xun Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13732
Pdf URL: https://arxiv.org/pdf/2512.13732
Copy Paste: [[2512.13732]] PIS: A Generalized Physical Inversion Solver for Arbitrary Sparse Observations via Set-Conditioned Diffusion(https://arxiv.org/abs/2512.13732)
Keywords: diffusion
Abstract: Estimation of PDE-constrained physical parameters from limited indirect measurements is inherently ill-posed, particularly when observations are sparse, irregular, and constrained by real-world sensor placement. This challenge is ubiquitous in fields such as fluid mechanics, seismic inversion, and structural health monitoring. Existing deep and operator-learning models collapse under these conditions: fixed-grid assumptions fail, reconstruction deteriorates sharply, and inversion becomes unreliable with limited robustness and no uncertainty quantification (UQ).We propose the Physical Inversion Solver (PIS), a set-conditioned diffusion framework enabling inversion from truly arbitrary observation sets. PIS employs a Set Transformer-based encoder to handle measurements of any number or geometry, and a cosine-annealed sparsity curriculum for exceptional robustness. An accompanying information-theoretic analysis provides insight into the limits of inversion under extreme sparsity by revealing how observation entropy varies across physical this http URL is evaluated on three challenging PDE inverse problems: Darcy flow, wavefield inversion (Helmholtz), and structural health monitoring (Hooke's Law). Across all tasks and sparsity regimes -- including extreme cases with an observation rate of only $0.29\%$ -- existing operator-learning baselines fail to reconstruct meaningful fields, often diverging or collapsing this http URL stark contrast, PIS remains stable and accurate, reducing inversion error by $12.28\%$--$88.73\%$ and reliably producing calibrated posterior samples. These samples accurately reflect both data scarcity and intrinsic physical ambiguity. These results position PIS as a powerful, general-purpose, and uniquely sparsity-resilient solution for physical inversion under arbitrary and severely undersampled observations.

Title: DARTs: A Dual-Path Robust Framework for Anomaly Detection in High-Dimensional Multivariate Time Series

Authors: Xuechun Liu, Heli Sun, Xuecheng Wu, Ruichen Cao, Yunyun Shi, Dingkang Yang, Haoran Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13735
Pdf URL: https://arxiv.org/pdf/2512.13735
Copy Paste: [[2512.13735]] DARTs: A Dual-Path Robust Framework for Anomaly Detection in High-Dimensional Multivariate Time Series(https://arxiv.org/abs/2512.13735)
Keywords: diffusion, anomaly
Abstract: Multivariate time series anomaly detection (MTSAD) aims to accurately identify and localize complex abnormal patterns in the large-scale industrial control systems. While existing approaches excel in recognizing the distinct patterns under the low-dimensional scenarios, they often fail to robustly capture long-range spatiotemporal dependencies when learning representations from the high-dimensional noisy time series. To address these limitations, we propose DARTs, a robust long short-term dual-path framework with window-aware spatiotemporal soft fusion mechanism, which can be primarily decomposed into three complementary components. Specifically, in the short-term path, we introduce a Multi-View Sparse Graph Learner and a Diffusion Multi-Relation Graph Unit that collaborate to adaptively capture hierarchical discriminative short-term spatiotemporal patterns in the high-noise time series. While in the long-term path, we design a Multi-Scale Spatiotemporal Graph Constructor to model salient long-term dynamics within the high-dimensional representation space. Finally, a window-aware spatiotemporal soft-fusion mechanism is introduced to filter the residual noise while seamlessly integrating anomalous patterns. Extensive qualitative and quantitative experimental results across mainstream datasets demonstrate the superiority and robustness of our proposed DARTs. A series of ablation studies are also conducted to explore the crucial design factors of our proposed components. Our code and model will be made publicly open soon.

Title: TF-MCL: Time-frequency Fusion and Multi-domain Cross-Loss for Self-supervised Depression Detection

Authors: Li-Xuan Zhao, Chen-Yang Xu, Wen-Qiang Li, Bo Wang, Rong-Xing Wei, Qing-Hao Menga
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13736
Pdf URL: https://arxiv.org/pdf/2512.13736
Copy Paste: [[2512.13736]] TF-MCL: Time-frequency Fusion and Multi-domain Cross-Loss for Self-supervised Depression Detection(https://arxiv.org/abs/2512.13736)
Keywords: self-supervised
Abstract: In recent years, there has been a notable increase in the use of supervised detection methods of major depressive disorder (MDD) based on electroencephalogram (EEG) signals. However, the process of labeling MDD remains challenging. As a self-supervised learning method, contrastive learning could address the shortcomings of supervised learning methods, which are unduly reliant on labels in the context of MDD detection. However, existing contrastive learning methods are not specifically designed to characterize the time-frequency distribution of EEG signals, and their capacity to acquire low-semantic data representations is still inadequate for MDD detection tasks. To address the problem of contrastive learning method, we propose a time-frequency fusion and multi-domain cross-loss (TF-MCL) model for MDD detection. TF-MCL generates time-frequency hybrid representations through the use of a fusion mapping head (FMH), which efficiently remaps time-frequency domain information to the fusion domain, and thus can effectively enhance the model's capacity to synthesize time-frequency information. Moreover, by optimizing the multi-domain cross-loss function, the distribution of the representations in the time-frequency domain and the fusion domain is reconstructed, thereby improving the model's capacity to acquire fusion representations. We evaluated the performance of our model on the publicly available datasets MODMA and PRED+CT and show a significant improvement in accuracy, outperforming the existing state-of-the-art (SOTA) method by 5.87% and 9.96%, respectively.

Title: Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

Authors: Siyuan Dai, Lunxiao Li, Kun Zhao, Eardi Lila, Paul K. Crane, Heng Huang, Dongkuan Xu, Haoteng Tang, Liang Zhan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13747
Pdf URL: https://arxiv.org/pdf/2512.13747
Copy Paste: [[2512.13747]] Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making(https://arxiv.org/abs/2512.13747)
Keywords: in-context
Abstract: With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.

Title: STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Authors: Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13752
Pdf URL: https://arxiv.org/pdf/2512.13752
Copy Paste: [[2512.13752]] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning(https://arxiv.org/abs/2512.13752)
Keywords: generative
Abstract: Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.

Title: MoLingo: Motion-Language Alignment for Text-to-Motion Generation

Authors: Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13840
Pdf URL: https://arxiv.org/pdf/2512.13840
Copy Paste: [[2512.13840]] MoLingo: Motion-Language Alignment for Text-to-Motion Generation(https://arxiv.org/abs/2512.13840)
Keywords: diffusion
Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.

Title: Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Authors: Wenda Li, Meng Wu, Sungmin Eum, Heesung Kwon, Qing Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13869
Pdf URL: https://arxiv.org/pdf/2512.13869
Copy Paste: [[2512.13869]] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models(https://arxiv.org/abs/2512.13869)
Keywords: diffusion
Abstract: Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data, with a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our methods significantly improve the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to $+14.1$ improvement of mAP50 on Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at \href{this https URL}{this url}.

Title: A Complete Guide to Spherical Equivariant Graph Transformers

Authors: Sophia Tang
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.13927
Pdf URL: https://arxiv.org/pdf/2512.13927
Copy Paste: [[2512.13927]] A Complete Guide to Spherical Equivariant Graph Transformers(https://arxiv.org/abs/2512.13927)
Keywords: generative
Abstract: Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three-dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message-passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling - from group representations and spherical harmonics, to tensor products, Clebsch-Gordan decomposition, and the construction of SO(3)-equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)-Transformer architectures and explain how they perform equivariant message-passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self-contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.

Title: Informing Acquisition Functions via Foundation Models for Molecular Discovery

Authors: Qi Chen, Fabio Ramos, Alán Aspuru-Guzik, Florian Shkurti
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.13935
Pdf URL: https://arxiv.org/pdf/2512.13935
Copy Paste: [[2512.13935]] Informing Acquisition Functions via Foundation Models for Molecular Discovery(https://arxiv.org/abs/2512.13935)
Keywords: foundation model, in-context
Abstract: Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.

Title: Pattern-Guided Diffusion Models

Authors: Vivian Lin, Kuk Jin Jang, Wenwen Si, Insup Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13945
Pdf URL: https://arxiv.org/pdf/2512.13945
Copy Paste: [[2512.13945]] Pattern-Guided Diffusion Models(https://arxiv.org/abs/2512.13945)
Keywords: diffusion
Abstract: Diffusion models have shown promise in forecasting future data from multivariate time series. However, few existing methods account for recurring structures, or patterns, that appear within the data. We present Pattern-Guided Diffusion Models (PGDM), which leverage inherent patterns within temporal data for forecasting future time steps. PGDM first extracts patterns using archetypal analysis and estimates the most likely next pattern in the sequence. By guiding predictions with this pattern estimate, PGDM makes more realistic predictions that fit within the set of known patterns. We additionally introduce a novel uncertainty quantification technique based on archetypal analysis, and we dynamically scale the guidance level based on the pattern estimate uncertainty. We apply our method to two well-motivated forecasting applications, predicting visual field measurements and motion capture frames. On both, we show that pattern guidance improves PGDM's performance (MAE / CRPS) by up to 40.67% / 56.26% and 14.12% / 14.10%, respectively. PGDM also outperforms baselines by up to 65.58% / 84.83% and 93.64% / 92.55%.

Title: An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes

Authors: Alban Gauthier, Valentin Deschaintre, Alexandre Lanvin, Fredo Durand, Adrien Bousseau, George Drettakis
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.13950
Pdf URL: https://arxiv.org/pdf/2512.13950
Copy Paste: [[2512.13950]] An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes(https://arxiv.org/abs/2512.13950)
Keywords: generative
Abstract: Digital content creation is experiencing a profound change with the advent of deep generative models. For texturing, conditional image generators now allow the synthesis of realistic RGB images of a 3D scene that align with the geometry of that scene. For appearance modeling, SVBRDF prediction networks recover material parameters from RGB images. Combining these technologies allows us to quickly generate SVBRDF maps for multiple views of a 3D scene, which can be merged to form a SVBRDF texture atlas of that scene. In this paper, we analyze the challenges and opportunities for SVBRDF prediction in the context of such a fast appearance modeling pipeline. On the one hand, single-view SVBRDF predictions might suffer from multiview incoherence and yield inconsistent texture atlases. On the other hand, generated RGB images, and the different modalities on which they are conditioned, can provide additional information for SVBRDF estimation compared to photographs. We compare neural architectures and conditions to identify designs that achieve high accuracy and coherence. We find that, surprisingly, a standard UNet is competitive with more complex designs. Project page: this http URL

Title: From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Authors: Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13953
Pdf URL: https://arxiv.org/pdf/2512.13953
Copy Paste: [[2512.13953]] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation(https://arxiv.org/abs/2512.13953)
Keywords: diffusion
Abstract: The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: this https URL.

Title: Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation

Authors: Miaohua Zhang, Mohammad Ali Armin, Xuesong Li, Sisi Liang, Lars Petersson, Changming Sun, David Ahmedt-Aristizabal, Zeeshan Hayder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13970
Pdf URL: https://arxiv.org/pdf/2512.13970
Copy Paste: [[2512.13970]] Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation(https://arxiv.org/abs/2512.13970)
Keywords: diffusion
Abstract: Marine obstacle detection demands robust segmentation under challenging conditions, such as sun glitter, fog, and rapidly changing wave patterns. These factors degrade image quality, while the scarcity and structural repetition of marine datasets limit the diversity of available training data. Although mask-conditioned diffusion models can synthesize layout-aligned samples, they often produce low-diversity outputs when conditioned on low-entropy masks and prompts, limiting their utility for improving robustness. In this paper, we propose a quality-driven and diversity-aware sample expansion pipeline that generates training data entirely at inference time, without retraining the diffusion model. The framework combines two key components:(i) a class-aware style bank that constructs high-entropy, semantically grounded prompts, and (ii) an adaptive annealing sampler that perturbs early conditioning, while a COD-guided proportional controller regulates this perturbation to boost diversity without compromising layout fidelity. Across marine obstacle benchmarks, augmenting training data with these controlled synthetic samples consistently improves segmentation performance across multiple backbones and increases visual variation in rare and texture-sensitive classes.

Title: Repurposing 2D Diffusion Models for 3D Shape Completion

Authors: Yao He, Youngjoong Kwon, Tiange Xiang, Wenxiao Cai, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13991
Pdf URL: https://arxiv.org/pdf/2512.13991
Copy Paste: [[2512.13991]] Repurposing 2D Diffusion Models for 3D Shape Completion(https://arxiv.org/abs/2512.13991)
Keywords: diffusion, generative
Abstract: We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.

Title: Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14008
Pdf URL: https://arxiv.org/pdf/2512.14008
Copy Paste: [[2512.14008]] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models(https://arxiv.org/abs/2512.14008)
Keywords: diffusion
Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

Title: EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment

Authors: Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, Soonyoung Lee
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.14019
Pdf URL: https://arxiv.org/pdf/2512.14019
Copy Paste: [[2512.14019]] EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment(https://arxiv.org/abs/2512.14019)
Keywords: foundation model
Abstract: Cancer progression arises from interactions across multiple biological layers, especially beyond morphological and across molecular layers that remain invisible to image-only models. To capture this broader biological landscape, we present EXAONE Path 2.5, a pathology foundation model that jointly models histologic, genomic, epigenetic and transcriptomic modalities, producing an integrated patient representation that reflects tumor biology more comprehensively. Our approach incorporates three key components: (1) multimodal SigLIP loss enabling all-pairwise contrastive learning across heterogeneous modalities, (2) a fragment-aware rotary positional encoding (F-RoPE) module that preserves spatial structure and tissue-fragment topology in WSI, and (3) domain-specialized internal foundation models for both WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment. We evaluate EXAONE Path 2.5 against six leading pathology foundation models across two complementary benchmarks: an internal real-world clinical dataset and the Patho-Bench benchmark covering 80 tasks. Our framework demonstrates high data and parameter efficiency, achieving on-par performance with state-of-the-art foundation models on Patho-Bench while exhibiting the highest adaptability in the internal clinical setting. These results highlight the value of biologically informed multimodal design and underscore the potential of integrated genotype-to-phenotype modeling for next-generation precision oncology.

Title: Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers

Authors: Yibing Fu, Yunpeng Zhao, Zhitao Zeng, Cheng Chen, Yueming Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14026
Pdf URL: https://arxiv.org/pdf/2512.14026
Copy Paste: [[2512.14026]] Unleashing the Power of Image-Tabular Self-Supervised Learning via Breaking Cross-Tabular Barriers(https://arxiv.org/abs/2512.14026)
Keywords: self-supervised
Abstract: Multi-modal learning integrating medical images and tabular data has significantly advanced clinical decision-making in recent years. Self-Supervised Learning (SSL) has emerged as a powerful paradigm for pretraining these models on large-scale unlabeled image-tabular data, aiming to learn discriminative representations. However, existing SSL methods for image-tabular representation learning are often confined to specific data cohorts, mainly due to their rigid tabular modeling mechanisms when modeling heterogeneous tabular data. This inter-tabular barrier hinders the multi-modal SSL methods from effectively learning transferrable medical knowledge shared across diverse cohorts. In this paper, we propose a novel SSL framework, namely CITab, designed to learn powerful multi-modal feature representations in a cross-tabular manner. We design the tabular modeling mechanism from a semantic-awareness perspective by integrating column headers as semantic cues, which facilitates transferrable knowledge learning and the scalability in utilizing multiple data sources for pretraining. Additionally, we propose a prototype-guided mixture-of-linear layer (P-MoLin) module for tabular feature specialization, empowering the model to effectively handle the heterogeneity of tabular data and explore the underlying medical concepts. We conduct comprehensive evaluations on Alzheimer's disease diagnosis task across three publicly available data cohorts containing 4,461 subjects. Experimental results demonstrate that CITab outperforms state-of-the-art approaches, paving the way for effective and scalable cross-tabular multi-modal learning.

Title: A Deep Dive into Function Inlining and its Security Implications for ML-based Binary Analysis

Authors: Omar Abusabha, Jiyong Uhm, Tamer Abuhmed, Hyungjoon Koo
Subjects: cs.CR, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2512.14045
Pdf URL: https://arxiv.org/pdf/2512.14045
Copy Paste: [[2512.14045]] A Deep Dive into Function Inlining and its Security Implications for ML-based Binary Analysis(https://arxiv.org/abs/2512.14045)
Keywords: generative
Abstract: A function inlining optimization is a widely used transformation in modern compilers, which replaces a call site with the callee's body in need. While this transformation improves performance, it significantly impacts static features such as machine instructions and control flow graphs, which are crucial to binary analysis. Yet, despite its broad impact, the security impact of function inlining remains underexplored to date. In this paper, we present the first comprehensive study of function inlining through the lens of machine learning-based binary analysis. To this end, we dissect the inlining decision pipeline within the LLVM's cost model and explore the combinations of the compiler options that aggressively promote the function inlining ratio beyond standard optimization levels, which we term extreme inlining. We focus on five ML-assisted binary analysis tasks for security, using 20 unique models to systematically evaluate their robustness under extreme inlining scenarios. Our extensive experiments reveal several significant findings: i) function inlining, though a benign transformation in intent, can (in)directly affect ML model behaviors, being potentially exploited by evading discriminative or generative ML models; ii) ML models relying on static features can be highly sensitive to inlining; iii) subtle compiler settings can be leveraged to deliberately craft evasive binary variants; and iv) inlining ratios vary substantially across applications and build configurations, undermining assumptions of consistency in training and evaluation of ML models.

Title: FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

Authors: Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14056
Pdf URL: https://arxiv.org/pdf/2512.14056
Copy Paste: [[2512.14056]] FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling(https://arxiv.org/abs/2512.14056)
Keywords: diffusion, self-supervised
Abstract: Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.

Title: Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution

Authors: Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14061
Pdf URL: https://arxiv.org/pdf/2512.14061
Copy Paste: [[2512.14061]] Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution(https://arxiv.org/abs/2512.14061)
Keywords: diffusion, generative
Abstract: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.

Title: Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Authors: Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14067
Pdf URL: https://arxiv.org/pdf/2512.14067
Copy Paste: [[2512.14067]] Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed(https://arxiv.org/abs/2512.14067)
Keywords: diffusion
Abstract: Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.

Title: SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Authors: Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14068
Pdf URL: https://arxiv.org/pdf/2512.14068
Copy Paste: [[2512.14068]] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding(https://arxiv.org/abs/2512.14068)
Keywords: diffusion
Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present \textbf{SDAR-VL}, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an \emph{integrated framework for efficient and stable training}. This framework unifies three components: (1) \textbf{Asynchronous Block-wise Noise Scheduling} to diversify supervision within each batch; (2) \textbf{Effective Mask Ratio Scaling} for unbiased loss normalization under stochastic masking; and (3) a \textbf{Progressive Beta Noise Curriculum} that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves \emph{training efficiency}, \emph{convergence stability}, and \emph{task performance} over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.

Title: FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis

Authors: Da Zhang, Bingyu Li, Zhiyuan Zhao, Feiping Nie, Junyu Gao, Xuelong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14078
Pdf URL: https://arxiv.org/pdf/2512.14078
Copy Paste: [[2512.14078]] FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis(https://arxiv.org/abs/2512.14078)
Keywords: anomaly
Abstract: Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology, underpinning key tasks including classification, forecasting, and anomaly detection. Although deep learning models have achieved remarkable progress in these areas in recent years, constructing an efficient, multi-task compatible, and generalizable unified framework for time series analysis remains a significant challenge. Existing approaches are often tailored to single tasks or specific data types, making it difficult to simultaneously handle multi-task modeling and effectively integrate information across diverse time series types. Moreover, real-world data are often affected by noise, complex frequency components, and multi-scale dynamic patterns, which further complicate robust feature extraction and analysis. To ameliorate these challenges, we propose FusAD, a unified analysis framework designed for diverse time series tasks. FusAD features an adaptive time-frequency fusion mechanism, integrating both Fourier and Wavelet transforms to efficiently capture global-local and multi-scale dynamic features. With an adaptive denoising mechanism, FusAD automatically senses and filters various types of noise, highlighting crucial sequence variations and enabling robust feature extraction in complex environments. In addition, the framework integrates a general information fusion and decoding structure, combined with masked pre-training, to promote efficient learning and transfer of multi-granularity representations. Extensive experiments demonstrate that FusAD consistently outperforms state-of-the-art models on mainstream time series benchmarks for classification, forecasting, and anomaly detection tasks, while maintaining high efficiency and scalability. Code is available at this https URL.

Title: Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization

Authors: Boyuan Yao, Dingcheng Luo, Lianghao Cao, Nikola Kovachki, Thomas O'Leary-Roseberry, Omar Ghattas
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2512.14086
Pdf URL: https://arxiv.org/pdf/2512.14086
Copy Paste: [[2512.14086]] Derivative-Informed Fourier Neural Operator: Universal Approximation and Applications to PDE-Constrained Optimization(https://arxiv.org/abs/2512.14086)
Keywords: diffusion
Abstract: We present approximation theories and efficient training methods for derivative-informed Fourier neural operators (DIFNOs) with applications to PDE-constrained optimization. A DIFNO is an FNO trained by minimizing its prediction error jointly on output and Fréchet derivative samples of a high-fidelity operator (e.g., a parametric PDE solution operator). As a result, a DIFNO can closely emulate not only the high-fidelity operator's response but also its sensitivities. To motivate the use of DIFNOs instead of conventional FNOs as surrogate models, we show that accurate surrogate-driven PDE-constrained optimization requires accurate surrogate Fréchet derivatives. Then, for continuously differentiable operators, we establish (i) simultaneous universal approximation of FNOs and their Fréchet derivatives on compact sets, and (ii) universal approximation of FNOs in weighted Sobolev spaces with input measures that have unbounded supports. Our theoretical results certify the capability of FNOs for accurate derivative-informed operator learning and accurate solution of PDE-constrained optimization. Furthermore, we develop efficient training schemes using dimension reduction and multi-resolution techniques that significantly reduce memory and computational costs for Fréchet derivative learning. Numerical examples on nonlinear diffusion--reaction, Helmholtz, and Navier--Stokes equations demonstrate that DIFNOs are superior in sample complexity for operator learning and solving infinite-dimensional PDE-constrained inverse problems, achieving high accuracy at low training sample sizes.

Title: ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes

Authors: Felix Holm, Ghazal Ghazaei, Nassir Navab
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14092
Pdf URL: https://arxiv.org/pdf/2512.14092
Copy Paste: [[2512.14092]] ProtoFlow: Interpretable and Robust Surgical Workflow Modeling with Learned Dynamic Scene Graph Prototypes(https://arxiv.org/abs/2512.14092)
Keywords: self-supervised
Abstract: Purpose: Detailed surgical recognition is critical for advancing AI-assisted surgery, yet progress is hampered by high annotation costs, data scarcity, and a lack of interpretable models. While scene graphs offer a structured abstraction of surgical events, their full potential remains untapped. In this work, we introduce ProtoFlow, a novel framework that learns dynamic scene graph prototypes to model complex surgical workflows in an interpretable and robust manner. Methods: ProtoFlow leverages a graph neural network (GNN) encoder-decoder architecture that combines self-supervised pretraining for rich representation learning with a prototype-based fine-tuning stage. This process discovers and refines core prototypes that encapsulate recurring, clinically meaningful patterns of surgical interaction, forming an explainable foundation for workflow analysis. Results: We evaluate our approach on the fine-grained CAT-SG dataset. ProtoFlow not only outperforms standard GNN baselines in overall accuracy but also demonstrates exceptional robustness in limited-data, few-shot scenarios, maintaining strong performance when trained on as few as one surgical video. Our qualitative analyses further show that the learned prototypes successfully identify distinct surgical sub-techniques and provide clear, interpretable insights into workflow deviations and rare complications. Conclusion: By uniting robust representation learning with inherent explainability, ProtoFlow represents a significant step toward developing more transparent, reliable, and data-efficient AI systems, accelerating their potential for clinical adoption in surgical training, real-time decision support, and workflow optimization.

Title: AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Authors: Sisi Dai, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14095
Pdf URL: https://arxiv.org/pdf/2512.14095
Copy Paste: [[2512.14095]] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation(https://arxiv.org/abs/2512.14095)
Keywords: diffusion
Abstract: Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.

Title: OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Authors: Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14096
Pdf URL: https://arxiv.org/pdf/2512.14096
Copy Paste: [[2512.14096]] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration(https://arxiv.org/abs/2512.14096)
Keywords: diffusion
Abstract: Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.

Title: Cornserve: Efficiently Serving Any-to-Any Multimodal Models

Authors: Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, Mosharaf Chowdhury
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2512.14098
Pdf URL: https://arxiv.org/pdf/2512.14098
Copy Paste: [[2512.14098]] Cornserve: Efficiently Serving Any-to-Any Multimodal Models(https://arxiv.org/abs/2512.14098)
Keywords: diffusion
Abstract: We present Cornserve, an efficient online serving system for an emerging class of multimodal models called Any-to-Any models. Any-to-Any models accept combinations of text and multimodal data (e.g., image, video, audio) as input and also generate combinations of text and multimodal data as output, introducing request type, computation path, and computation scaling heterogeneity in model serving. Cornserve allows model developers to describe the computation graph of generic Any-to-Any models, which consists of heterogeneous components such as multimodal encoders, autoregressive models like Large Language Models (LLMs), and multimodal generators like Diffusion Transformers (DiTs). Given this, Cornserve's planner automatically finds an optimized deployment plan for the model, including whether and how to disaggregate the model into smaller components based on model and workload characteristics. Cornserve's distributed runtime then executes the model per the plan, efficiently handling Any-to-Any model heterogeneity during online serving. Evaluations show that Cornserve can efficiently serve diverse Any-to-Any models and workloads, delivering up to 3.81$\times$ throughput improvement and up to 5.79$\times$ tail latency reduction over existing solutions.

Title: ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14099
Pdf URL: https://arxiv.org/pdf/2512.14099
Copy Paste: [[2512.14099]] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models(https://arxiv.org/abs/2512.14099)
Keywords: diffusion
Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.

Title: Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries

Authors: Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.14102
Pdf URL: https://arxiv.org/pdf/2512.14102
Copy Paste: [[2512.14102]] Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries(https://arxiv.org/abs/2512.14102)
Keywords: foundation model
Abstract: Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.

Title: MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction

Authors: Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14114
Pdf URL: https://arxiv.org/pdf/2512.14114
Copy Paste: [[2512.14114]] MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction(https://arxiv.org/abs/2512.14114)
Keywords: generative
Abstract: Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at this https URL.

Title: FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation

Authors: Qingyuan Cai, Linxin Zhang, Xuecai Hu, Saihui Hou, Yongzhen Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14162
Pdf URL: https://arxiv.org/pdf/2512.14162
Copy Paste: [[2512.14162]] FastDDHPose: Towards Unified, Efficient, and Disentangled 3D Human Pose Estimation(https://arxiv.org/abs/2512.14162)
Keywords: diffusion
Abstract: Recent approaches for monocular 3D human pose estimation (3D HPE) have achieved leading performance by directly regressing 3D poses from 2D keypoint sequences. Despite the rapid progress in 3D HPE, existing methods are typically trained and evaluated under disparate frameworks, lacking a unified framework for fair comparison. To address these limitations, we propose Fast3DHPE, a modular framework that facilitates rapid reproduction and flexible development of new methods. By standardizing training and evaluation protocols, Fast3DHPE enables fair comparison across 3D human pose estimation methods while significantly improving training efficiency. Within this framework, we introduce FastDDHPose, a Disentangled Diffusion-based 3D Human Pose Estimation method which leverages the strong latent distribution modeling capability of diffusion models to explicitly model the distributions of bone length and bone direction while avoiding further amplification of hierarchical error accumulation. Moreover, we design an efficient Kinematic-Hierarchical Spatial and Temporal Denoiser that encourages the model to focus on kinematic joint hierarchies while avoiding unnecessary modeling of overly complex joint topologies. Extensive experiments on Human3.6M and MPI-INF-3DHP show that the Fast3DHPE framework enables fair comparison of all methods while significantly improving training efficiency. Within this unified framework, FastDDHPose achieves state-of-the-art performance with strong generalization and robustness in in-the-wild scenarios. The framework and models will be released at: this https URL

Title: Random-Bridges as Stochastic Transports for Generative Models

Authors: Stefano Goria, Levent A. Mengütürk, Murat C. Mengütürk, Berkan Sesen
Subjects: cs.LG, math.PR
Abstract URL: https://arxiv.org/abs/2512.14190
Pdf URL: https://arxiv.org/pdf/2512.14190
Copy Paste: [[2512.14190]] Random-Bridges as Stochastic Transports for Generative Models(https://arxiv.org/abs/2512.14190)
Keywords: generative
Abstract: This paper motivates the use of random-bridges -- stochastic processes conditioned to take target distributions at fixed timepoints -- in the realm of generative modelling. Herein, random-bridges can act as stochastic transports between two probability distributions when appropriately initialized, and can display either Markovian or non-Markovian, and either continuous, discontinuous or hybrid patterns depending on the driving process. We show how one can start from general probabilistic statements and then branch out into specific representations for learning and simulation algorithms in terms of information processing. Our empirical results, built on Gaussian random bridges, produce high-quality samples in significantly fewer steps compared to traditional approaches, while achieving competitive Frechet inception distance scores. Our analysis provides evidence that the proposed framework is computationally cheap and suitable for high-speed generation tasks.

Title: DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.14217
Pdf URL: https://arxiv.org/pdf/2512.14217
Copy Paste: [[2512.14217]] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos(https://arxiv.org/abs/2512.14217)
Keywords: diffusion
Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.

Title: OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Authors: Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14225
Pdf URL: https://arxiv.org/pdf/2512.14225
Copy Paste: [[2512.14225]] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving(https://arxiv.org/abs/2512.14225)
Keywords: diffusion, generative
Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OminiGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird\u2019s Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OminiGen achieves desired performances in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.

Title: 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Authors: Jimmie Kwok, Holger Caesar, Andras Palffy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14235
Pdf URL: https://arxiv.org/pdf/2512.14235
Copy Paste: [[2512.14235]] 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation(https://arxiv.org/abs/2512.14235)
Keywords: diffusion
Abstract: Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data of 4D-RaDiff as data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.

Title: Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding

Authors: Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14236
Pdf URL: https://arxiv.org/pdf/2512.14236
Copy Paste: [[2512.14236]] Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding(https://arxiv.org/abs/2512.14236)
Keywords: diffusion
Abstract: The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present Elastic3D, a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (more precisely, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion. Please check the project page for the video samples this https URL.

Title: Two CFG Nahuatl for automatic corpora expansion

Authors: Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Graham Ranger Martha-Lorena Avendaño-Garrido
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14239
Pdf URL: https://arxiv.org/pdf/2512.14239
Copy Paste: [[2512.14239]] Two CFG Nahuatl for automatic corpora expansion(https://arxiv.org/abs/2512.14239)
Keywords: generative
Abstract: The aim of this article is to introduce two Context-Free Grammars (CFG) for Nawatl Corpora expansion. Nawatl is an Amerindian language (it is a National Language of Mexico) of the $\pi$-language type, i.e. a language with few digital resources. For this reason the corpora available for the learning of Large Language Models (LLMs) are virtually non-existent, posing a significant challenge. The goal is to produce a substantial number of syntactically valid artificial Nawatl sentences and thereby to expand the corpora for the purpose of learning non contextual embeddings. For this objective, we introduce two new Nawatl CFGs and use them in generative mode. Using these grammars, it is possible to expand Nawatl corpus significantly and subsequently to use it to learn embeddings and to evaluate their relevance in a sentences semantic similarity task. The results show an improvement compared to the results obtained using only the original corpus without artificial expansion, and also demonstrate that economic embeddings often perform better than some LLMs.

Title: Physically consistent model learning for reaction-diffusion systems

Authors: Erion Morina, Martin Holler
Subjects: cs.LG, math.AP, math.OC
Abstract URL: https://arxiv.org/abs/2512.14240
Pdf URL: https://arxiv.org/pdf/2512.14240
Copy Paste: [[2512.14240]] Physically consistent model learning for reaction-diffusion systems(https://arxiv.org/abs/2512.14240)
Keywords: diffusion
Abstract: This paper addresses the problem of learning reaction-diffusion (RD) systems from data while ensuring physical consistency and well-posedness of the learned models. Building on a regularization-based framework for structured model learning, we focus on learning parameterized reaction terms and investigate how to incorporate key physical properties, such as mass conservation and quasipositivity, directly into the learning process. Our main contributions are twofold: First, we propose techniques to systematically modify a given class of parameterized reaction terms such that the resulting terms inherently satisfy mass conservation and quasipositivity, ensuring that the learned RD systems preserve non-negativity and adhere to physical principles. These modifications also guarantee well-posedness of the resulting PDEs under additional regularity and growth conditions. Second, we extend existing theoretical results on regularization-based model learning to RD systems using these physically consistent reaction terms. Specifically, we prove that solutions to the learning problem converge to a unique, regularization-minimizing solution of a limit system even when conservation laws and quasipositivity are enforced. In addition, we provide approximation results for quasipositive functions, essential for constructing physically consistent parameterizations. These results advance the development of interpretable and reliable data-driven models for RD systems that align with fundamental physical laws.

Title: Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning

Authors: Salvatore Romano, Marco Grassia, Giuseppe Mangioni
Subjects: cs.LG, cs.AI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2512.14241
Pdf URL: https://arxiv.org/pdf/2512.14241
Copy Paste: [[2512.14241]] Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning(https://arxiv.org/abs/2512.14241)
Keywords: diffusion, generative
Abstract: Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.

Title: FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting

Authors: Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14253
Pdf URL: https://arxiv.org/pdf/2512.14253
Copy Paste: [[2512.14253]] FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting(https://arxiv.org/abs/2512.14253)
Keywords: foundation model, generative
Abstract: In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. Through adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while keeping efficient, FLAME adopts a Normalization Flow based forecasting head, which can model the arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.

Title: SS4D: Native 4D Generative Model via Structured Spacetime Latents

Authors: Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14284
Pdf URL: https://arxiv.org/pdf/2512.14284
Copy Paste: [[2512.14284]] SS4D: Native 4D Generative Model via Structured Spacetime Latents(https://arxiv.org/abs/2512.14284)
Keywords: generative
Abstract: We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion

Title: PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition

Authors: Abdullah Al Mamun, Miaohua Zhang, David Ahmedt-Aristizabal, Zeeshan Hayder, Mohammad Awrangjeb
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14309
Pdf URL: https://arxiv.org/pdf/2512.14309
Copy Paste: [[2512.14309]] PSMamba: Progressive Self-supervised Vision Mamba for Plant Disease Recognition(https://arxiv.org/abs/2512.14309)
Keywords: self-supervised
Abstract: Self-supervised Learning (SSL) has become a powerful paradigm for representation learning without manual annotations. However, most existing frameworks focus on global alignment and struggle to capture the hierarchical, multi-scale lesion patterns characteristic of plant disease imagery. To address this gap, we propose PSMamba, a progressive self-supervised framework that integrates the efficient sequence modelling of Vision Mamba (VM) with a dual-student hierarchical distillation strategy. Unlike conventional single teacher-student designs, PSMamba employs a shared global teacher and two specialised students: one processes mid-scale views to capture lesion distributions and vein structures, while the other focuses on local views to capture fine-grained cues such as texture irregularities and early-stage lesions. This multi-granular supervision facilitates the joint learning of contextual and detailed representations, with consistency losses ensuring coherent cross-scale alignment. Experiments on three benchmark datasets show that PSMamba consistently outperforms state-of-the-art SSL methods, delivering superior accuracy and robustness in both domain-shifted and fine-grained scenarios.

Title: Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity

Authors: Shuai Dong, Jie Zhang, Guoying Zhao, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14320
Pdf URL: https://arxiv.org/pdf/2512.14320
Copy Paste: [[2512.14320]] Semantic Mismatch and Perceptual Degradation: A New Perspective on Image Editing Immunity(https://arxiv.org/abs/2512.14320)
Keywords: diffusion
Abstract: Text-guided image editing via diffusion models, while powerful, raises significant concerns about misuse, motivating efforts to immunize images against unauthorized edits using imperceptible perturbations. Prevailing metrics for evaluating immunization success typically rely on measuring the visual dissimilarity between the output generated from a protected image and a reference output generated from the unprotected original. This approach fundamentally overlooks the core requirement of image immunization, which is to disrupt semantic alignment with attacker intent, regardless of deviation from any specific output. We argue that immunization success should instead be defined by the edited output either semantically mismatching the prompt or suffering substantial perceptual degradations, both of which thwart malicious intent. To operationalize this principle, we propose Synergistic Intermediate Feature Manipulation (SIFM), a method that strategically perturbs intermediate diffusion features through dual synergistic objectives: (1) maximizing feature divergence from the original edit trajectory to disrupt semantic alignment with the expected edit, and (2) minimizing feature norms to induce perceptual degradations. Furthermore, we introduce the Immunization Success Rate (ISR), a novel metric designed to rigorously quantify true immunization efficacy for the first time. ISR quantifies the proportion of edits where immunization induces either semantic failure relative to the prompt or significant perceptual degradations, assessed via Multimodal Large Language Models (MLLMs). Extensive experiments show our SIFM achieves the state-of-the-art performance for safeguarding visual content against malicious diffusion-based manipulation.

Title: Dual Attention Guided Defense Against Malicious Edits

Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14333
Pdf URL: https://arxiv.org/pdf/2512.14333
Copy Paste: [[2512.14333]] Dual Attention Guided Defense Against Malicious Edits(https://arxiv.org/abs/2512.14333)
Keywords: diffusion
Abstract: Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguides the edit towards incorrect regions and preserves the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.

Title: Towards Transferable Defense Against Malicious Image Edits

Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14341
Pdf URL: https://arxiv.org/pdf/2512.14341
Copy Paste: [[2512.14341]] Towards Transferable Defense Against Malicious Image Edits(https://arxiv.org/abs/2512.14341)
Keywords: diffusion
Abstract: Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.

Title: RePo: Language Models with Context Re-Positioning

Authors: Huayang Li, Tianyu Zhao, Richard Sproat
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.14391
Pdf URL: https://arxiv.org/pdf/2512.14391
Copy Paste: [[2512.14391]] RePo: Language Models with Context Re-Positioning(https://arxiv.org/abs/2512.14391)
Keywords: in-context
Abstract: In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. Drawing on Cognitive Load Theory (CLT), we argue that this uninformative structure increases extraneous cognitive load, consuming finite working memory capacity that should be allocated to deep reasoning and attention allocation. To address this, we propose RePo, a novel mechanism that reduces extraneous load via context re-positioning. Unlike standard approaches, RePo utilizes a differentiable module, $f_\phi$, to assign token positions that capture contextual dependencies, rather than replying on pre-defined integer range. By continually pre-training on the OLMo-2 1B backbone, we demonstrate that RePo significantly enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Detailed analysis reveals that RePo successfully allocate higher attention to distant but relevant information, assign positions in dense and non-linear space, and capture the intrinsic structure of the input context. Our code is available at this https URL.

Title: LCMem: A Universal Model for Robust Image Memorization Detection

Authors: Mischa Dombrowski, Felix Nützel, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14421
Pdf URL: https://arxiv.org/pdf/2512.14421
Copy Paste: [[2512.14421]] LCMem: A Universal Model for Robust Image Memorization Detection(https://arxiv.org/abs/2512.14421)
Keywords: generative
Abstract: Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model is publicly available at this https URL.

Title: The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy

Authors: Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14423
Pdf URL: https://arxiv.org/pdf/2512.14423
Copy Paste: [[2512.14423]] The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy(https://arxiv.org/abs/2512.14423)
Keywords: diffusion
Abstract: Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or this http URL address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity this http URL adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.

Title: Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging

Authors: Chang Cai, Hao Jiang, Xiaojun Yuan, Ying-Jun Angela Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14435
Pdf URL: https://arxiv.org/pdf/2512.14435
Copy Paste: [[2512.14435]] Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging(https://arxiv.org/abs/2512.14435)
Keywords: generative
Abstract: Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.

Title: A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.14442
Pdf URL: https://arxiv.org/pdf/2512.14442
Copy Paste: [[2512.14442]] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning(https://arxiv.org/abs/2512.14442)
Keywords: foundation model, generative
Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.

Title: Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space

Authors: Xingfu Zhou, Pengfei Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14448
Pdf URL: https://arxiv.org/pdf/2512.14448
Copy Paste: [[2512.14448]] Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space(https://arxiv.org/abs/2512.14448)
Keywords: generative
Abstract: Large Language Model (LLM) agents relying on external retrieval are increasingly deployed in high-stakes environments. While existing adversarial attacks primarily focus on content falsification or instruction injection, we identify a novel, process-oriented attack surface: the agent's reasoning style. We propose Reasoning-Style Poisoning (RSP), a paradigm that manipulates how agents process information rather than what they process. We introduce Generative Style Injection (GSI), an attack method that rewrites retrieved documents into pathological tones--specifically "analysis paralysis" or "cognitive haste"--without altering underlying facts or using explicit triggers. To quantify these shifts, we develop the Reasoning Style Vector (RSV), a metric tracking Verification depth, Self-confidence, and Attention focus. Experiments on HotpotQA and FEVER using ReAct, Reflection, and Tree of Thoughts (ToT) architectures reveal that GSI significantly degrades performance. It increases reasoning steps by up to 4.4 times or induces premature errors, successfully bypassing state-of-the-art content filters. Finally, we propose RSP-M, a lightweight runtime monitor that calculates RSV metrics in real-time and triggers alerts when values exceed safety thresholds. Our work demonstrates that reasoning style is a distinct, exploitable vulnerability, necessitating process-level defenses beyond static content analysis.

Title: Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency

Authors: Jia Guo, Jiawei Du, Shengzhu Yang, Shuai Lu, Wenquan Cheng, Kaiwen Zhang, Yihua Sun, Chuhong Yang, Weihang Zhang, Fang Chen, Yilan Wu, Lie Ju, Guochen Ning, Longfei Ma, Huiping Yao, Jinyuan Wang, Peilun Shi, Yukun Zhou, Jie Xu, Pearse A. Keane, Hanruo Liu, Hongen Liao, Ningli Wang, Huiqi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14499
Pdf URL: https://arxiv.org/pdf/2512.14499
Copy Paste: [[2512.14499]] Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment Efficiency(https://arxiv.org/abs/2512.14499)
Keywords: foundation model
Abstract: Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVison enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.

Title: Linguists should learn to love speech-based deep learning models

Authors: Marianne de Heer Kloots, Paul Boersma, Willem Zuidema
Subjects: cs.CL, cs.SD, eess.AS, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.14506
Pdf URL: https://arxiv.org/pdf/2512.14506
Copy Paste: [[2512.14506]] Linguists should learn to love speech-based deep learning models(https://arxiv.org/abs/2512.14506)
Keywords: generative
Abstract: Futrell and Mahowald present a useful framework bridging technology-oriented deep learning systems and explanation-oriented linguistic theories. Unfortunately, the target article's focus on generative text-based LLMs fundamentally limits fruitful interactions with linguistics, as many interesting questions on human language fall outside what is captured by written text. We argue that audio-based deep learning models can and should play a crucial role.

Title: Improving Slow Transfer Predictions: Generative Methods Compared

Authors: Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2512.14522
Pdf URL: https://arxiv.org/pdf/2512.14522
Copy Paste: [[2512.14522]] Improving Slow Transfer Predictions: Generative Methods Compared(https://arxiv.org/abs/2512.14522)
Keywords: generative
Abstract: Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is the issue of class imbalance. This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. In this study, we analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. Additionally, we adjust the class imbalance ratios in training datasets to evaluate their impact on model performance. While augmentation may improve performance, as the imbalance ratio increases, the performance does not significantly improve. We conclude that even the most advanced technique, such as CTGAN, does not significantly improve over simple stratified sampling.

Title: DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors

Authors: Yiheng Huang, Junhong Chen, Anqi Ning, Zhanhong Liang, Nick Michiels, Luc Claesen, Wenyin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14536
Pdf URL: https://arxiv.org/pdf/2512.14536
Copy Paste: [[2512.14536]] DASP: Self-supervised Nighttime Monocular Depth Estimation with Domain Adaptation of Spatiotemporal Priors(https://arxiv.org/abs/2512.14536)
Keywords: self-supervised
Abstract: Self-supervised monocular depth estimation has achieved notable success under daytime conditions. However, its performance deteriorates markedly at night due to low visibility and varying illumination, e.g., insufficient light causes textureless areas, and moving objects bring blurry regions. To this end, we propose a self-supervised framework named DASP that leverages spatiotemporal priors for nighttime depth estimation. Specifically, DASP consists of an adversarial branch for extracting spatiotemporal priors and a self-supervised branch for learning. In the adversarial branch, we first design an adversarial network where the discriminator is composed of four devised spatiotemporal priors learning blocks (SPLB) to exploit the daytime priors. In particular, the SPLB contains a spatial-based temporal learning module (STLM) that uses orthogonal differencing to extract motion-related variations along the time axis and an axial spatial learning module (ASLM) that adopts local asymmetric convolutions with global axial attention to capture the multiscale structural information. By combining STLM and ASLM, our model can acquire sufficient spatiotemporal features to restore textureless areas and estimate the blurry regions caused by dynamic objects. In the self-supervised branch, we propose a 3D consistency projection loss to bilaterally project the target frame and source frame into a shared 3D space, and calculate the 3D discrepancy between the two projected frames as a loss to optimize the 3D structural consistency and daytime priors. Extensive experiments on the Oxford RobotCar and nuScenes datasets demonstrate that our approach achieves state-of-the-art performance for nighttime depth estimation. Ablation studies further validate the effectiveness of each component.

Title: Synthetic Electrogram Generation with Variational Autoencoders for ECGI

Authors: Miriam Gutiérrez Fernández, Karen López-Linares, Carlos Fambuena Santos, María S. Guillem, Andreu M. Climent, Óscar Barquero Pérez
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.14537
Pdf URL: https://arxiv.org/pdf/2512.14537
Copy Paste: [[2512.14537]] Synthetic Electrogram Generation with Variational Autoencoders for ECGI(https://arxiv.org/abs/2512.14537)
Keywords: generative
Abstract: Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.

Title: HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Authors: Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14542
Pdf URL: https://arxiv.org/pdf/2512.14542
Copy Paste: [[2512.14542]] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion(https://arxiv.org/abs/2512.14542)
Keywords: diffusion
Abstract: Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.

Title: Dual Language Models: Balancing Training Efficiency and Overfitting Resilience

Authors: David Samuel, Lucas Georges Gabriel Charpentier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14549
Pdf URL: https://arxiv.org/pdf/2512.14549
Copy Paste: [[2512.14549]] Dual Language Models: Balancing Training Efficiency and Overfitting Resilience(https://arxiv.org/abs/2512.14549)
Keywords: diffusion
Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal ratio between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal ratio is similar whether targeting autoregressive or masked-diffusion downstream performance.

Title: Polypersona: Persona-Grounded LLM for Synthetic Survey Responses

Authors: Tejaswani Dash, Dinesh Karri, Anudeep Vurity, Gautam Datla, Tazeem Ahmad, Saima Rafi, Rohith Tangudu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14562
Pdf URL: https://arxiv.org/pdf/2512.14562
Copy Paste: [[2512.14562]] Polypersona: Persona-Grounded LLM for Synthetic Survey Responses(https://arxiv.org/abs/2512.14562)
Keywords: generative
Abstract: This paper introduces PolyPersona, a generative framework for synthesizing persona-conditioned survey responses across multiple domains. The framework instruction-tunes compact chat models using parameter-efficient LoRA adapters with 4-bit quantization under a resource-adaptive training setup. A dialogue-based data pipeline explicitly preserves persona cues, ensuring consistent behavioral alignment across generated responses. Using this pipeline, we construct a dataset of 3,568 synthetic survey responses spanning ten domains and 433 distinct personas, enabling controlled instruction tuning and systematic multi-domain evaluation. We evaluate the generated responses using a multi-metric evaluation suite that combines standard text generation metrics, including BLEU, ROUGE, and BERTScore, with survey-specific metrics designed to assess structural coherence, stylistic consistency, and sentiment this http URL results show that compact models such as TinyLlama 1.1B and Phi-2 achieve performance comparable to larger 7B to 8B baselines, with a highest BLEU score of 0.090 and ROUGE-1 of 0.429. These findings demonstrate that persona-conditioned fine-tuning enables small language models to generate reliable and coherent synthetic survey data. The proposed framework provides an efficient and reproducible approach for survey data generation, supporting scalable evaluation while facilitating bias analysis through transparent and open protocols.

Title: WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.14614
Pdf URL: https://arxiv.org/pdf/2512.14614
Copy Paste: [[2512.14614]] WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling(https://arxiv.org/abs/2512.14614)
Keywords: diffusion
Abstract: This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware model. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. Project page and online demo can be found: this https URL and this https URL.

Title: Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets

Authors: Omid Khormali
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14615
Pdf URL: https://arxiv.org/pdf/2512.14615
Copy Paste: [[2512.14615]] Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets(https://arxiv.org/abs/2512.14615)
Keywords: anomaly
Abstract: We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.

Title: JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.14620
Pdf URL: https://arxiv.org/pdf/2512.14620
Copy Paste: [[2512.14620]] JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction(https://arxiv.org/abs/2512.14620)
Keywords: generative
Abstract: This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

Title: A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images

Authors: Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher, Rainer Spang, Ralf Huss, Johannes Raffler, Sarah Reinke, Wolfram Klapper, Katja Steiger, Kristina Schwamborn, Carsten Marr
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14640
Pdf URL: https://arxiv.org/pdf/2512.14640
Copy Paste: [[2512.14640]] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images(https://arxiv.org/abs/2512.14640)
Keywords: foundation model
Abstract: Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.

Title: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.14681
Pdf URL: https://arxiv.org/pdf/2512.14681
Copy Paste: [[2512.14681]] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing(https://arxiv.org/abs/2512.14681)
Keywords: diffusion
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at this https URL.

Title: MMGR: Multi-Modal Generative Reasoning

Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.14691
Pdf URL: https://arxiv.org/pdf/2512.14691
Copy Paste: [[2512.14691]] MMGR: Multi-Modal Generative Reasoning(https://arxiv.org/abs/2512.14691)
Keywords: foundation model, generative
Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.

Title: Native and Compact Structured Latents for 3D Generation

Authors: Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14692
Pdf URL: https://arxiv.org/pdf/2512.14692
Copy Paste: [[2512.14692]] Native and Compact Structured Latents for 3D Generation(https://arxiv.org/abs/2512.14692)
Keywords: generative
Abstract: Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.