2025-12-30

Title: Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.22150
Pdf URL: https://arxiv.org/pdf/2512.22150
Copy Paste: [[2512.22150]] Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders(https://arxiv.org/abs/2512.22150)
Keywords: generative
Abstract: Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.

Title: Characterizing Motion Encoding in Video Diffusion Timesteps

Authors: Vatsal Baherwani, Yixuan Ren, Abhinav Shrivastava
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2512.22175
Pdf URL: https://arxiv.org/pdf/2512.22175
Copy Paste: [[2512.22175]] Characterizing Motion Encoding in Video Diffusion Timesteps(https://arxiv.org/abs/2512.22175)
Keywords: diffusion
Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can serve as ready integration into existing motion transfer and editing methods.

Title: Wireless Traffic Prediction with Large Language Model

Authors: Chuanting Zhang, Haixia Zhang, Jingping Qiao, Zongzhang Li, Mohamed-Slim Alouini
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22178
Pdf URL: https://arxiv.org/pdf/2512.22178
Copy Paste: [[2512.22178]] Wireless Traffic Prediction with Large Language Model(https://arxiv.org/abs/2512.22178)
Keywords: foundation model
Abstract: The growing demand for intelligent, adaptive resource management in next-generation wireless networks has underscored the importance of accurate and scalable wireless traffic prediction. While recent advancements in deep learning and foundation models such as large language models (LLMs) have demonstrated promising forecasting capabilities, they largely overlook the spatial dependencies inherent in city-scale traffic dynamics. In this paper, we propose TIDES (Traffic Intelligence with DeepSeek-Enhanced Spatial-temporal prediction), a novel LLM-based framework that captures spatial-temporal correlations for urban wireless traffic prediction. TIDES first identifies heterogeneous traffic patterns across regions through a clustering mechanism and trains personalized models for each region to balance generalization and specialization. To bridge the domain gap between numerical traffic data and language-based models, we introduce a prompt engineering scheme that embeds statistical traffic features as structured inputs. Furthermore, we design a DeepSeek module that enables spatial alignment via cross-domain attention, allowing the LLM to leverage information from spatially related regions. By fine-tuning only lightweight components while freezing core LLM layers, TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Extensive experiments on real-world cellular traffic datasets demonstrate that TIDES significantly outperforms state-of-the-art baselines in both prediction accuracy and robustness. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.

Title: Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection

Authors: Rajeeb Thapa Chhetri, Zhixiong Chen, Saurab Thapa
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.22179
Pdf URL: https://arxiv.org/pdf/2512.22179
Copy Paste: [[2512.22179]] Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection(https://arxiv.org/abs/2512.22179)
Keywords: anomaly
Abstract: A fundamental limitation of supervised deep learning in high-dimensional tabular domains is "Generalization Collapse": models learn precise decision boundaries for known distributions but fail catastrophically when facing Out-of-Distribution (OOD) data. We hypothesize that this failure stems from the lack of topological constraints in the latent space, resulting in diffuse manifolds where novel anomalies remain statistically indistinguishable from benign data. To address this, we propose Latent Sculpting, a hierarchical two-stage representation learning framework. Stage 1 utilizes a hybrid 1D-CNN and Transformer Encoder trained with a novel Dual-Centroid Compactness Loss (DCCL) to actively "sculpt" benign traffic into a low-entropy, hyperspherical cluster. Unlike standard contrastive losses that rely on triplet mining, DCCL optimizes global cluster centroids to enforce absolute manifold density. Stage 2 conditions a Masked Autoregressive Flow (MAF) on this pre-structured manifold to learn an exact density estimate. We evaluate this methodology on the rigorous CIC-IDS-2017 benchmark, treating it as a proxy for complex, non-stationary data streams. Empirical results demonstrate that explicit manifold sculpting is a prerequisite for robust zero-shot generalization. While supervised baselines suffered catastrophic performance collapse on unseen distribution shifts (F1 approx 0.30) and the strongest unsupervised baseline achieved only 0.76, our framework achieved an F1-Score of 0.87 on strictly zero-shot anomalies. Notably, we report an 88.89% detection rate on "Infiltration" scenarios--a complex distributional shift where state-of-the-art supervised models achieved 0.00% accuracy. These findings suggest that decoupling structure learning from density estimation provides a scalable path toward generalized anomaly detection.

Title: DiRL: An Efficient Post-Training Framework for Diffusion Language Models

Authors: Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyanag He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22234
Pdf URL: https://arxiv.org/pdf/2512.22234
Copy Paste: [[2512.22234]] DiRL: An Efficient Post-Training Framework for Diffusion Language Models(https://arxiv.org/abs/2512.22234)
Keywords: diffusion
Abstract: Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.

Title: Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction

Authors: Mengxiao Geng, Ran Hong, Xiaoling Xu, Bingxuan Li, Qiegen Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22237
Pdf URL: https://arxiv.org/pdf/2512.22237
Copy Paste: [[2512.22237]] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction(https://arxiv.org/abs/2512.22237)
Keywords: diffusion
Abstract: Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.

Title: Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs

Authors: Pascal Passigan, Kevin zhu, Angelina Ning
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22251
Pdf URL: https://arxiv.org/pdf/2512.22251
Copy Paste: [[2512.22251]] Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs(https://arxiv.org/abs/2512.22251)
Keywords: foundation model
Abstract: Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of over 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) -- which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression -- under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.

Title: Graph Attention-based Adaptive Transfer Learning for Link Prediction

Authors: Huashen Lu, Wensheng Gan, Guoting Chen, Zhichao Huang, Philip S. Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22252
Pdf URL: https://arxiv.org/pdf/2512.22252
Copy Paste: [[2512.22252]] Graph Attention-based Adaptive Transfer Learning for Link Prediction(https://arxiv.org/abs/2512.22252)
Keywords: self-supervised
Abstract: Graph neural networks (GNNs) have brought revolutionary advancements to the field of link prediction (LP), providing powerful tools for mining potential relationships in graphs. However, existing methods face challenges when dealing with large-scale sparse graphs and the need for a high degree of alignment between different datasets in transfer learning. Besides, although self-supervised methods have achieved remarkable success in many graph tasks, prior research has overlooked the potential of transfer learning to generalize across different graph datasets. To address these limitations, we propose a novel Graph Attention Adaptive Transfer Network (GAATNet). It combines the advantages of pre-training and fine-tuning to capture global node embedding information across datasets of different scales, ensuring efficient knowledge transfer and improved LP performance. To enhance the model's generalization ability and accelerate training, we design two key strategies: 1) Incorporate distant neighbor embeddings as biases in the self-attention module to capture global features. 2) Introduce a lightweight self-adapter module during fine-tuning to improve training efficiency. Comprehensive experiments on seven public datasets demonstrate that GAATNet achieves state-of-the-art performance in LP tasks. This study provides a general and scalable solution for LP tasks to effectively integrate GNNs with transfer learning. The source code and datasets are publicly available at this https URL

Title: Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Authors: Antara Titikhsha, Om Kulkarni, Dharun Muthaiah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22272
Pdf URL: https://arxiv.org/pdf/2512.22272
Copy Paste: [[2512.22272]] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models(https://arxiv.org/abs/2512.22272)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-{\Sigma}. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.

Title: The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Authors: Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22275
Pdf URL: https://arxiv.org/pdf/2512.22275
Copy Paste: [[2512.22275]] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency(https://arxiv.org/abs/2512.22275)
Keywords: foundation model
Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.

Title: Hierarchical Stacking Optimization Using Dirichlet's Process (SoDip): Towards Accelerated Design for Graft Polymerization

Authors: Amgad Ahmed Ali Ibrahim, Hein Htet, Ryoji Asahi
Subjects: cs.LG, cs.CE, physics.app-ph
Abstract URL: https://arxiv.org/abs/2512.22279
Pdf URL: https://arxiv.org/pdf/2512.22279
Copy Paste: [[2512.22279]] Hierarchical Stacking Optimization Using Dirichlet's Process (SoDip): Towards Accelerated Design for Graft Polymerization(https://arxiv.org/abs/2512.22279)
Keywords: diffusion
Abstract: Radiation-induced grafting (RIG) enables precise functionalization of polymer films for ion-exchange membranes, CO2-separation membranes, and battery electrolytes by generating radicals on robust substrates to graft desired monomers. However, reproducibility remains limited due to unreported variability in base-film morphology (crystallinity, grain orientation, free volume), which governs monomer diffusion, radical distribution, and the Trommsdorff effect, leading to spatial graft gradients and performance inconsistencies. We present a hierarchical stacking optimization framework with a Dirichlet's Process (SoDip), a hierarchical data-driven framework integrating: (1) a decoder-only Transformer (DeepSeek-R1) to encode textual process descriptors (irradiation source, grafting type, substrate manufacturer); (2) TabNet and XGBoost for modelling multimodal feature interactions; (3) Gaussian Process Regression (GPR) with Dirichlet Process Mixture Models (DPMM) for uncertainty quantification and heteroscedasticity; and (4) Bayesian Optimization for efficient exploration of high-dimensional synthesis space. A diverse dataset was curated using ChemDataExtractor 2.0 and WebPlotDigitizer, incorporating numerical and textual variables across hundreds of RIG studies. In cross-validation, SoDip achieved ~33% improvement over GPR while providing calibrated confidence intervals that identify low-reproducibility regimes. Its stacked architecture integrates sparse textual and numerical inputs of varying quality, outperforming prior models and establishing a foundation for reproducible, morphology-aware design in graft polymerization research.

Title: Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

Authors: Zikun Guoa, Adeyinka.P. Adedigbaa, Rammohan Mallipeddi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22287
Pdf URL: https://arxiv.org/pdf/2512.22287
Copy Paste: [[2512.22287]] Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation(https://arxiv.org/abs/2512.22287)
Keywords: generative
Abstract: Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.

Title: Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model

Authors: Renping Zhou, Zanlin Ni, Tianyi Chen, Zeyu Liu, Yang Yue, Yulin Wang, Yuxuan Wang, Jingshu Liu, Gao Huang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22288
Pdf URL: https://arxiv.org/pdf/2512.22288
Copy Paste: [[2512.22288]] Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model(https://arxiv.org/abs/2512.22288)
Keywords: diffusion
Abstract: Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks-ImageReward, HPS, GenEval, and DPG-Bench-demonstrate the effectiveness of our approach. For more details, please refer to our project page: this https URL .

Title: Multi-Head Spectral-Adaptive Graph Anomaly Detection

Authors: Qingyue Cao, Bo Jin, Changwei Gong, Xin Tong, Wenzheng Li, Xiaodong Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22291
Pdf URL: https://arxiv.org/pdf/2512.22291
Copy Paste: [[2512.22291]] Multi-Head Spectral-Adaptive Graph Anomaly Detection(https://arxiv.org/abs/2512.22291)
Keywords: anomaly
Abstract: Graph anomaly detection technology has broad applications in financial fraud and risk control. However, existing graph anomaly detection methods often face significant challenges when dealing with complex and variable abnormal patterns, as anomalous nodes are often disguised and mixed with normal nodes, leading to the coexistence of homophily and heterophily in the graph domain. Recent spectral graph neural networks have made notable progress in addressing this issue; however, current techniques typically employ fixed, globally shared filters. This 'one-size-fits-all' approach can easily cause over-smoothing, erasing critical high-frequency signals needed for fraud detection, and lacks adaptive capabilities for different graph instances. To solve this problem, we propose a Multi-Head Spectral-Adaptive Graph Neural Network (MHSA-GNN). The core innovation is the design of a lightweight hypernetwork that, conditioned on a 'spectral fingerprint' containing structural statistics and Rayleigh quotient features, dynamically generates Chebyshev filter parameters tailored to each instance. This enables a customized filtering strategy for each node and its local subgraph. Additionally, to prevent mode collapse in the multi-head mechanism, we introduce a novel dual regularization strategy that combines teacher-student contrastive learning (TSC) to ensure representation accuracy and Barlow Twins diversity loss (BTD) to enforce orthogonality among heads. Extensive experiments on four real-world datasets demonstrate that our method effectively preserves high-frequency abnormal signals and significantly outperforms existing state-of-the-art methods, especially showing excellent robustness on highly heterogeneous datasets.

Title: LLA: Enhancing Security and Privacy for Generative Models with Logic-Locked Accelerators

Authors: You Li, Guannan Zhao, Yuhao Ju, Yunqi He, Jie Gu, Hai Zhou
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22307
Pdf URL: https://arxiv.org/pdf/2512.22307
Copy Paste: [[2512.22307]] LLA: Enhancing Security and Privacy for Generative Models with Logic-Locked Accelerators(https://arxiv.org/abs/2512.22307)
Keywords: generative
Abstract: We introduce LLA, an effective intellectual property (IP) protection scheme for generative AI models. LLA leverages the synergy between hardware and software to defend against various supply chain threats, including model theft, model corruption, and information leakage. On the software side, it embeds key bits into neurons that can trigger outliers to degrade performance and applies invariance transformations to obscure the key values. On the hardware side, it integrates a lightweight locking module into the AI accelerator while maintaining compatibility with various dataflow patterns and toolchains. An accelerator with a pre-stored secret key acts as a license to access the model services provided by the IP owner. The evaluation results show that LLA can withstand a broad range of oracle-guided key optimization attacks, while incurring a minimal computational overhead of less than 0.1% for 7,168 key bits.

Title: LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Authors: Xudong Ling, Tianxi Huang, Qian Dong, Tao He, Chaorong Li, Guiduo Duan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22317
Pdf URL: https://arxiv.org/pdf/2512.22317
Copy Paste: [[2512.22317]] LangPrecip: Language-Aware Multimodal Precipitation Nowcasting(https://arxiv.org/abs/2512.22317)
Keywords: generative
Abstract: Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent this http URL further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.

Title: SpotEdit: Selective Region Editing in Diffusion Transformers

Authors: Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22323
Pdf URL: https://arxiv.org/pdf/2512.22323
Copy Paste: [[2512.22323]] SpotEdit: Selective Region Editing in Diffusion Transformers(https://arxiv.org/abs/2512.22323)
Keywords: diffusion
Abstract: Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.

Title: DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22324
Pdf URL: https://arxiv.org/pdf/2512.22324
Copy Paste: [[2512.22324]] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models(https://arxiv.org/abs/2512.22324)
Keywords: diffusion, self-supervised
Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

Title: Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Authors: Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22374
Pdf URL: https://arxiv.org/pdf/2512.22374
Copy Paste: [[2512.22374]] Self-Evaluation Unlocks Any-Step Text-to-Image Generation(https://arxiv.org/abs/2512.22374)
Keywords: diffusion
Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

Title: The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

Authors: Zubaida Mohammed Albadani, Mohammed Q. Shormani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22376
Pdf URL: https://arxiv.org/pdf/2512.22376
Copy Paste: [[2512.22376]] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach(https://arxiv.org/abs/2512.22376)
Keywords: generative
Abstract: This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning 'I said,' introduces embedded declarative interrogative, and imperative clauses, often eithout complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions a clause-embedding predicate sec;ecting a dull CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, for illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes raising theoretical questions concerning extending the analysis to the addressee-clause kil-k 'you said'. It also provides insights into the possibility of the universality of minimalism.

Title: DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

Authors: Hansang Lee, Chaelin Lee, Nieun Seo, Joon Seok Lim, Helen Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22406
Pdf URL: https://arxiv.org/pdf/2512.22406
Copy Paste: [[2512.22406]] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization(https://arxiv.org/abs/2512.22406)
Keywords: diffusion, generative
Abstract: We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn's Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32\% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet's maximum converged performance ($31.03\% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.

Title: Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy

Authors: Amil Khan, Matheus Palhares Viana, Suraj Mishra, B.S. Manjunath
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22423
Pdf URL: https://arxiv.org/pdf/2512.22423
Copy Paste: [[2512.22423]] Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy(https://arxiv.org/abs/2512.22423)
Keywords: foundation model
Abstract: Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4 billion parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone--without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.

Title: SAM 3D for 3D Object Reconstruction from Remote Sensing Images

Authors: Junsheng Yao, Lichao Mou, Qingyu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22452
Pdf URL: https://arxiv.org/pdf/2512.22452
Copy Paste: [[2512.22452]] SAM 3D for 3D Object Reconstruction from Remote Sensing Images(https://arxiv.org/abs/2512.22452)
Keywords: foundation model
Abstract: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.

Title: Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Authors: Sukhyun Jeong, Yong-Hoon Choi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.22464
Pdf URL: https://arxiv.org/pdf/2512.22464
Copy Paste: [[2512.22464]] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing(https://arxiv.org/abs/2512.22464)
Keywords: diffusion
Abstract: Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.

Title: Tracking by Predicting 3-D Gaussians Over Time

Authors: Tanish Baranwal, Himanshu Gaurav Singh, Jathushan Rajasegaran, Jitendra Malik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22489
Pdf URL: https://arxiv.org/pdf/2512.22489
Copy Paste: [[2512.22489]] Tracking by Predicting 3-D Gaussians Over Time(https://arxiv.org/abs/2512.22489)
Keywords: self-supervised
Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at this https URL and this https URL.

Title: Decomposing Task Vectors for Refined Model Editing

Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22511
Pdf URL: https://arxiv.org/pdf/2512.22511
Copy Paste: [[2512.22511]] Decomposing Task Vectors for Refined Model Editing(https://arxiv.org/abs/2512.22511)
Keywords: diffusion
Abstract: Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.

Title: Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification

Authors: Guikun Xu, Xiaohan Yi, Peilin Zhao, Yatao Bian
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2512.22597
Pdf URL: https://arxiv.org/pdf/2512.22597
Copy Paste: [[2512.22597]] Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification(https://arxiv.org/abs/2512.22597)
Keywords: generative
Abstract: Generating low-energy conformer ensembles and identifying ground-state conformations from molecular graphs remain computationally demanding with physics-based pipelines. Current learning-based approaches often suffer from a fragmented paradigm: generative models capture diversity but lack reliable energy calibration, whereas deterministic predictors target a single structure and fail to represent ensemble variability. Here we present EnFlow, a unified framework that couples flow matching (FM) with an explicitly learned energy model through an energy-guided sampling scheme defined along a non-Gaussian FM path. By incorporating energy-gradient guidance during sampling, our method steers trajectories toward lower-energy regions, substantially improving conformational fidelity, particularly in the few-step regime. The learned energy function further enables efficient energy-based ranking of generated ensembles for accurate ground-state identification. Extensive experiments on GEOM-QM9 and GEOM-Drugs demonstrate that EnFlow simultaneously improves generation metrics with 1--2 ODE-steps and reduces ground-state prediction errors compared with state-of-the-art methods.

Title: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Authors: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.22615
Pdf URL: https://arxiv.org/pdf/2512.22615
Copy Paste: [[2512.22615]] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone(https://arxiv.org/abs/2512.22615)
Keywords: diffusion
Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $\pi_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

Title: Rethinking Memory Design in SAM-Based Visual Object Tracking

Authors: Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi, Naoufel Werghi, Sajid Javed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22624
Pdf URL: https://arxiv.org/pdf/2512.22624
Copy Paste: [[2512.22624]] Rethinking Memory Design in SAM-Based Visual Object Tracking(https://arxiv.org/abs/2512.22624)
Keywords: foundation model
Abstract: \noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}

Title: Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Authors: Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang, Angtian Wang, Bo Liu, Nathaniel S. Dennler, Zhengfei Kuang, Hao Li, Gordon Wetzstein, Chongyang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22626
Pdf URL: https://arxiv.org/pdf/2512.22626
Copy Paste: [[2512.22626]] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion(https://arxiv.org/abs/2512.22626)
Keywords: diffusion
Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.

Title: On the Role of Discreteness in Diffusion LLMs

Authors: Ziqi Jin, Bin Wang, Xiang Lin, Lidong Bing, Aixin Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22630
Pdf URL: https://arxiv.org/pdf/2512.22630
Copy Paste: [[2512.22630]] On the Role of Discreteness in Diffusion LLMs(https://arxiv.org/abs/2512.22630)
Keywords: diffusion
Abstract: Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.

Title: Visual Autoregressive Modelling for Monocular Depth Estimation

Authors: Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22653
Pdf URL: https://arxiv.org/pdf/2512.22653
Copy Paste: [[2512.22653]] Visual Autoregressive Modelling for Monocular Depth Estimation(https://arxiv.org/abs/2512.22653)
Keywords: diffusion, generative
Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "this https URL.

Title: Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations

Authors: Achraf Hsain, Fouad Mohammed Abbou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22672
Pdf URL: https://arxiv.org/pdf/2512.22672
Copy Paste: [[2512.22672]] Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations(https://arxiv.org/abs/2512.22672)
Keywords: generative
Abstract: This paper presents the first application of quantum generative models to learned latent space representations of computational fluid dynamics (CFD) data. While recent work has explored quantum models for learning statistical properties of fluid systems, the combination of discrete latent space compression with quantum generative sampling for CFD remains unexplored. We develop a GPU-accelerated Lattice Boltzmann Method (LBM) simulator to generate fluid vorticity fields, which are compressed into a discrete 7-dimensional latent space using a Vector Quantized Variational Autoencoder (VQ-VAE). The central contribution is a comparative analysis of quantum and classical generative approaches for modeling this physics-derived latent distribution: we evaluate a Quantum Circuit Born Machine (QCBM) and Quantum Generative Adversarial Network (QGAN) against a classical Long Short-Term Memory (LSTM) baseline. Under our experimental conditions, both quantum models produced samples with lower average minimum distances to the true distribution compared to the LSTM, with the QCBM achieving the most favorable metrics. This work provides: (1)~a complete open-source pipeline bridging CFD simulation and quantum machine learning, (2)~the first empirical study of quantum generative modeling on compressed latent representations of physics simulations, and (3)~a foundation for future rigorous investigation at this intersection.

Title: CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

Authors: ZhenQi Chen, TsaiChing Ni, YuanFu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22681
Pdf URL: https://arxiv.org/pdf/2512.22681
Copy Paste: [[2512.22681]] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation(https://arxiv.org/abs/2512.22681)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.

Title: SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

Authors: Paul Dobre, Jackson Cooper, Xin Wang, Hongzhou Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22706
Pdf URL: https://arxiv.org/pdf/2512.22706
Copy Paste: [[2512.22706]] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis(https://arxiv.org/abs/2512.22706)
Keywords: diffusion
Abstract: 3D Asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, it still struggles to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrate the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.

Title: Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

Authors: Youssef Megahed, Robin Ducharme, Inok Lee, Inbal Willner, Olivier X. Miguel, Kevin Dick, Adrian D. C. Chan, Mark Walker, Steven Hawken
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.22730
Pdf URL: https://arxiv.org/pdf/2512.22730
Copy Paste: [[2512.22730]] Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning(https://arxiv.org/abs/2512.22730)
Keywords: self-supervised, foundation model
Abstract: Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).

Title: WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Authors: Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.22737
Pdf URL: https://arxiv.org/pdf/2512.22737
Copy Paste: [[2512.22737]] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference(https://arxiv.org/abs/2512.22737)
Keywords: diffusion
Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.

Title: GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks

Authors: Xuyan Li, Jie Wang, Zheng Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22772
Pdf URL: https://arxiv.org/pdf/2512.22772
Copy Paste: [[2512.22772]] GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks(https://arxiv.org/abs/2512.22772)
Keywords: generative
Abstract: Dynamic graphs are widely used to represent evolving real-world networks. Temporal Graph Neural Networks (TGNNs) have emerged as a powerful tool for processing such graphs, but the lack of transparency and explainability limits their practical adoption. Research on TGNN explainability is still in its early stages and faces several key issues: (i) Current methods are tailored to specific TGNN types, restricting generality. (ii) They suffer from high computational costs, making them unsuitable for large-scale networks. (iii) They often overlook the structural connectivity of explanations and require prior knowledge, reducing user-friendliness. To address these issues, we propose GRExplainer, the first universal, efficient, and user-friendly explanation method for TGNNs. GRExplainer extracts node sequences as a unified feature representation, making it independent of specific input formats and thus applicable to both snapshot-based and event-based TGNNs (the major types of TGNNs). By utilizing breadth-first search and temporal information to construct input node sequences, GRExplainer reduces redundant computation and improves efficiency. To enhance user-friendliness, we design a generative model based on Recurrent Neural Networks (RNNs), enabling automated and continuous explanation generation. Experiments on six real-world datasets with three target TGNNs show that GRExplainer outperforms existing baseline methods in generality, efficiency, and user-friendliness.

Title: Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

Authors: Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22796
Pdf URL: https://arxiv.org/pdf/2512.22796
Copy Paste: [[2512.22796]] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization(https://arxiv.org/abs/2512.22796)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.

Title: ReDiF: Reinforced Distillation for Few Step Diffusion

Authors: Amirhossein Tighkhorshid, Zahra Dehghanian, Gholamali Aminian, Chengchun Shi, Hamid R. Rabiee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22802
Pdf URL: https://arxiv.org/pdf/2512.22802
Copy Paste: [[2512.22802]] ReDiF: Reinforced Distillation for Few Step Diffusion(https://arxiv.org/abs/2512.22802)
Keywords: diffusion, generative
Abstract: Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement learning based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher's outputs. This RL driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model agnostic, applicable to any type of diffusion models with suitable reward functions, providing a general optimization paradigm for efficient diffusion learning.

Title: EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Authors: Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu, Xiaoxiao Long, Wenjia Wang, Jingbo Wang, Yuan Liu, Wenping Wang, Daquan Zhou, Taku Komura, Zhiyang Dou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22808
Pdf URL: https://arxiv.org/pdf/2512.22808
Copy Paste: [[2512.22808]] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation(https://arxiv.org/abs/2512.22808)
Keywords: generative
Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.

Title: ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Authors: Bangya Liu, Xinyu Gong, Zelin Zhao, Ziyang Song, Yulei Lu, Suhui Wu, Jun Zhang, Suman Banerjee, Hao Zhang
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22854
Pdf URL: https://arxiv.org/pdf/2512.22854
Copy Paste: [[2512.22854]] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning(https://arxiv.org/abs/2512.22854)
Keywords: diffusion
Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain object's geometry consistency and precisely control 6-DoF object transformations in the meantime. To compensate HOI dataset scarcity and leverage existing datasets, we further design a training curriculum that enhances model capabilities in a progressive style and relaxes the demand of hand mesh. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.

Title: Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs

Authors: Ziyu Zhou, Haozhe Luo, Mohammad Reza Hosseinzadeh Taher, Jiaxuan Pang, Xiaowei Ding, Michael B. Gotway, Jianming Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22872
Pdf URL: https://arxiv.org/pdf/2512.22872
Copy Paste: [[2512.22872]] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs(https://arxiv.org/abs/2512.22872)
Keywords: self-supervised, foundation model
Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps' superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.

Title: M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Authors: Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22877
Pdf URL: https://arxiv.org/pdf/2512.22877
Copy Paste: [[2512.22877]] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models(https://arxiv.org/abs/2512.22877)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

Title: Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

Authors: Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Haozhe Jia, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22881
Pdf URL: https://arxiv.org/pdf/2512.22881
Copy Paste: [[2512.22881]] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance(https://arxiv.org/abs/2512.22881)
Keywords: diffusion
Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.

Title: DECEPTICON: How Dark Patterns Manipulate Web Agents

Authors: Phil Cuvin, Hao Zhu, Diyi Yang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22894
Pdf URL: https://arxiv.org/pdf/2512.22894
Copy Paste: [[2512.22894]] DECEPTICON: How Dark Patterns Manipulate Web Agents(https://arxiv.org/abs/2512.22894)
Keywords: in-context
Abstract: Deceptive UI designs, widely instantiated across the web and commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in steering agent trajectories, posing a significant risk to agent robustness. To quantify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 700 web navigation tasks with dark patterns -- 600 generated tasks and 100 real-world tasks, designed to measure instruction-following success and dark pattern effectiveness. Across state-of-the-art agents, we find dark patterns successfully steer agent trajectories towards malicious outcomes in over 70% of tested generated and real-world tasks -- compared to a human average of 31%. Moreover, we find that dark pattern effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading countermeasures against adversarial attacks, including in-context prompting and guardrail models, fail to consistently reduce the success rate of dark pattern interventions. Our findings reveal dark patterns as a latent and unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.

Title: Multiple Token Divergence: Measuring and Steering In-Context Computation Density

Authors: Vincent Herrmann, Eric Alcaide, Michael Wand, Jürgen Schmidhuber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22944
Pdf URL: https://arxiv.org/pdf/2512.22944
Copy Paste: [[2512.22944]] Multiple Token Divergence: Measuring and Steering In-Context Computation Density(https://arxiv.org/abs/2512.22944)
Keywords: in-context
Abstract: Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.

Title: Reverse Personalization

Authors: Han-Wei Kung, Tuomas Varanka, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22984
Pdf URL: https://arxiv.org/pdf/2512.22984
Copy Paste: [[2512.22984]] Reverse Personalization(https://arxiv.org/abs/2512.22984)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling creating personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features rely either on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at this https URL .

Title: Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Authors: Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23035
Pdf URL: https://arxiv.org/pdf/2512.23035
Copy Paste: [[2512.23035]] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion(https://arxiv.org/abs/2512.23035)
Keywords: self-supervised, foundation model
Abstract: Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at this https URL.

Title: 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Authors: Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23042
Pdf URL: https://arxiv.org/pdf/2512.23042
Copy Paste: [[2512.23042]] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds(https://arxiv.org/abs/2512.23042)
Keywords: self-supervised
Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.

Title: PI-MFM: Physics-informed multimodal foundation model for solving partial differential equations

Authors: Min Zhu, Jingmin Sun, Zecheng Zhang, Hayden Schaeffer, Lu Lu
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2512.23056
Pdf URL: https://arxiv.org/pdf/2512.23056
Copy Paste: [[2512.23056]] PI-MFM: Physics-informed multimodal foundation model for solving partial differential equations(https://arxiv.org/abs/2512.23056)
Keywords: foundation model
Abstract: Partial differential equations (PDEs) govern a wide range of physical systems, and recent multimodal foundation models have shown promise for learning PDE solution operators across diverse equation families. However, existing multi-operator learning approaches are data-hungry and neglect physics during training. Here, we propose a physics-informed multimodal foundation model (PI-MFM) framework that directly enforces governing equations during pretraining and adaptation. PI-MFM takes symbolic representations of PDEs as the input, and automatically assembles PDE residual losses from the input expression via a vectorized derivative computation. These designs enable any PDE-encoding multimodal foundation model to be trained or adapted with unified physics-informed objectives across equation families. On a benchmark of 13 parametric one-dimensional time-dependent PDE families, PI-MFM consistently outperforms purely data-driven counterparts, especially with sparse labeled spatiotemporal points, partially observed time domains, or few labeled function pairs. Physics losses further improve robustness against noise, and simple strategies such as resampling collocation points substantially improve accuracy. We also analyze the accuracy, precision, and computational cost of automatic differentiation and finite differences for derivative computation within PI-MFM. Finally, we demonstrate zero-shot physics-informed fine-tuning to unseen PDE families: starting from a physics-informed pretrained model, adapting using only PDE residuals and initial/boundary conditions, without any labeled solution data, rapidly reduces test errors to around 1% and clearly outperforms physics-only training from scratch. These results show that PI-MFM provides a practical and scalable path toward data-efficient, transferable PDE solvers.

Title: TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

Authors: Melikşah Türker, A. Ebrar Kızıloğlu, Onur Güngör, Susan Üsküdarlı
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23065
Pdf URL: https://arxiv.org/pdf/2512.23065
Copy Paste: [[2512.23065]] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish(https://arxiv.org/abs/2512.23065)
Keywords: foundation model
Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories: question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.

Title: Multimodal Functional Maximum Correlation for Emotion Recognition

Authors: Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2512.23076
Pdf URL: https://arxiv.org/pdf/2512.23076
Copy Paste: [[2512.23076]] Multimodal Functional Maximum Correlation for Emotion Recognition(https://arxiv.org/abs/2512.23076)
Keywords: self-supervised
Abstract: Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at this https URL.

Title: MedSAM-based lung masking for multi-label chest X-ray classification

Authors: Brayden Miao, Zain Rehman, Xin Miao, Siming Liu, Jianjie Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23089
Pdf URL: https://arxiv.org/pdf/2512.23089
Copy Paste: [[2512.23089]] MedSAM-based lung masking for multi-label chest X-ray classification(https://arxiv.org/abs/2512.23089)
Keywords: foundation model
Abstract: Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.

Title: Osmotic Learning: A Self-Supervised Paradigm for Decentralized Contextual Data Representation

Authors: Mario Colosi, Reza Farahani, Maria Fazio, Radu Prodan, Massimo Villari
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2512.23096
Pdf URL: https://arxiv.org/pdf/2512.23096
Copy Paste: [[2512.23096]] Osmotic Learning: A Self-Supervised Paradigm for Decentralized Contextual Data Representation(https://arxiv.org/abs/2512.23096)
Keywords: diffusion, self-supervised
Abstract: Data within a specific context gains deeper significance beyond its isolated interpretation. In distributed systems, interdependent data sources reveal hidden relationships and latent structures, representing valuable information for many applications. This paper introduces Osmotic Learning (OSM-L), a self-supervised distributed learning paradigm designed to uncover higher-level latent knowledge from distributed data. The core of OSM-L is osmosis, a process that synthesizes dense and compact representation by extracting contextual information, eliminating the need for raw data exchange between distributed entities. OSM-L iteratively aligns local data representations, enabling information diffusion and convergence into a dynamic equilibrium that captures contextual patterns. During training, it also identifies correlated data groups, functioning as a decentralized clustering mechanism. Experimental results confirm OSM-L's convergence and representation capabilities on structured datasets, achieving over 0.99 accuracy in local information alignment while preserving contextual integrity.

Title: How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure

Authors: Paul M. Thompson
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.23109
Pdf URL: https://arxiv.org/pdf/2512.23109
Copy Paste: [[2512.23109]] How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure(https://arxiv.org/abs/2512.23109)
Keywords: generative
Abstract: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average-a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation-an assumption supported by spectral structure in text and joint image-text embeddings-classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.

Title: PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

Authors: Jian Wang, Sixing Rong, Jiarui Xing, Yuling Xu, Weide Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23130
Pdf URL: https://arxiv.org/pdf/2512.23130
Copy Paste: [[2512.23130]] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion(https://arxiv.org/abs/2512.23130)
Keywords: diffusion, generative
Abstract: We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks, these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.

Title: Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

Authors: Armstrong Foundjem, Lionel Nganyewou Tidjon, Leuson Da Silva, Foutse Khomh
Subjects: cs.CR, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2512.23132
Pdf URL: https://arxiv.org/pdf/2512.23132
Copy Paste: [[2512.23132]] Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems(https://arxiv.org/abs/2512.23132)
Keywords: diffusion, foundation model
Abstract: Machine learning (ML) underpins foundation models in finance, healthcare, and critical infrastructure, making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection-driven jailbreaks and cross-modal manipulation. Traditional cybersecurity lacks ML-specific threat modeling for foundation, multimodal, and RAG systems. Objective: Characterize ML security risks by identifying dominant TTPs, vulnerabilities, and targeted lifecycle stages. Methods: We extract 93 threats from MITRE ATLAS (26), AI Incident Database (12), and literature (55), and analyze 854 GitHub/Python repositories. A multi-agent RAG system (ChatGPT-4o, temp 0.4) mines 300+ articles to build an ontology-driven threat graph linking TTPs, vulnerabilities, and stages. Results: We identify unreported threats including commercial LLM API model stealing, parameter memorization leakage, and preference-guided text-only jailbreaks. Dominant TTPs include MASTERKEY-style jailbreaking, federated poisoning, diffusion backdoors, and preference optimization leakage, mainly impacting pre-training and inference. Graph analysis reveals dense vulnerability clusters in libraries with poor patch propagation. Conclusion: Adaptive, ML-specific security frameworks, combining dependency hygiene, threat intelligence, and monitoring, are essential to mitigate supply-chain and inference risks across the ML lifecycle.

Title: Diffusion-based Decentralized Federated Multi-Task Representation Learning

Authors: Donghwa Kang, Shana Moothedath
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23161
Pdf URL: https://arxiv.org/pdf/2512.23161
Copy Paste: [[2512.23161]] Diffusion-based Decentralized Federated Multi-Task Representation Learning(https://arxiv.org/abs/2512.23161)
Keywords: diffusion
Abstract: Representation learning is a widely adopted framework for learning in data-scarce environments to obtain a feature extractor or representation from various different yet related tasks. Despite extensive research on representation learning, decentralized approaches remain relatively underexplored. This work develops a decentralized projected gradient descent-based algorithm for multi-task representation learning. We focus on the problem of multi-task linear regression in which multiple linear regression models share a common, low-dimensional linear representation. We present an alternating projected gradient descent and minimization algorithm for recovering a low-rank feature matrix in a diffusion-based decentralized and federated fashion. We obtain constructive, provable guarantees that provide a lower bound on the required sample complexity and an upper bound on the iteration complexity of our proposed algorithm. We analyze the time and communication complexity of our algorithm and show that it is fast and communication-efficient. We performed numerical simulations to validate the performance of our algorithm and compared it with benchmark algorithms.

Title: GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23180
Pdf URL: https://arxiv.org/pdf/2512.23180
Copy Paste: [[2512.23180]] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation(https://arxiv.org/abs/2512.23180)
Keywords: generative
Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub this https URL.

Title: Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

Authors: Changgyoon Oh, Jongoh Jeong, Jegyeong Cho, Kuk-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23210
Pdf URL: https://arxiv.org/pdf/2512.23210
Copy Paste: [[2512.23210]] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks(https://arxiv.org/abs/2512.23210)
Keywords: diffusion, generative
Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.

Title: Anka: A Domain-Specific Language for Reliable LLM Code Generation

Authors: Saif Khalfan Saif Al Mazrouei
Subjects: cs.CL, cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.23214
Pdf URL: https://arxiv.org/pdf/2512.23214
Copy Paste: [[2512.23214]] Anka: A Domain-Specific Language for Reliable LLM Code Generation(https://arxiv.org/abs/2512.23214)
Keywords: in-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

Title: Anomaly Detection by Effectively Leveraging Synthetic Images

Authors: Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23227
Pdf URL: https://arxiv.org/pdf/2512.23227
Copy Paste: [[2512.23227]] Anomaly Detection by Effectively Leveraging Synthetic Images(https://arxiv.org/abs/2512.23227)
Keywords: diffusion, generative, anomaly
Abstract: Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.

Title: SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems

Authors: Minwoo Kim, Hongki Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23232
Pdf URL: https://arxiv.org/pdf/2512.23232
Copy Paste: [[2512.23232]] SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems(https://arxiv.org/abs/2512.23232)
Keywords: diffusion
Abstract: Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches which alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein's Unbiased Risk Estimate (SURE) gradient updates and PCA based noise estimation. By mitigating noise induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.

Title: Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network

Authors: Dongsheng Li, Chaobo Chen, Siling Wang, Song Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23234
Pdf URL: https://arxiv.org/pdf/2512.23234
Copy Paste: [[2512.23234]] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network(https://arxiv.org/abs/2512.23234)
Keywords: diffusion
Abstract: Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8\%, an AP$_{50}$ of 84.3\%, and a small-object AP of 25.3\%, surpassing the RT-DETR-R18 baseline by 3.0\%, 6.5\%, and 5.3\%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP$_{50}$ on the IIG and LangGas dataset.

Title: RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23239
Pdf URL: https://arxiv.org/pdf/2512.23239
Copy Paste: [[2512.23239]] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models(https://arxiv.org/abs/2512.23239)
Keywords: diffusion, foundation model, generative
Abstract: Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly select a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85\% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.

Title: ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Authors: Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23245
Pdf URL: https://arxiv.org/pdf/2512.23245
Copy Paste: [[2512.23245]] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation(https://arxiv.org/abs/2512.23245)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: this https URL

Title: Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23258
Pdf URL: https://arxiv.org/pdf/2512.23258
Copy Paste: [[2512.23258]] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization(https://arxiv.org/abs/2512.23258)
Keywords: diffusion
Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$\alpha$, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.

Title: On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases

Authors: Alexander Korotin, Gudmund Pammer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23265
Pdf URL: https://arxiv.org/pdf/2512.23265
Copy Paste: [[2512.23265]] On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases(https://arxiv.org/abs/2512.23265)
Keywords: generative
Abstract: This paper studies the inverse problem of flow matching (FM) between distributions with finite exponential moment, a problem motivated by modern generative AI applications such as the distillation of flow matching models. Uniqueness of the solution is established in two cases - the one-dimensional setting and the Gaussian case. The general multidimensional problem remains open for future studies.

Title: Diffusion priors enhanced velocity model building from time-lag images using a neural operator

Authors: Xiao Ma, Mohammad Hasyim Taufik, Tariq Alkhalifah
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.23375
Pdf URL: https://arxiv.org/pdf/2512.23375
Copy Paste: [[2512.23375]] Diffusion priors enhanced velocity model building from time-lag images using a neural operator(https://arxiv.org/abs/2512.23375)
Keywords: diffusion, generative
Abstract: Velocity model building serves as a crucial component for achieving high precision subsurface imaging. However, conventional velocity model building methods are often computationally expensive and time consuming. In recent years, with the rapid advancement of deep learning, particularly the success of generative models and neural operators, deep learning based approaches that integrate data and their statistics have attracted increasing attention in addressing the limitations of traditional methods. In this study, we propose a novel framework that combines generative models with neural operators to obtain high resolution velocity models efficiently. Within this workflow, the neural operator functions as a forward mapping operator to rapidly generate time lag reverse time migration (RTM) extended images from the true and migration velocity models. In this framework, the neural operator is acting as a surrogate for modeling followed by migration, which uses the true and migration velocities, respectively. The trained neural operator is then employed, through automatic differentiation, to gradually update the migration velocity placed in the true velocity input channel with high resolution components so that the output of the network matches the time lag images of observed data obtained using the migration velocity. By embedding a generative model, trained on a high-resolution velocity model distribution, which corresponds to the true velocity model distribution used to train the neural operator, as a regularizer, the resulting predictions are cleaner with higher resolution information. Both synthetic and field data experiments demonstrate the effectiveness of the proposed generative neural operator based velocity model building approach.

Title: SoulX-LiveTalk Technical Report

Authors: Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Ming Tao, Shunshun Yin, Siyuan Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23379
Pdf URL: https://arxiv.org/pdf/2512.23379
Copy Paste: [[2512.23379]] SoulX-LiveTalk Technical Report(https://arxiv.org/abs/2512.23379)
Keywords: diffusion
Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbf{SoulX-LiveTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism}, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a \textbf{sub-second start-up latency (0.87s)} while reaching a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.

Title: A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers

Authors: Mohammad Nasirzadeh, Jafar Tahmoresnezhad, Parviz Rashidi-Khazaee
Subjects: cs.LG, cs.AI, cs.CV, cs.NI, cs.OS
Abstract URL: https://arxiv.org/abs/2512.23380
Pdf URL: https://arxiv.org/pdf/2512.23380
Copy Paste: [[2512.23380]] A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers(https://arxiv.org/abs/2512.23380)
Keywords: anomaly
Abstract: Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog's superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at this https URL.

Title: SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation

Authors: Xiaolan Li, Wanquan Liu, Pengcheng Li, Pengyu Jie, Chenqiang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23411
Pdf URL: https://arxiv.org/pdf/2512.23411
Copy Paste: [[2512.23411]] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation(https://arxiv.org/abs/2512.23411)
Keywords: foundation model
Abstract: Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.

Title: DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Authors: Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Hangjun Ye, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23421
Pdf URL: https://arxiv.org/pdf/2512.23421
Copy Paste: [[2512.23421]] DriveLaW:Unifying Planning and Video Generation in a Latent Driving World(https://arxiv.org/abs/2512.23421)
Keywords: diffusion
Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

Title: Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Authors: Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23426
Pdf URL: https://arxiv.org/pdf/2512.23426
Copy Paste: [[2512.23426]] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision(https://arxiv.org/abs/2512.23426)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: this https URL

Title: Towards Integrating Uncertainty for Domain-Agnostic Segmentation

Authors: Jesse Brouwers, Xiaoyan Xing, Alexander Timans
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23427
Pdf URL: https://arxiv.org/pdf/2512.23427
Copy Paste: [[2512.23427]] Towards Integrating Uncertainty for Domain-Agnostic Segmentation(https://arxiv.org/abs/2512.23427)
Keywords: foundation model
Abstract: Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.

Title: Stochastic Siamese MAE Pretraining for Longitudinal Medical Images

Authors: Taha Emre, Arunava Chakravarty, Thomas Pinetz, Dmitrii Lachinov, Martin J. Menten, Hendrik Scholl, Sobha Sivaprasad, Daniel Rueckert, Andrew Lotery, Stefan Sacu, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.23441
Pdf URL: https://arxiv.org/pdf/2512.23441
Copy Paste: [[2512.23441]] Stochastic Siamese MAE Pretraining for Longitudinal Medical Images(https://arxiv.org/abs/2512.23441)
Keywords: self-supervised, foundation model
Abstract: Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the 2 input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT and one MRI datasets with multiple visits per patient. STAMP pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late stage Age-Related Macular Degeneration and Alzheimer's Disease progression prediction which require models to learn the underlying non-deterministic temporal dynamics of the diseases.

Title: CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23453
Pdf URL: https://arxiv.org/pdf/2512.23453
Copy Paste: [[2512.23453]] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models(https://arxiv.org/abs/2512.23453)
Keywords: generative
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose \textbf{CoFi-Dec}, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at this https URL.

Title: Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

Authors: Kayathri Vigneswaran, Hugo Retief, Jai Clifford Holmes, Mariangel Garcia Andarcia, Hansaka Tennakoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23454
Pdf URL: https://arxiv.org/pdf/2512.23454
Copy Paste: [[2512.23454]] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin(https://arxiv.org/abs/2512.23454)
Keywords: generative
Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT 4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real time river gauge digitization and improved water resource management.

Title: Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

Authors: Bohan Xiao, Peiyong Wang, Qisheng He, Ming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23463
Pdf URL: https://arxiv.org/pdf/2512.23463
Copy Paste: [[2512.23463]] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators(https://arxiv.org/abs/2512.23463)
Keywords: generative
Abstract: Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: this https URL

Title: HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Authors: Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen, Yue Qian, Ningxin Jiao, Changhai Chen, Weijie Chen, Yiran Wang, Jinkun Guo, Dongyue An, Han Liu, Yanyu Tong, Chao Zhang, Qing Guo, Juan Chen, Qiao Zhang, Youyi Zhang, Zihao Yao, Cheng Zhang, Hong Duan, Xiaoping Wu, Qi Chen, Fei Cheng, Liang Dong, Peng He, Hao Zhang, Jiaxin Lin, Chao Zhang, Zhongyi Fan, Yifan Li, Zhichao Hu, Yuhong Liu, Linus, Jie Jiang, Xiaolong Li, Linchao Bao
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.23464
Pdf URL: https://arxiv.org/pdf/2512.23464
Copy Paste: [[2512.23464]] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation(https://arxiv.org/abs/2512.23464)
Keywords: diffusion
Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

Title: FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence

Authors: Guoan Wan, Tianyu Chen, Fangzheng Feng, Haoyi Zhou, Runhua Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23485
Pdf URL: https://arxiv.org/pdf/2512.23485
Copy Paste: [[2512.23485]] FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence(https://arxiv.org/abs/2512.23485)
Keywords: foundation model
Abstract: Parameter-efficient fine-tuning (PEFT) methods have emerged as a practical solution for adapting large foundation models to downstream tasks, reducing computational and memory costs by updating only a small subset of parameters. Among them, approaches like LoRA aim to strike a balance between efficiency and expressiveness, but often suffer from slow convergence and limited adaptation capacity due to their inherent low-rank constraints. This trade-off hampers the ability of PEFT methods to capture complex patterns needed for diverse tasks. To address these challenges, we propose FRoD, a novel fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom. By extracting a globally shared basis across layers and injecting sparse, learnable perturbations into scaling factors for flexible full-rank updates, FRoD enhances expressiveness and efficiency, leading to faster and more robust convergence. On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy, while using only 1.72% of trainable parameters under identical training budgets.

Title: IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

Authors: Donghao Zhou, Jingyu Lin, Guibao Shen, Quande Liu, Jialin Gao, Lihao Liu, Lan Du, Cunjian Chen, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23519
Pdf URL: https://arxiv.org/pdf/2512.23519
Copy Paste: [[2512.23519]] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation(https://arxiv.org/abs/2512.23519)
Keywords: generative
Abstract: Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

Title: Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

Authors: Hexin Zhang, Dong Li, Jie Huang, Bingzhou Wang, Xueyang Fu, Zhengjun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23532
Pdf URL: https://arxiv.org/pdf/2512.23532
Copy Paste: [[2512.23532]] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution(https://arxiv.org/abs/2512.23532)
Keywords: diffusion
Abstract: Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.

Title: AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23537
Pdf URL: https://arxiv.org/pdf/2512.23537
Copy Paste: [[2512.23537]] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization(https://arxiv.org/abs/2512.23537)
Keywords: diffusion
Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.

Title: PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

Authors: Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23545
Pdf URL: https://arxiv.org/pdf/2512.23545
Copy Paste: [[2512.23545]] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis(https://arxiv.org/abs/2512.23545)
Keywords: foundation model
Abstract: Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.

Title: PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23546
Pdf URL: https://arxiv.org/pdf/2512.23546
Copy Paste: [[2512.23546]] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation(https://arxiv.org/abs/2512.23546)
Keywords: diffusion
Abstract: Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code can refer to this https URL.

Title: ThinkGen: Generalized Thinking for Visual Generation

Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23568
Pdf URL: https://arxiv.org/pdf/2512.23568
Copy Paste: [[2512.23568]] ThinkGen: Generalized Thinking for Visual Generation(https://arxiv.org/abs/2512.23568)
Keywords: diffusion, generative
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: this https URL

Title: ProGuard: Towards Proactive Multimodal Safeguard

Authors: Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23573
Pdf URL: https://arxiv.org/pdf/2512.23573
Copy Paste: [[2512.23573]] ProGuard: Towards Proactive Multimodal Safeguard(https://arxiv.org/abs/2512.23573)
Keywords: generative
Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

Title: LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23576
Pdf URL: https://arxiv.org/pdf/2512.23576
Copy Paste: [[2512.23576]] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation(https://arxiv.org/abs/2512.23576)
Keywords: diffusion
Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

Title: Distribution-Free Process Monitoring with Conformal Prediction

Authors: Christopher Burger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23602
Pdf URL: https://arxiv.org/pdf/2512.23602
Copy Paste: [[2512.23602]] Distribution-Free Process Monitoring with Conformal Prediction(https://arxiv.org/abs/2512.23602)
Keywords: anomaly
Abstract: Traditional Statistical Process Control (SPC) is essential for quality management but is limited by its reliance on often violated statistical assumptions, leading to unreliable monitoring in modern, complex manufacturing environments. This paper introduces a hybrid framework that enhances SPC by integrating the distribution free, model agnostic guarantees of Conformal Prediction. We propose two novel applications: Conformal-Enhanced Control Charts, which visualize process uncertainty and enable proactive signals like 'uncertainty spikes', and Conformal-Enhanced Process Monitoring, which reframes multivariate control as a formal anomaly detection problem using an intuitive p-value chart. Our framework provides a more robust and statistically rigorous approach to quality control while maintaining the interpretability and ease of use of classic methods.

Title: Memorization in 3D Shape Generation: An Empirical Study

Authors: Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, Zhuang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23628
Pdf URL: https://arxiv.org/pdf/2512.23628
Copy Paste: [[2512.23628]] Memorization in 3D Shape Generation: An Empirical Study(https://arxiv.org/abs/2512.23628)
Keywords: diffusion, generative
Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at this https URL.

Title: IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Authors: Kang Du, Yirui Guan, Zeyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23667
Pdf URL: https://arxiv.org/pdf/2512.23667
Copy Paste: [[2512.23667]] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition(https://arxiv.org/abs/2512.23667)
Keywords: diffusion, generative
Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose \textbf{Intrinsic Decomposition Transformer (IDT)}, a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.

Title: Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23705
Pdf URL: https://arxiv.org/pdf/2512.23705
Copy Paste: [[2512.23705]] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation(https://arxiv.org/abs/2512.23705)
Keywords: diffusion, generative
Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

Title: Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Authors: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23709
Pdf URL: https://arxiv.org/pdf/2512.23709
Copy Paste: [[2512.23709]] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion(https://arxiv.org/abs/2512.23709)
Keywords: diffusion
Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: this https URL