2026-01-01

Title: Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Authors: Zahra Abedi, Richard M.K. van Dijk, Gijs Wijnholds, Tessa Verhoef
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23710
Pdf URL: https://arxiv.org/pdf/2512.23710
Copy Paste: [[2512.23710]] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration(https://arxiv.org/abs/2512.23710)
Keywords: generative
Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.

Title: A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios

Authors: Haley Rosso, Talea Mayo
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2512.23748
Pdf URL: https://arxiv.org/pdf/2512.23748
Copy Paste: [[2512.23748]] A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios(https://arxiv.org/abs/2512.23748)
Keywords: diffusion, generative
Abstract: For complex simulation problems, inferring parameters of scientific interest often precludes the use of classical likelihood-based techniques due to intractable likelihood functions. Simulation-based inference (SBI) methods forego the need for explicit likelihoods by directly utilizing samples from the simulator to learn posterior distributions over parameters $\mathbf{\theta}$ given observed data $\mathbf{x}_{\text{o}}$. Recent work has brought attention to diffusion models -- a type of generative model rooted in score matching and reverse-time stochastic dynamics -- as a flexible framework SBI tasks. This article reviews diffusion-based SBI from first principles to applications in practice. We first recall the mathematical foundations of diffusion modeling (forward noising, reverse-time SDE/ODE, probability flow, and denoising score matching) and explain how conditional scores enable likelihood-free posterior sampling. We then examine where diffusion models address pain points of normalizing flows in neural posterior/likelihood estimation and where they introduce new trade-offs (e.g., iterative sampling costs). The key theme of this review is robustness of diffusion-based SBI in non-ideal conditions common to scientific data: misspecification (mismatch between simulated training data and reality), unstructured or infinite-dimensional observations, and missingness. We synthesize methods spanning foundations drawing from Schrodinger-bridge formulations, conditional and sequential posterior samplers, amortized architectures for unstructured data, and inference-time prior adaptation. Throughout, we adopt consistent notation and emphasize conditions and caveats required for accurate posteriors. The review closes with a discussion of open problems with an eye toward applications of uncertainty quantification for probabilistic geophysical models that may benefit from diffusion-based SBI.

Title: Geometric Scaling of Bayesian Inference in LLMs

Authors: Naman Aggarwal, Siddhartha R. Dalal, Vishal Misra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23752
Pdf URL: https://arxiv.org/pdf/2512.23752
Copy Paste: [[2512.23752]] Geometric Scaling of Bayesian Inference in LLMs(https://arxiv.org/abs/2512.23752)
Keywords: in-context
Abstract: Recent work has shown that small transformers trained in controlled "wind-tunnel'' settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

Title: HINTS: Extraction of Human Insights from Time-Series Without External Sources

Authors: Sheo Yon Jhin, Noseong Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23755
Pdf URL: https://arxiv.org/pdf/2512.23755
Copy Paste: [[2512.23755]] HINTS: Extraction of Human Insights from Time-Series Without External Sources(https://arxiv.org/abs/2512.23755)
Keywords: self-supervised
Abstract: Human decision-making, emotions, and collective psychology are complex factors that shape the temporal dynamics observed in financial and economic systems. Many recent time series forecasting models leverage external sources (e.g., news and social media) to capture human factors, but these approaches incur high data dependency costs in terms of financial, computational, and practical implications. In this study, we propose HINTS, a self-supervised learning framework that extracts these latent factors endogenously from time series residuals without external data. HINTS leverages the Friedkin-Johnsen (FJ) opinion dynamics model as a structural inductive bias to model evolving social influence, memory, and bias patterns. The extracted human factors are integrated into a state-of-the-art backbone model as an attention map. Experimental results using nine real-world and benchmark datasets demonstrate that HINTS consistently improves forecasting accuracy. Furthermore, multiple case studies and ablation studies validate the interpretability of HINTS, demonstrating strong semantic alignment between the extracted factors and real-world events, demonstrating the practical utility of HINTS.

Title: A Survey on Graph Neural Networks for Fraud Detection in Ride Hailing Platforms

Authors: Kanishka Hewageegana, Janani Harischandra, Nipuna Senanayake, Gihan Danansuriya, Kavindu Hapuarachchi, Pooja Illangarathne
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23777
Pdf URL: https://arxiv.org/pdf/2512.23777
Copy Paste: [[2512.23777]] A Survey on Graph Neural Networks for Fraud Detection in Ride Hailing Platforms(https://arxiv.org/abs/2512.23777)
Keywords: anomaly
Abstract: This study investigates fraud detection in ride hailing platforms through Graph Neural Networks (GNNs),focusing on the effectiveness of various models. By analyzing prevalent fraudulent activities, the research highlights and compares the existing work related to fraud detection which can be useful when addressing fraudulent incidents within the online ride hailing platforms. Also, the paper highlights addressing class imbalance and fraudulent camouflage. It also outlines a structured overview of GNN architectures and methodologies applied to anomaly detection, identifying significant methodological progress and gaps. The paper calls for further exploration into real-world applicability and technical improvements to enhance fraud detection strategies in the rapidly evolving ride-hailing industry.

Title: Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

Authors: Ankan Aich, Yangming Lee
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.23786
Pdf URL: https://arxiv.org/pdf/2512.23786
Copy Paste: [[2512.23786]] Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments(https://arxiv.org/abs/2512.23786)
Keywords: self-supervised, foundation model
Abstract: Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.

Title: Exploiting the Prior of Generative Time Series Imputation

Authors: YuYang Miao, Chang Li, Zehua Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23832
Pdf URL: https://arxiv.org/pdf/2512.23832
Copy Paste: [[2512.23832]] Exploiting the Prior of Generative Time Series Imputation(https://arxiv.org/abs/2512.23832)
Keywords: diffusion, generative
Abstract: Time series imputation, i.e., filling the missing values of a time recording, finds various applications in electricity, finance, and weather modelling. Previous methods have introduced generative models such as diffusion probabilistic models and Schrodinger bridge models to conditionally generate the missing values from Gaussian noise or directly from linear interpolation results. However, as their prior is not informative to the ground-truth target, their generation process inevitably suffer increased burden and limited imputation accuracy. In this work, we present Bridge-TS, building a data-to-data generation process for generative time series imputation and exploiting the design of prior with two novel designs. Firstly, we propose expert prior, leveraging a pretrained transformer-based module as an expert to fill the missing values with a deterministic estimation, and then taking the results as the prior of ground truth target. Secondly, we explore compositional priors, utilizing several pretrained models to provide different estimation results, and then combining them in the data-to-data generation process to achieve a compositional priors-to-target imputation process. Experiments conducted on several benchmark datasets such as ETT, Exchange, and Weather show that Bridge-TS reaches a new record of imputation accuracy in terms of mean square error and mean absolute error, demonstrating the superiority of improving prior for generative time series imputation.

Title: Flow Matching Neural Processes

Authors: Hussen Abu Hamad, Dan Rosenbaum
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23853
Pdf URL: https://arxiv.org/pdf/2512.23853
Copy Paste: [[2512.23853]] Flow Matching Neural Processes(https://arxiv.org/abs/2512.23853)
Keywords: generative
Abstract: Neural processes (NPs) are a class of models that learn stochastic processes directly from data and can be used for inference, sampling and conditional sampling. We introduce a new NP model based on flow matching, a generative modeling paradigm that has demonstrated strong performance on various data modalities. Following the NP training framework, the model provides amortized predictions of conditional distributions over any arbitrary points in the data. Compared to previous NP models, our model is simple to implement and can be used to sample from conditional distributions using an ODE solver, without requiring auxiliary conditioning methods. In addition, the model provides a controllable tradeoff between accuracy and running time via the number of steps in the ODE solver. We show that our model outperforms previous state-of-the-art neural process methods on various benchmarks including synthetic 1D Gaussian processes data, 2D images, and real-world weather data.

Title: Lifelong Domain Adaptive 3D Human Pose Estimation

Authors: Qucheng Peng, Hongfei Xue, Pu Wang, Chen Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23860
Pdf URL: https://arxiv.org/pdf/2512.23860
Copy Paste: [[2512.23860]] Lifelong Domain Adaptive 3D Human Pose Estimation(https://arxiv.org/abs/2512.23860)
Keywords: generative
Abstract: 3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

Title: Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

Authors: Charith Wickrema, Eliza Mace, Hunter Brown, Heidys Cabrera, Nick Krall, Matthew O'Neill, Shivangi Sarkar, Lowell Weissman, Eric Hughes, Guido Zarrella
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23903
Pdf URL: https://arxiv.org/pdf/2512.23903
Copy Paste: [[2512.23903]] Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale(https://arxiv.org/abs/2512.23903)
Keywords: foundation model, generative
Abstract: We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

Title: T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Authors: Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23953
Pdf URL: https://arxiv.org/pdf/2512.23953
Copy Paste: [[2512.23953]] T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models(https://arxiv.org/abs/2512.23953)
Keywords: diffusion
Abstract: The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.

Title: Efficient Context Scaling with LongCat ZigZag Attention

Authors: Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23966
Pdf URL: https://arxiv.org/pdf/2512.23966
Copy Paste: [[2512.23966]] Efficient Context Scaling with LongCat ZigZag Attention(https://arxiv.org/abs/2512.23966)
Keywords: foundation model
Abstract: We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

Title: Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems

Authors: Tinglong Dai, David Simchi-Levi, Michelle Xiao Wu, Yao Xie
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2512.23978
Pdf URL: https://arxiv.org/pdf/2512.23978
Copy Paste: [[2512.23978]] Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems(https://arxiv.org/abs/2512.23978)
Keywords: generative
Abstract: Generative artificial intelligence (GenAI) is shifting from conversational assistants toward agentic systems -- autonomous decision-making systems that sense, decide, and act within operational workflows. This shift creates an autonomy paradox: as GenAI systems are granted greater operational autonomy, they should, by design, embody more formal structure, more explicit constraints, and stronger tail-risk discipline. We argue stochastic generative models can be fragile in operational domains unless paired with mechanisms that provide verifiable feasibility, robustness to distribution shift, and stress testing under high-consequence scenarios. To address this challenge, we develop a conceptual framework for assured autonomy grounded in operations research (OR), built on two complementary approaches. First, flow-based generative models frame generation as deterministic transport characterized by an ordinary differential equation, enabling auditability, constraint-aware generation, and connections to optimal transport, robust optimization, and sequential decision control. Second, operational safety is formulated through an adversarial robustness lens: decision rules are evaluated against worst-case perturbations within uncertainty or ambiguity sets, making unmodeled risks part of the design. This framework clarifies how increasing autonomy shifts OR's role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.

Title: DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

Authors: Yuang Jia, Jinlong Wang, Jiayi Zhao, Chunlam Li, Shunzhou Wang, Wei Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23983
Pdf URL: https://arxiv.org/pdf/2512.23983
Copy Paste: [[2512.23983]] DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation(https://arxiv.org/abs/2512.23983)
Keywords: diffusion
Abstract: This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model,and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

Title: Anomaly detection in satellite imagery through temporal inpainting

Authors: Bertrand Rouet-Leduc, Claudia Hulbert
Subjects: cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.23986
Pdf URL: https://arxiv.org/pdf/2512.23986
Copy Paste: [[2512.23986]] Anomaly detection in satellite imagery through temporal inpainting(https://arxiv.org/abs/2512.23986)
Keywords: foundation model, anomaly
Abstract: Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.

Title: Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

Authors: Haotang Li, Zhenyu Qi, Hao Qin, Huanrui Yang, Sen He, Kebin Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23997
Pdf URL: https://arxiv.org/pdf/2512.23997
Copy Paste: [[2512.23997]] Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation(https://arxiv.org/abs/2512.23997)
Keywords: self-supervised
Abstract: Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose \textbf{GASeg}, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is Differentiable Box-Counting (\textbf{DBC}) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (\textbf{TopoAug}), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, \textbf{GALoss}, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.

Title: Tracing the Heart's Pathways: ECG Representation Learning from a Cardiac Conduction Perspective

Authors: Tan Pan, Yixuan Sun, Chen Jiang, Qiong Gao, Rui Sun, Xingmeng Zhang, Zhenqi Yang, Limei Han, Yixiu Liang, Yuan Cheng, Kaiyu Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24002
Pdf URL: https://arxiv.org/pdf/2512.24002
Copy Paste: [[2512.24002]] Tracing the Heart's Pathways: ECG Representation Learning from a Cardiac Conduction Perspective(https://arxiv.org/abs/2512.24002)
Keywords: self-supervised
Abstract: The multi-lead electrocardiogram (ECG) stands as a cornerstone of cardiac diagnosis. Recent strides in electrocardiogram self-supervised learning (eSSL) have brightened prospects for enhancing representation learning without relying on high-quality annotations. Yet earlier eSSL methods suffer a key limitation: they focus on consistent patterns across leads and beats, overlooking the inherent differences in heartbeats rooted in cardiac conduction processes, while subtle but significant variations carry unique physiological signatures. Moreover, representation learning for ECG analysis should align with ECG diagnostic guidelines, which progress from individual heartbeats to single leads and ultimately to lead combinations. This sequential logic, however, is often neglected when applying pre-trained models to downstream tasks. To address these gaps, we propose CLEAR-HUG, a two-stage framework designed to capture subtle variations in cardiac conduction across leads while adhering to ECG diagnostic guidelines. In the first stage, we introduce an eSSL model termed Conduction-LEAd Reconstructor (CLEAR), which captures both specific variations and general commonalities across heartbeats. Treating each heartbeat as a distinct entity, CLEAR employs a simple yet effective sparse attention mechanism to reconstruct signals without interference from other heartbeats. In the second stage, we implement a Hierarchical lead-Unified Group head (HUG) for disease diagnosis, mirroring clinical workflow. Experimental results across six tasks show a 6.84% improvement, validating the effectiveness of CLEAR-HUG. This highlights its ability to enhance representations of cardiac conduction and align patterns with expert diagnostic guidelines.

Title: On Exact Editing of Flow-Based Diffusion Models

Authors: Zixiang Li, Yue Song, Jianing Peng, Ting Liu, Jun Huang, Xiaochao Qu, Luoqi Liu, Wei Wang, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24015
Pdf URL: https://arxiv.org/pdf/2512.24015
Copy Paste: [[2512.24015]] On Exact Editing of Flow-Based Diffusion Models(https://arxiv.org/abs/2512.24015)
Keywords: diffusion
Abstract: Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

Title: PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

Authors: Mustafa Munir, Md Mostafijur Rahman, Kartikeya Bhardwaj, Paul Whatmough, Radu Marculescu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24026
Pdf URL: https://arxiv.org/pdf/2512.24026
Copy Paste: [[2512.24026]] PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing(https://arxiv.org/abs/2512.24026)
Keywords: diffusion
Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).

Title: Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

Authors: Xinran Qin, Yuhui Quan, Ruotao Xu, Hui Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24035
Pdf URL: https://arxiv.org/pdf/2512.24035
Copy Paste: [[2512.24035]] Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising(https://arxiv.org/abs/2512.24035)
Keywords: diffusion
Abstract: Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed composite a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.

Title: RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

Authors: Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao, Wangli Lan, Jing Lin, Zhixin Ma, Tingting Zhou, Harry Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24086
Pdf URL: https://arxiv.org/pdf/2512.24086
Copy Paste: [[2512.24086]] RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention(https://arxiv.org/abs/2512.24086)
Keywords: diffusion, generative
Abstract: In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.

Title: Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

Authors: Yongtao Chen, Yanbo Wang, Wentao Zhao, Guole Shen, Tianchen Deng, Jingchuan Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.24111
Pdf URL: https://arxiv.org/pdf/2512.24111
Copy Paste: [[2512.24111]] Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks(https://arxiv.org/abs/2512.24111)
Keywords: diffusion, generative
Abstract: Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

Title: GARDO: Reinforcing Diffusion Models without Reward Hacking

Authors: Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.24138
Pdf URL: https://arxiv.org/pdf/2512.24138
Copy Paste: [[2512.24138]] GARDO: Reinforcing Diffusion Models without Reward Hacking(https://arxiv.org/abs/2512.24138)
Keywords: diffusion
Abstract: Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.

Title: Activation Steering for Masked Diffusion Language Models

Authors: Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24143
Pdf URL: https://arxiv.org/pdf/2512.24143
Copy Paste: [[2512.24143]] Activation Steering for Masked Diffusion Language Models(https://arxiv.org/abs/2512.24143)
Keywords: diffusion
Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs.\ response).

Title: Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

Authors: Chubin Chen, Sujie Hu, Jiashu Zhu, Meiqi Wu, Jintao Chen, Yanxun Li, Nisha Huang, Chengyu Fang, Jiahong Wu, Xiangxiang Chu, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24146
Pdf URL: https://arxiv.org/pdf/2512.24146
Copy Paste: [[2512.24146]] Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning(https://arxiv.org/abs/2512.24146)
Keywords: diffusion, generative
Abstract: Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC)-a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.

Title: Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

Authors: TsaiChing Ni, ZhenQi Chen, YuanFu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24160
Pdf URL: https://arxiv.org/pdf/2512.24160
Copy Paste: [[2512.24160]] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset(https://arxiv.org/abs/2512.24160)
Keywords: diffusion, foundation model, generative
Abstract: We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

Title: DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Authors: Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24165
Pdf URL: https://arxiv.org/pdf/2512.24165
Copy Paste: [[2512.24165]] DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models(https://arxiv.org/abs/2512.24165)
Keywords: diffusion, generative
Abstract: While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2\%) and Gemini-3-Flash (+111.6\%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0\%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

Title: Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

Authors: Yu-Tang Chang, Pin-Wei Chen, Shih-Fang Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24172
Pdf URL: https://arxiv.org/pdf/2512.24172
Copy Paste: [[2512.24172]] Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges(https://arxiv.org/abs/2512.24172)
Keywords: foundation model
Abstract: Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at this https URL.

Title: Guiding a Diffusion Transformer with the Internal Dynamics of Itself

Authors: Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, Shuhang Gu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24176
Pdf URL: https://arxiv.org/pdf/2512.24176
Copy Paste: [[2512.24176]] Guiding a Diffusion Transformer with the Internal Dynamics of Itself(https://arxiv.org/abs/2512.24176)
Keywords: diffusion, generative
Abstract: The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we proposed a simple yet effective strategy Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34 which achieves a large margin between all of these methods. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.

Title: CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Authors: Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24195
Pdf URL: https://arxiv.org/pdf/2512.24195
Copy Paste: [[2512.24195]] CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers(https://arxiv.org/abs/2512.24195)
Keywords: diffusion
Abstract: Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.

Title: Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

Authors: Sina Jahromi, Farshid Hajati, Alireza Rezaee, Javaher Nourian
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24214
Pdf URL: https://arxiv.org/pdf/2512.24214
Copy Paste: [[2512.24214]] Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19(https://arxiv.org/abs/2512.24214)
Keywords: generative
Abstract: The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

Title: ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Authors: Ziquan Liu, Zhewei Zhu, Xuyang Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24224
Pdf URL: https://arxiv.org/pdf/2512.24224
Copy Paste: [[2512.24224]] ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation(https://arxiv.org/abs/2512.24224)
Keywords: foundation model
Abstract: Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a ``train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.

Title: Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Authors: Shuyun Wang, Haiyang Sun, Bing Wang, Hangjun Ye, Xin Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24227
Pdf URL: https://arxiv.org/pdf/2512.24227
Copy Paste: [[2512.24227]] Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes(https://arxiv.org/abs/2512.24227)
Keywords: diffusion
Abstract: Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose \textbf{Mirage}, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at this https URL.

Title: MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

Authors: Rahul Medicharla, Alper Yilmaz
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24231
Pdf URL: https://arxiv.org/pdf/2512.24231
Copy Paste: [[2512.24231]] MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model(https://arxiv.org/abs/2512.24231)
Keywords: foundation model
Abstract: In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more incentivizing for in-the-wild application. The code is available at this https URL.

Title: Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

Authors: Zhi Li, Yaqi Wang, Bingtao Ma, Yifan Zhang, Huiyu Zhou, Shuai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24260
Pdf URL: https://arxiv.org/pdf/2512.24260
Copy Paste: [[2512.24260]] Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT(https://arxiv.org/abs/2512.24260)
Keywords: diffusion, foundation model
Abstract: Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: this https URL

Title: Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Authors: Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24271
Pdf URL: https://arxiv.org/pdf/2512.24271
Copy Paste: [[2512.24271]] Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation(https://arxiv.org/abs/2512.24271)
Keywords: diffusion
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.

Title: One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

Authors: Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Zhihua Wang, Fei Wu, Quanlin Li, Pinghong Zhou, Shuo Wang, Xian Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24278
Pdf URL: https://arxiv.org/pdf/2512.24278
Copy Paste: [[2512.24278]] One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training(https://arxiv.org/abs/2512.24278)
Keywords: generative
Abstract: Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

Title: Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

Authors: Md. Enamul Hoq, Linda Larson-Prior, Fred Prior
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24294
Pdf URL: https://arxiv.org/pdf/2512.24294
Copy Paste: [[2512.24294]] Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction(https://arxiv.org/abs/2512.24294)
Keywords: foundation model
Abstract: Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.

Title: Skim-Aware Contrastive Learning for Efficient Document Representation

Authors: Waheed Ahmed Abro, Zied Bouraoui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24373
Pdf URL: https://arxiv.org/pdf/2512.24373
Copy Paste: [[2512.24373]] Skim-Aware Contrastive Learning for Efficient Document Representation(https://arxiv.org/abs/2512.24373)
Keywords: self-supervised
Abstract: Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.

Title: Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Authors: Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.24385
Pdf URL: https://arxiv.org/pdf/2512.24385
Copy Paste: [[2512.24385]] Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems(https://arxiv.org/abs/2512.24385)
Keywords: foundation model
Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

Title: DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Authors: Bohong Chen, Haiyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24408
Pdf URL: https://arxiv.org/pdf/2512.24408
Copy Paste: [[2512.24408]] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model(https://arxiv.org/abs/2512.24408)
Keywords: generative
Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that could generate video in real-time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) We propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple-and-effective method significantly surpass alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights and codes are available.

Title: Generative forecasting with joint probability models

Authors: Patrick Wyrod, Ashesh Chattopadhyay, Daniele Venturi
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2512.24446
Pdf URL: https://arxiv.org/pdf/2512.24446
Copy Paste: [[2512.24446]] Generative forecasting with joint probability models(https://arxiv.org/abs/2512.24446)
Keywords: generative
Abstract: Chaotic dynamical systems exhibit strong sensitivity to initial conditions and often contain unresolved multiscale processes, making deterministic forecasting fundamentally limited. Generative models offer an appealing alternative by learning distributions over plausible system evolutions; yet, most existing approaches focus on next-step conditional prediction rather than the structure of the underlying dynamics. In this work, we reframe forecasting as a fully generative problem by learning the joint probability distribution of lagged system states over short temporal windows and obtaining forecasts through marginalization. This new perspective allows the model to capture nonlinear temporal dependencies, represent multistep trajectory segments, and produce next-step predictions consistent with the learned joint distribution. We also introduce a general, model-agnostic training and inference framework for joint generative forecasting and show how it enables assessment of forecast robustness and reliability using three complementary uncertainty quantification metrics (ensemble variance, short-horizon autocorrelation, and cumulative Wasserstein drift), without access to ground truth. We evaluate the performance of the proposed method on two canonical chaotic dynamical systems, the Lorenz-63 system and the Kuramoto-Sivashinsky equation, and show that joint generative models yield improved short-term predictive skill, preserve attractor geometry, and achieve substantially more accurate long-range statistical behaviour than conventional conditional next-step models.

Title: F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

Authors: Devendra K. Jangid, Ripon K. Saha, Dilshan Godaliyadda, Jing Li, Seok-Jun Lee, Hamid R. Sheikh
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2512.24473
Pdf URL: https://arxiv.org/pdf/2512.24473
Copy Paste: [[2512.24473]] F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model(https://arxiv.org/abs/2512.24473)
Keywords: diffusion, foundation model, generative
Abstract: With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high level features, which often cannot describe subtle textures in an image. Additionally, Smartphone LR images are at least $12MP$, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images ($<1MP$). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text feature. To address these shortcomings, we introduce an SISR network built on a FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower level features provide stricter conditioning while being rich descriptors of even small patches.

Title: Training-Free Color-Aware Adversarial Diffusion Sanitization for Diffusion Stegomalware Defense at Security Gateways

Authors: Vladimir Frants, Sos Agaian
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2512.24499
Pdf URL: https://arxiv.org/pdf/2512.24499
Copy Paste: [[2512.24499]] Training-Free Color-Aware Adversarial Diffusion Sanitization for Diffusion Stegomalware Defense at Security Gateways(https://arxiv.org/abs/2512.24499)
Keywords: diffusion, generative
Abstract: The rapid expansion of generative AI has normalized large-scale synthetic media creation, enabling new forms of covert communication. Recent generative steganography methods, particularly those based on diffusion models, can embed high-capacity payloads without fine-tuning or auxiliary decoders, creating significant challenges for detection and remediation. Coverless diffusion-based techniques are difficult to counter because they generate image carriers directly from secret data, enabling attackers to deliver stegomalware for command-and-control, payload staging, and data exfiltration while bypassing detectors that rely on cover-stego discrepancies. This work introduces Adversarial Diffusion Sanitization (ADS), a training-free defense for security gateways that neutralizes hidden payloads rather than detecting them. ADS employs an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders and incorporates a color-aware, quaternion-coupled update rule to reduce artifacts under strict distortion limits. Under a practical threat model and in evaluation against the state-of-the-art diffusion steganography method Pulsar, ADS drives decoder success rates to near zero with minimal perceptual impact. Results demonstrate that ADS provides a favorable security-utility trade-off compared to standard content transformations, offering an effective mitigation strategy against diffusion-driven steganography.

Title: Using Large Language Models To Translate Machine Results To Human Results

Authors: Trishna Niraula, Jonathan Stubblefield
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24518
Pdf URL: https://arxiv.org/pdf/2512.24518
Copy Paste: [[2512.24518]] Using Large Language Models To Translate Machine Results To Human Results(https://arxiv.org/abs/2512.24518)
Keywords: anomaly
Abstract: Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.

Title: Do Large Language Models Know What They Are Capable Of?

Authors: Casey O. Barkan, Sid Black, Oliver Sourbut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24661
Pdf URL: https://arxiv.org/pdf/2512.24661
Copy Paste: [[2512.24661]] Do Large Language Models Know What They Are Capable Of?(https://arxiv.org/abs/2512.24661)
Keywords: in-context
Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.

Title: HeteroHBA: A Generative Structure-Manipulating Backdoor Attack on Heterogeneous Graphs

Authors: Honglin Gao, Lan Zhao, Junhao Ren, Xiang Li, Gaoxi Xiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.24665
Pdf URL: https://arxiv.org/pdf/2512.24665
Copy Paste: [[2512.24665]] HeteroHBA: A Generative Structure-Manipulating Backdoor Attack on Heterogeneous Graphs(https://arxiv.org/abs/2512.24665)
Keywords: generative
Abstract: Heterogeneous graph neural networks (HGNNs) have achieved strong performance in many real-world applications, yet targeted backdoor poisoning on heterogeneous graphs remains less studied. We consider backdoor attacks for heterogeneous node classification, where an adversary injects a small set of trigger nodes and connections during training to force specific victim nodes to be misclassified into an attacker-chosen label at test time while preserving clean performance. We propose HeteroHBA, a generative backdoor framework that selects influential auxiliary neighbors for trigger attachment via saliency-based screening and synthesizes diverse trigger features and connection patterns to better match the local heterogeneous context. To improve stealthiness, we combine Adaptive Instance Normalization (AdaIN) with a Maximum Mean Discrepancy (MMD) loss to align the trigger feature distribution with benign statistics, thereby reducing detectability, and we optimize the attack with a bilevel objective that jointly promotes attack success and maintains clean accuracy. Experiments on multiple real-world heterogeneous graphs with representative HGNN architectures show that HeteroHBA consistently achieves higher attack success than prior backdoor baselines with comparable or smaller impact on clean accuracy; moreover, the attack remains effective under our heterogeneity-aware structural defense, CSD. These results highlight practical backdoor risks in heterogeneous graph learning and motivate the development of stronger defenses.

Title: Nested Learning: The Illusion of Deep Learning Architectures

Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24695
Pdf URL: https://arxiv.org/pdf/2512.24695
Copy Paste: [[2512.24695]] Nested Learning: The Illusion of Deep Learning Architectures(https://arxiv.org/abs/2512.24695)
Keywords: in-context
Abstract: Despite the recent progresses, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each of which with its own context flow. Through the lenses of NL, existing deep learning methods learns from data through compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory system that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, and few-shot generalization tasks, continual learning, and long-context reasoning tasks.

Title: BandiK: Efficient Multi-Task Decomposition Using a Multi-Bandit Framework

Authors: András Millinghoffer (1 and 2), András Formanek (1 and 3), András Antos (1), Péter Antal (1 and 2) ((1) Department of Artificial Intelligence and Systems Engineering, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, (2) E-Group ICT Software Zrt., Budapest, Hungary, (3) Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24708
Pdf URL: https://arxiv.org/pdf/2512.24708
Copy Paste: [[2512.24708]] BandiK: Efficient Multi-Task Decomposition Using a Multi-Bandit Framework(https://arxiv.org/abs/2512.24708)
Keywords: foundation model
Abstract: The challenge of effectively transferring knowledge across multiple tasks is of critical importance and is also present in downstream tasks with foundation models. However, the nature of transfer, its transitive-intransitive nature, is still an open problem, and negative transfer remains a significant obstacle. Selection of beneficial auxiliary task sets in multi-task learning is frequently hindered by the high computational cost of their evaluation, the high number of plausible candidate auxiliary sets, and the varying complexity of selection across target tasks. To address these constraints, we introduce BandiK, a novel three-stage multi-task auxiliary task subset selection method using multi-bandits, where each arm pull evaluates candidate auxiliary sets by training and testing a multiple output neural network on a single random train-test dataset split. Firstly, BandiK estimates the pairwise transfers between tasks, which helps in identifying which tasks are likely to benefit from joint learning. In the second stage, it constructs a linear number of candidate sets of auxiliary tasks (in the number of all tasks) for each target task based on the initial estimations, significantly reducing the exponential number of potential auxiliary task sets. Thirdly, it employs a Multi-Armed Bandit (MAB) framework for each task, where the arms correspond to the performance of candidate auxiliary sets realized as multiple output neural networks over train-test data set splits. To enhance efficiency, BandiK integrates these individual task-specific MABs into a multi-bandit structure. The proposed multi-bandit solution exploits that the same neural network realizes multiple arms of different individual bandits corresponding to a given candidate set. This semi-overlapping arm property defines a novel multi-bandit cost/reward structure utilized in BandiK.

Title: Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks

Authors: Shota Suzuki, Satoshi Ono
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2512.24793
Pdf URL: https://arxiv.org/pdf/2512.24793
Copy Paste: [[2512.24793]] Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks(https://arxiv.org/abs/2512.24793)
Keywords: self-supervised
Abstract: Neural architecture search (NAS), which automates the architectural design process of deep neural networks (DNN), has attracted increasing attention. Multimodal DNNs that necessitate feature fusion from multiple modalities benefit from NAS due to their structural complexity; however, constructing an architecture for multimodal DNNs through NAS requires a substantial amount of labeled training data. Thus, this paper proposes a self-supervised learning (SSL) method for architecture search of multimodal DNNs. The proposed method applies SSL comprehensively for both the architecture search and model pretraining processes. Experimental results demonstrated that the proposed method successfully designed architectures for DNNs from unlabeled training data.

Title: AODDiff: Probabilistic Reconstruction of Aerosol Optical Depth via Diffusion-based Bayesian Inference

Authors: Linhao Fan, Hongqiang Fang, Jingyang Dai, Yong Jiang, Qixing Zhang
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2512.24847
Pdf URL: https://arxiv.org/pdf/2512.24847
Copy Paste: [[2512.24847]] AODDiff: Probabilistic Reconstruction of Aerosol Optical Depth via Diffusion-based Bayesian Inference(https://arxiv.org/abs/2512.24847)
Keywords: diffusion, generative
Abstract: High-quality reconstruction of Aerosol Optical Depth (AOD) fields is critical for Atmosphere monitoring, yet current models remain constrained by the scarcity of complete training data and a lack of uncertainty this http URL address these limitations, we propose AODDiff, a probabilistic reconstruction framework based on diffusion-based Bayesian inference. By leveraging the learned spatiotemporal probability distribution of the AOD field as a generative prior, this framework can be flexibly adapted to various reconstruction tasks without requiring task-specific retraining. We first introduce a corruption-aware training strategy to learns a spatiotemporal AOD prior solely from naturally incomplete data. Subsequently, we employ a decoupled annealing posterior sampling strategy that enables the more effective and integration of heterogeneous observations as constraints to guide the generation process. We validate the proposed framework through extensive experiments on Reanalysis data. Results across downscaling and inpainting tasks confirm the efficacy and robustness of AODDiff, specifically demonstrating its advantage in maintaining high spatial spectral fidelity. Furthermore, as a generative model, AODDiff inherently enables uncertainty quantification via multiple sampling, offering critical confidence metrics for downstream applications.

Title: Characterization of Transfer Using Multi-task Learning Curves

Authors: András Millinghoffer, Bence Bolgár, Péter Antal
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.24866
Pdf URL: https://arxiv.org/pdf/2512.24866
Copy Paste: [[2512.24866]] Characterization of Transfer Using Multi-task Learning Curves(https://arxiv.org/abs/2512.24866)
Keywords: foundation model
Abstract: Transfer effects manifest themselves both during training using a fixed data set and in inductive inference using accumulating data. We hypothesize that perturbing the data set by including more samples, instead of perturbing the model by gradient updates, provides a complementary and more fundamental characterization of transfer effects. To capture this phenomenon, we quantitatively model transfer effects using multi-task learning curves approximating the inductive performance over varying sample sizes. We describe an efficient method to approximate multi-task learning curves analogous to the Task Affinity Grouping method applied during training. We compare the statistical and computational approaches to transfer, which indicates considerably higher compute costs for the previous but better power and broader applicability. Evaluations are performed using a benchmark drug-target interaction data set. Our results show that learning curves can better capture the effects of multi-task learning and their multi-task extensions can delineate pairwise and contextual transfer effects in foundation models.

Title: Towards Provably Secure Generative AI: Reliable Consensus Sampling

Authors: Yu Cui, Hang Fu, Sicheng Pan, Zhuoyu Sun, Yifei Liu, Yuhong Nie, Bo Ran, Baohan Huang, Xufeng Zhang, Haibin Zhang, Cong Zuo, Licheng Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.24925
Pdf URL: https://arxiv.org/pdf/2512.24925
Copy Paste: [[2512.24925]] Towards Provably Secure Generative AI: Reliable Consensus Sampling(https://arxiv.org/abs/2512.24925)
Keywords: generative
Abstract: Existing research on generative AI security is primarily driven by mutually reinforcing attack and defense methodologies grounded in empirical experience. This dynamic frequently gives rise to previously unknown attacks that can circumvent current detection and prevention. This necessitates the continual updating of security mechanisms. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. Consensus Sampling (CS) is a promising algorithm toward provably secure AI. It controls risk by leveraging overlap in model output probabilities. However, we find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility. Moreover, CS becomes highly vulnerable when unsafe models are maliciously manipulated. To address these issues, we propose a new primitive called Reliable Consensus Sampling (RCS), that traces acceptance probability to tolerate extreme adversarial behaviors, improving robustness. RCS also eliminates the need for abstention entirely. We further develop a feedback algorithm to continuously and dynamically enhance the safety of RCS. We provide theoretical guarantees that RCS maintains a controllable risk threshold. Extensive experiments show that RCS significantly improves robustness and utility while maintaining latency comparable to CS. We hope this work contributes to the development of provably secure generative AI.

Title: HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

Authors: Rongji Xun, Junjie Yuan, Zhongjie Wang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2512.24946
Pdf URL: https://arxiv.org/pdf/2512.24946
Copy Paste: [[2512.24946]] HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films(https://arxiv.org/abs/2512.24946)
Keywords: diffusion
Abstract: Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source this http URL propose HaineiFRDM(Film Restoration Diffusion Model), a film restoration framework, to explore diffusion model's powerful content-understanding ability to help human expert better restore indistinguishable film this http URL, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAMR GPU possible and design a position-aware Global Prompt and Frame Fusion this http URL, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we firstly restore a low-resolution result and use it as global residual to mitigate blocky artifacts caused by patching this http URL, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic this http URL experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.

Title: ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Authors: Xinran Gong, Gorkem Durak, Halil Ertugrul Aktas, Vedat Cicek, Jinkui Hao, Ulas Bagci, Nilay S. Shah, Bo Zhou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24948
Pdf URL: https://arxiv.org/pdf/2512.24948
Copy Paste: [[2512.24948]] ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT(https://arxiv.org/abs/2512.24948)
Keywords: diffusion, generative
Abstract: Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered for it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

Title: VIPER: Process-aware Evaluation for Generative Video Reasoning

Authors: Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.24952
Pdf URL: https://arxiv.org/pdf/2512.24952
Copy Paste: [[2512.24952]] VIPER: Process-aware Evaluation for Generative Video Reasoning(https://arxiv.org/abs/2512.24952)
Keywords: generative
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

Title: ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2512.24965
Pdf URL: https://arxiv.org/pdf/2512.24965
Copy Paste: [[2512.24965]] ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands(https://arxiv.org/abs/2512.24965)
Keywords: generative
Abstract: Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$\pi$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$\pi$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world. The code is available at this https URL.

Title: FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Authors: Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.25008
Pdf URL: https://arxiv.org/pdf/2512.25008
Copy Paste: [[2512.25008]] FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM(https://arxiv.org/abs/2512.25008)
Keywords: foundation model
Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.

Title: Diffusion Language Models are Provably Optimal Parallel Samplers

Authors: Haozhe Jiang, Nika Haghtalab, Lijie Chen
Subjects: cs.LG, cs.CC
Abstract URL: https://arxiv.org/abs/2512.25014
Pdf URL: https://arxiv.org/pdf/2512.25014
Copy Paste: [[2512.25014]] Diffusion Language Models are Provably Optimal Parallel Samplers(https://arxiv.org/abs/2512.25014)
Keywords: diffusion
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more expressive than those without. Our results not only provide a theoretical justification for the promise of DLMs as the most efficient parallel sampler, but also advocate for enabling revision in DLMs.

Title: Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander C. Li, Ananya Kumar, Deepak Pathak
Subjects: cs.LG, cs.AI, cs.CV, cs.NE
Abstract URL: https://arxiv.org/abs/2512.25034
Pdf URL: https://arxiv.org/pdf/2512.25034
Copy Paste: [[2512.25034]] Generative Classifiers Avoid Shortcut Solutions(https://arxiv.org/abs/2512.25034)
Keywords: diffusion, generative
Abstract: Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.

Title: From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Authors: Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.25066
Pdf URL: https://arxiv.org/pdf/2512.25066
Copy Paste: [[2512.25066]] From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing(https://arxiv.org/abs/2512.25066)
Keywords: diffusion
Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

Title: GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Authors: Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.25073
Pdf URL: https://arxiv.org/pdf/2512.25073
Copy Paste: [[2512.25073]] GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction(https://arxiv.org/abs/2512.25073)
Keywords: diffusion
Abstract: Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: this https URL

Title: SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2512.25075
Pdf URL: https://arxiv.org/pdf/2512.25075
Copy Paste: [[2512.25075]] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time(https://arxiv.org/abs/2512.25075)
Keywords: diffusion, generative
Abstract: We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: this https URL Code: this https URL