2025-12-16

Title: Active Inference with Reusable State-Dependent Value Profiles

Authors: Jacob Poschl
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11829
Pdf URL: https://arxiv.org/pdf/2512.11829
Copy Paste: [[2512.11829]] Active Inference with Reusable State-Dependent Value Profiles(https://arxiv.org/abs/2512.11829)
Keywords: generative
Abstract: Adaptive behavior in volatile environments requires agents to switch among value-control regimes across latent contexts, but maintaining separate preferences, policy biases, and action-confidence parameters for every situation is intractable. We introduce value profiles: a small set of reusable bundles of value-related parameters (outcome preferences, policy priors, and policy precision) assigned to hidden states in a generative model. As posterior beliefs over states evolve trial by trial, effective control parameters arise via belief-weighted mixing, enabling state-conditional strategy recruitment without requiring independent parameters for each context. We evaluate this framework in probabilistic reversal learning, comparing static-precision, entropy-coupled dynamic-precision, and profile-based models using cross-validated log-likelihood and information criteria. Model comparison favors the profile-based model over simpler alternatives (about 100-point AIC differences), and parameter-recovery analyses support structural identifiability even when context must be inferred from noisy observations. Model-based inference further suggests that adaptive control in this task is driven primarily by modulation of policy priors rather than policy precision, with gradual belief-dependent profile recruitment consistent with state-conditional (not purely uncertainty-driven) control. Overall, reusable value profiles provide a tractable computational account of belief-conditioned value control in volatile environments and yield testable signatures of belief-dependent control and behavioral flexibility.

Title: On the Design of One-step Diffusion via Shortcutting Flow Paths

Authors: Haitao Lin, Peiyan Hu, Minsi Ren, Zhifeng Gao, Zhi-Ming Ma, Guolin ke, Tailin Wu, Stan Z. Li
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.11831
Pdf URL: https://arxiv.org/pdf/2512.11831
Copy Paste: [[2512.11831]] On the Design of One-step Diffusion via Shortcutting Flow Paths(https://arxiv.org/abs/2512.11831)
Keywords: diffusion
Abstract: Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (a.k.a. shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.

Title: Adaptive Path Integral Diffusion: AdaPID

Authors: Michael Chertkov, Hamidreza Behjoo (University of Arizona)
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11858
Pdf URL: https://arxiv.org/pdf/2512.11858
Copy Paste: [[2512.11858]] Adaptive Path Integral Diffusion: AdaPID(https://arxiv.org/abs/2512.11858)
Keywords: diffusion
Abstract: Diffusion-based samplers -- Score Based Diffusions, Bridge Diffusions and Path Integral Diffusions -- match a target at terminal time, but the real leverage comes from choosing the schedule that governs the intermediate-time dynamics. We develop a path-wise schedule -- selection gramework for Harmonic PID with a time-varying stiffness, exploiting Piece-Wise-Constant(PWC) parametrizations and a simple hierarchical refinement. We introduce schedule-sensitive Quality-of-Sampling (QoS) diagnostics. Assuming a Gaussian-Mixture (GM) target, we retain closed-form Green functions' ration and numerically stable, Neural-Network free oracles for predicted-state maps and score. Experiments in 2D show that QoS driven PWC schedules consistently improve early-exit fidelity, tail accuracy, conditioning of the dynamics, and speciation (label-selection) timing at fixed integration budgets.

Title: Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion

Authors: Michael Chertkov (University of Arizona)
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11859
Pdf URL: https://arxiv.org/pdf/2512.11859
Copy Paste: [[2512.11859]] Generative Stochastic Optimal Transport: Guided Harmonic Path-Integral Diffusion(https://arxiv.org/abs/2512.11859)
Keywords: diffusion, generative
Abstract: We introduce Guided Harmonic Path-Integral Diffusion (GH-PID), a linearly-solvable framework for guided Stochastic Optimal Transport (SOT) with a hard terminal distribution and soft, application-driven path costs. A low-dimensional guidance protocol shapes the trajectory ensemble while preserving analytic structure: the forward and backward Kolmogorov equations remain linear, the optimal score admits an explicit Green-function ratio, and Gaussian-Mixture Model (GMM) terminal laws yield closed-form expressions. This enables stable sampling and differentiable protocol learning under exact terminal matching. We develop guidance-centric diagnostics -- path cost, centerline adherence, variance flow, and drift effort -- that make GH-PID an interpretable variational ansatz for empirical SOT. Three navigation scenarios illustrated in 2D: (i) Case A: hand-crafted protocols revealing how geometry and stiffness shape lag, curvature effects, and mode evolution; (ii) Case B: single-task protocol learning, where a PWC centerline is optimized to minimize integrated cost; (iii) Case C: multi-expert fusion, in which a commander reconciles competing expert/teacher trajectories and terminal beliefs through an exact product-of-experts law and learns a consensus protocol. Across all settings, GH-PID generates geometry-aware, trust-aware trajectories that satisfy the prescribed terminal distribution while systematically reducing integrated cost.

Title: An Operator-Consistent Graph Neural Network for Learning Diffusion Dynamics on Irregular Meshes

Authors: Yuelian Li, Andrew Rushing Hands
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11860
Pdf URL: https://arxiv.org/pdf/2512.11860
Copy Paste: [[2512.11860]] An Operator-Consistent Graph Neural Network for Learning Diffusion Dynamics on Irregular Meshes(https://arxiv.org/abs/2512.11860)
Keywords: diffusion
Abstract: Classical numerical methods solve partial differential equations (PDEs) efficiently on regular meshes, but many of them become unstable on irregular domains. In practice, multiphysics interactions such as diffusion, damage, and healing often take place on irregular meshes. We develop an operator-consistent graph neural network (OCGNN-PINN) that approximates PDE evolution under physics-informed constraints. It couples node-edge message passing with a consistency loss enforcing the gradient-divergence relation through the graph incidence matrix, ensuring that discrete node and edge dynamics remain structurally coupled during temporal rollout. We evaluate the model on diffusion processes over physically driven evolving meshes and real-world scanned surfaces. The results show improved temporal stability and prediction accuracy compared with graph convolutional and multilayer perceptron baselines, approaching the performance of Crank-Nicolson solvers on unstructured domains.

Title: Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL

Authors: Jiahao You, Ziye Jia, Can Cui, Chao Dong, Qihui Wu, Zhu Han
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11862
Pdf URL: https://arxiv.org/pdf/2512.11862
Copy Paste: [[2512.11862]] Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL(https://arxiv.org/abs/2512.11862)
Keywords: diffusion, generative
Abstract: The low-altitude intelligent networks (LAINs) emerge as a promising architecture for delivering low-latency and energy-efficient edge intelligence in dynamic and infrastructure-limited environments. By integrating unmanned aerial vehicles (UAVs), aerial base stations, and terrestrial base stations, LAINs can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing. However, these systems face key challenges, including energy-constrained UAVs, stochastic task arrivals, and heterogeneous computing resources. To address these issues, we propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimizes UAV trajectory planning and task offloading decisions. The problem is challenging to solve due to temporal coupling among decision variables. Therefore, we design a hierarchical learning framework with two timescales. At the large timescale, a Vickrey-Clarke-Groves auction mechanism enables the energy-aware and incentive-compatible trajectory assignment. At the small timescale, we propose the diffusion-heterogeneous-agent proximal policy optimization, a generative multi-agent reinforcement learning algorithm that embeds latent diffusion models into actor networks. Each UAV samples actions from a Gaussian prior and refines them via observation-conditioned denoising, enhancing adaptability and policy diversity. Extensive simulations show that our framework outperforms baselines in energy efficiency, task success rate, and convergence performance.

Title: On the Dangers of Bootstrapping Generation for Continual Learning and Beyond

Authors: Daniil Zverev, A. Sophia Koepke, Joao F. Henriques
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11867
Pdf URL: https://arxiv.org/pdf/2512.11867
Copy Paste: [[2512.11867]] On the Dangers of Bootstrapping Generation for Continual Learning and Beyond(https://arxiv.org/abs/2512.11867)
Keywords: generative
Abstract: The use of synthetically generated data for training models is becoming a common practice. While generated data can augment the training data, repeated training on synthetic data raises concerns about distribution drift and degradation of performance due to contamination of the dataset. We investigate the consequences of this bootstrapping process through the lens of continual learning, drawing a connection to Generative Experience Replay (GER) methods. We present a statistical analysis showing that synthetic data introduces significant bias and variance into training objectives, weakening the reliability of maximum likelihood estimation. We provide empirical evidence showing that popular generative models collapse under repeated training with synthetic data. We quantify this degradation and show that state-of-the-art GER methods fail to maintain alignment in the latent space. Our findings raise critical concerns about the use of synthetic data in continual learning.

Title: Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors

Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee, Nikolaos D. Tselikas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11884
Pdf URL: https://arxiv.org/pdf/2512.11884
Copy Paste: [[2512.11884]] Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors(https://arxiv.org/abs/2512.11884)
Keywords: foundation model
Abstract: Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task-specific fine-tuning and generalist foundation models capable of zero-shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine-tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows IoU choices can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero-shot mode. However, YOLO exhibits steep degradation 48-50 points across IoU ranges whereas SAM3 drops only 4 points, revealing 12 times superior boundary stability of SAM3. This highlights the strength of SAMv3 in mask precision versus specialization in detection completeness of YOLO11. We provide open-source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine-tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as this https URL

Title: MONET -- Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion

Authors: Alexander Peysakhovich, William Berman, Joseph Rufo, Felix Wong, Maxwell Z. Wilson
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11928
Pdf URL: https://arxiv.org/pdf/2512.11928
Copy Paste: [[2512.11928]] MONET -- Virtual Cell Painting of Brightfield Images and Time Lapses Using Reference Consistent Diffusion(https://arxiv.org/abs/2512.11928)
Keywords: diffusion, in-context
Abstract: Cell painting is a popular technique for creating human-interpretable, high-contrast images of cell morphology. There are two major issues with cell paint: (1) it is labor-intensive and (2) it requires chemical fixation, making the study of cell dynamics impossible. We train a diffusion model (Morphological Observation Neural Enhancement Tool, or MONET) on a large dataset to predict cell paint channels from brightfield images. We show that model quality improves with scale. The model uses a consistency architecture to generate time-lapse videos, despite the impossibility of obtaining cell paint video training data. In addition, we show that this architecture enables a form of in-context learning, allowing the model to partially transfer to out-of-distribution cell lines and imaging protocols. Virtual cell painting is not intended to replace physical cell painting completely, but to act as a complementary tool enabling novel workflows in biological research.

Title: Learning to Extract Context for Context-Aware LLM Inference

Authors: Minseon Kim, Lucas Caccia, Zhengyan Shi, Matheus Pereira, Marc-Alexandre Côté, Xingdi Yuan, Alessandro Sordoni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11986
Pdf URL: https://arxiv.org/pdf/2512.11986
Copy Paste: [[2512.11986]] Learning to Extract Context for Context-Aware LLM Inference(https://arxiv.org/abs/2512.11986)
Keywords: foundation model
Abstract: User prompts to large language models (LLMs) are often ambiguous or under-specified, and subtle contextual cues shaped by user intentions, prior knowledge, and risk factors strongly influence what constitutes an appropriate response. Misinterpreting intent or risks may lead to unsafe outputs, while overly cautious interpretations can cause unnecessary refusal of benign requests. In this paper, we question the conventional framework in which LLMs generate immediate responses to requests without considering broader contextual factors. User requests are situated within broader contexts such as intentions, knowledge, and prior experience, which strongly influence what constitutes an appropriate answer. We propose a framework that extracts and leverages such contextual information from the user prompt itself. Specifically, a reinforcement learning based context generator, designed in an autoencoder-like fashion, is trained to infer contextual signals grounded in the prompt and use them to guide response generation. This approach is particularly important for safety tasks, where ambiguous requests may bypass safeguards while benign but confusing requests can trigger unnecessary refusals. Experiments show that our method reduces harmful responses by an average of 5.6% on the SafetyInstruct dataset across multiple foundation models and improves the harmonic mean of attack success rate and compliance on benign prompts by 6.2% on XSTest and WildJailbreak. These results demonstrate the effectiveness of context extraction for safer and more reliable LLM inferences.

Title: CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Authors: Xianghui Xie, Bowen Wen, Yan Chang, Hesam Rabeti, Jiefeng Li, Ye Yuan, Gerard Pons-Moll, Stan Birchfield
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11988
Pdf URL: https://arxiv.org/pdf/2512.11988
Copy Paste: [[2512.11988]] CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction(https://arxiv.org/abs/2512.11988)
Keywords: foundation model
Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on in-distribution dataset and 36% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

Title: CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

Authors: Tejas Panambur, Ishan Rajendrakumar Dave, Chongjian Ge, Ersin Yumer, Xue Bai
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2512.12060
Pdf URL: https://arxiv.org/pdf/2512.12060
Copy Paste: [[2512.12060]] CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos(https://arxiv.org/abs/2512.12060)
Keywords: diffusion, generative
Abstract: Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: this https URL.

Title: Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Authors: Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang
Subjects: cs.CR, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12069
Pdf URL: https://arxiv.org/pdf/2512.12069
Copy Paste: [[2512.12069]] Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring(https://arxiv.org/abs/2512.12069)
Keywords: anomaly
Abstract: Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github this https URL.

Title: BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

Authors: Ryan Po, Eric Ryan Chan, Changan Chen, Gordon Wetzstein
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12080
Pdf URL: https://arxiv.org/pdf/2512.12080
Copy Paste: [[2512.12080]] BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models(https://arxiv.org/abs/2512.12080)
Keywords: diffusion, self-supervised
Abstract: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.

Title: RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer

Authors: Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12083
Pdf URL: https://arxiv.org/pdf/2512.12083
Copy Paste: [[2512.12083]] RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer(https://arxiv.org/abs/2512.12083)
Keywords: diffusion, foundation model
Abstract: The superior representation capability of pre-trained vision foundation models (VFMs) has been harnessed for enhancing latent diffusion models (LDMs). These approaches inject the rich semantics from high-dimensional VFM representations (e.g., DINOv3) into LDMs at different phases, resulting in accelerated learning and better generation performance. However, the high-dimensionality of VFM representations may also lead to Information Overload, particularly when the VFM features exceed the size of the original image for decoding. To address this issue while preserving the utility of VFM features, we propose RePack (Representation Packing), a simple yet effective framework for improving Diffusion Transformers (DiTs). RePack transforms the VFM representation into a more compact, decoder-friendly representation by projecting onto low-dimensional manifolds. We find that RePack can effectively filter out non-semantic noise while preserving the core structural information needed for high-fidelity reconstruction. Experimental results show that RePack significantly accelerates DiT convergence and outperforms recent methods that directly inject raw VFM features into the decoder for image reconstruction. On DiT-XL/2, RePack achieves an FID of 3.66 in only 64 epochs, which is 35% faster than the state-of-the-art method. This demonstrates that RePack successfully extracts the core semantics of VFM representations while bypassing their high-dimensionality side effects.

Title: CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation

Authors: Xin Yang, Omid Ardakanian
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.12086
Pdf URL: https://arxiv.org/pdf/2512.12086
Copy Paste: [[2512.12086]] CLOAK: Contrastive Guidance for Latent Diffusion-Based Data Obfuscation(https://arxiv.org/abs/2512.12086)
Keywords: diffusion, generative
Abstract: Data obfuscation is a promising technique for mitigating attribute inference attacks by semi-trusted parties with access to time-series data emitted by sensors. Recent advances leverage conditional generative models together with adversarial training or mutual information-based regularization to balance data privacy and utility. However, these methods often require modifying the downstream task, struggle to achieve a satisfactory privacy-utility trade-off, or are computationally intensive, making them impractical for deployment on resource-constrained mobile IoT devices. We propose Cloak, a novel data obfuscation framework based on latent diffusion models. In contrast to prior work, we employ contrastive learning to extract disentangled representations, which guide the latent diffusion process to retain useful information while concealing private information. This approach enables users with diverse privacy needs to navigate the privacy-utility trade-off with minimal retraining. Extensive experiments on four public time-series datasets, spanning multiple sensing modalities, and a dataset of facial images demonstrate that Cloak consistently outperforms state-of-the-art obfuscation techniques and is well-suited for deployment in resource-constrained settings.

Title: SPDMark: Selective Parameter Displacement for Robust Video Watermarking

Authors: Samar Fares, Nurbek Tastan, Karthik Nandakumar
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12090
Pdf URL: https://arxiv.org/pdf/2512.12090
Copy Paste: [[2512.12090]] SPDMark: Selective Parameter Displacement for Robust Video Watermarking(https://arxiv.org/abs/2512.12090)
Keywords: diffusion, generative
Abstract: The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called SPDMark (pronounced `SpeedMark') based on selective parameter displacement of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.

Title: EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

Authors: Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong, Tiziano Passerini, Dorin Comaniciu, Puneet Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12107
Pdf URL: https://arxiv.org/pdf/2512.12107
Copy Paste: [[2512.12107]] EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography(https://arxiv.org/abs/2512.12107)
Keywords: foundation model
Abstract: Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.

Title: High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees

Authors: Elynn Chen, Yuefeng Han, Jiayu Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12122
Pdf URL: https://arxiv.org/pdf/2512.12122
Copy Paste: [[2512.12122]] High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical Guarantees(https://arxiv.org/abs/2512.12122)
Keywords: generative
Abstract: High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (\textsc{rc-PCA}) initialization, that is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes.

Title: BaRISTA: Brain Scale Informed Spatiotemporal Representation of Human Intracranial Neural Activity

Authors: Lucine L. Oganesian, Saba Hashemi, Maryam M. Shanechi
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.12135
Pdf URL: https://arxiv.org/pdf/2512.12135
Copy Paste: [[2512.12135]] BaRISTA: Brain Scale Informed Spatiotemporal Representation of Human Intracranial Neural Activity(https://arxiv.org/abs/2512.12135)
Keywords: self-supervised, foundation model
Abstract: Intracranial recordings have opened a unique opportunity to simultaneously measure activity across multiregional networks in the human brain. Recent works have focused on developing transformer-based neurofoundation models of such recordings that can generalize across subjects and datasets. However, these recordings exhibit highly complex spatiotemporal interactions across diverse spatial scales, from the single-channel scale to the scale of brain regions. As such, there remain critical open questions regarding how best to encode spatial information and how to design self-supervision tasks that enable the learning of brain network patterns and enhance downstream decoding performance using such high-dimensional, multiregional recordings. To allow for exploring these questions, we propose a new spatiotemporal transformer model of multiregional neural activity and a corresponding self-supervised masked latent reconstruction task, designed to enable flexibility in the spatial scale used for token encoding and masking. Applying this model on publicly available multiregional intracranial electrophysiology (iEEG) data, we demonstrate that adjusting the spatial scale for both token encoding and masked reconstruction significantly impacts downstream decoding. Further, we find that spatial encoding at larger scales than channel-level encoding, which is commonly used in existing iEEG transformer models, improves downstream decoding performance. Finally, we demonstrate that our method allows for region-level token encoding while also maintaining accurate channel-level neural reconstruction. Taken together, our modeling framework enables exploration of the spatial scales used for token encoding and masking, reveals their importance towards self-supervised pretraining of neurofoundation models of multiregional human brain activity, and enhances downstream decoding performance.

Title: Diffusion Language Model Inference with Monte Carlo Tree Search

Authors: Zheng Huang, Kiran Ramnath, Yueyan Chen, Aosong Feng, Sangmin Woo, Balasubramaniam Srinivasan, Zhichao Xu, Kang Zhou, Shuai Wang, Haibo Ding, Lin Lee Cheong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12168
Pdf URL: https://arxiv.org/pdf/2512.12168
Copy Paste: [[2512.12168]] Diffusion Language Model Inference with Monte Carlo Tree Search(https://arxiv.org/abs/2512.12168)
Keywords: diffusion
Abstract: Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.

Title: HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone

Authors: Yihan Wang, Annan Yu, Lujun Zhang, Charuleka Varadharajan, N. Benjamin Erichson
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.12183
Pdf URL: https://arxiv.org/pdf/2512.12183
Copy Paste: [[2512.12183]] HydroDiffusion: Diffusion-Based Probabilistic Streamflow Forecasting with a State Space Backbone(https://arxiv.org/abs/2512.12183)
Keywords: diffusion, generative
Abstract: Recent advances have introduced diffusion models for probabilistic streamflow forecasting, demonstrating strong early flood-warning skill. However, current implementations rely on recurrent Long Short-Term Memory (LSTM) backbones and single-step training objectives, which limit their ability to capture long-range dependencies and produce coherent forecast trajectories across lead times. To address these limitations, we developed HydroDiffusion, a diffusion-based probabilistic forecasting framework with a decoder-only state space model backbone. The proposed framework jointly denoises full multi-day trajectories in a single pass, ensuring temporal coherence and mitigating error accumulation common in autoregressive prediction. HydroDiffusion is evaluated across 531 watersheds in the contiguous United States (CONUS) in the CAMELS dataset. We benchmark HydroDiffusion against two diffusion baselines with LSTM backbones, as well as the recently proposed Diffusion-based Runoff Model (DRUM). Results show that HydroDiffusion achieves strong nowcast accuracy when driven by observed meteorological forcings, and maintains consistent performance across the full simulation horizon. Moreover, HydroDiffusion delivers stronger deterministic and probabilistic forecast skill than DRUM in operational forecasting. These results establish HydroDiffusion as a robust generative modeling framework for medium-range streamflow forecasting, providing both a new modeling benchmark and a foundation for future research on probabilistic hydrologic prediction at continental scales.

Title: SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation

Authors: Xuancheng Xu, Yaning Li, Sisi You, Bing-Kun Bao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12193
Pdf URL: https://arxiv.org/pdf/2512.12193
Copy Paste: [[2512.12193]] SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation(https://arxiv.org/abs/2512.12193)
Keywords: self-supervised
Abstract: Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.

Title: MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching

Authors: Jirui Jin, Cheng Zeng, Pawan Prakash, Ellad B. Tadmor, Adrian Roitberg, Richard G. Hennig, Stefano Martiniani, Mingjie Liu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.12198
Pdf URL: https://arxiv.org/pdf/2512.12198
Copy Paste: [[2512.12198]] MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching(https://arxiv.org/abs/2512.12198)
Keywords: generative
Abstract: Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, promoting structural diversity, and enabling efficient sampling for discovery. Recent advances in computer vision introduced a range of new guidance strategies for generative models, many of which can be adapted to support these goals. In this work, we integrate state-of-the-art guidance methods -- including classifier-free guidance, autoguidance, and model guidance -- in a leading molecule generation framework built on an SE(3)-equivariant flow matching process. We propose a hybrid guidance strategy that separately guides continuous and discrete molecular modalities -- operating on velocity fields and predicted logits, respectively -- while jointly optimizing their guidance scales via Bayesian optimization. Our implementation, benchmarked on the QM9 and QMe14S datasets, achieves new state-of-the-art performance in property alignment for de novo molecular generation. The generated molecules also exhibit high structural validity. Furthermore, we systematically compare the strengths and limitations of various guidance methods, offering insights into their broader applicability.

Title: A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection

Authors: Peizheng Li, Ioannis Mavromatis, Ajith Sahadevan, Tim Farnham, Adnan Aijaz, Aftab Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12205
Pdf URL: https://arxiv.org/pdf/2512.12205
Copy Paste: [[2512.12205]] A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection(https://arxiv.org/abs/2512.12205)
Keywords: self-supervised, anomaly
Abstract: We present a large-scale, longitudinal visual dataset of urban streetlights captured by 22 fixed-angle cameras deployed across Bristol, U.K., from 2021 to 2025. The dataset contains over 526,000 images, collected hourly under diverse lighting, weather, and seasonal conditions. Each image is accompanied by rich metadata, including timestamps, GPS coordinates, and device identifiers. This unique real-world dataset enables detailed investigation of visual drift, anomaly detection, and MLOps strategies in smart city deployments. To promtoe seconardary analysis, we additionally provide a self-supervised framework based on convolutional variational autoencoders (CNN-VAEs). Models are trained separately for each camera node and for day/night image sets. We define two per-sample drift metrics: relative centroid drift, capturing latent space deviation from a baseline quarter, and relative reconstruction error, measuring normalized image-domain degradation. This dataset provides a realistic, fine-grained benchmark for evaluating long-term model stability, drift-aware learning, and deployment-ready vision systems. The images and structured metadata are publicly released in JPEG and CSV formats, supporting reproducibility and downstream applications such as streetlight monitoring, weather inference, and urban scene understanding. The dataset can be found at this https URL and this https URL.

Title: EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training

Authors: Yuting Tang, Weibang Jiang, Shanglin Li, Yong Li, Chenyu Liu, Xinliang Zhou, Yi Ding, Cuntai Guan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12210
Pdf URL: https://arxiv.org/pdf/2512.12210
Copy Paste: [[2512.12210]] EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training(https://arxiv.org/abs/2512.12210)
Keywords: self-supervised, foundation model
Abstract: Large-scale EEG foundation models have shown strong generalization across a range of downstream tasks, but their training remains resource-intensive due to the volume and variable quality of EEG data. In this work, we introduce EEG-DLite, a data distillation framework that enables more efficient pre-training by selectively removing noisy and redundant samples from large EEG datasets. EEG-DLite begins by encoding EEG segments into compact latent representations using a self-supervised autoencoder, allowing sample selection to be performed efficiently and with reduced sensitivity to noise. Based on these representations, EEG-DLite filters out outliers and minimizes redundancy, resulting in a smaller yet informative subset that retains the diversity essential for effective foundation model training. Through extensive experiments, we demonstrate that training on only 5 percent of a 2,500-hour dataset curated with EEG-DLite yields performance comparable to, and in some cases better than, training on the full dataset across multiple downstream tasks. To our knowledge, this is the first systematic study of pre-training data distillation in the context of EEG foundation models. EEG-DLite provides a scalable and practical path toward more effective and efficient physiological foundation modeling. The code is available at this https URL.

Title: Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

Authors: Tianyu Zhang, Dong Liu, Chang Wen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12229
Pdf URL: https://arxiv.org/pdf/2512.12229
Copy Paste: [[2512.12229]] Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder(https://arxiv.org/abs/2512.12229)
Keywords: diffusion, generative
Abstract: Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.

Title: Semantic Distance Measurement based on Multi-Kernel Gaussian Processes

Authors: Yinzhu Cheng, Haihua Xie, Yaqing Wang, Miao He, Mingming Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12238
Pdf URL: https://arxiv.org/pdf/2512.12238
Copy Paste: [[2512.12238]] Semantic Distance Measurement based on Multi-Kernel Gaussian Processes(https://arxiv.org/abs/2512.12238)
Keywords: in-context
Abstract: Semantic distance measurement is a fundamental problem in computational linguistics, providing a quantitative characterization of similarity or relatedness between text segments, and underpinning tasks such as text retrieval and text classification. From a mathematical perspective, a semantic distance can be viewed as a metric defined on a space of texts or on a representation space derived from them. However, most classical semantic distance methods are essentially fixed, making them difficult to adapt to specific data distributions and task requirements. In this paper, a semantic distance measure based on multi-kernel Gaussian processes (MK-GP) was proposed. The latent semantic function associated with texts was modeled as a Gaussian process, with its covariance function given by a combined kernel combining Matérn and polynomial components. The kernel parameters were learned automatically from data under supervision, rather than being hand-crafted. This semantic distance was instantiated and evaluated in the context of fine-grained sentiment classification with large language models under an in-context learning (ICL) setup. The experimental results demonstrated the effectiveness of the proposed measure.

Title: Moment and Highlight Detection via MLLM Frame Segmentation

Authors: I Putu Andika Bagas Jiwanta, Ayu Purwarianti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12246
Pdf URL: https://arxiv.org/pdf/2512.12246
Copy Paste: [[2512.12246]] Moment and Highlight Detection via MLLM Frame Segmentation(https://arxiv.org/abs/2512.12246)
Keywords: generative
Abstract: Detecting video moments and highlights from natural-language queries have been unified by transformer-based methods. Other works use generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that enforces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.

Title: MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models

Authors: Yuqing Lei, Yingjun Du, Yawen Huang, Xiantong Zhen, Ling Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12268
Pdf URL: https://arxiv.org/pdf/2512.12268
Copy Paste: [[2512.12268]] MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models(https://arxiv.org/abs/2512.12268)
Keywords: self-supervised
Abstract: Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.

Title: MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

Authors: Benjamin Beilharz, Thomas S. A. Wallis
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.12307
Pdf URL: https://arxiv.org/pdf/2512.12307
Copy Paste: [[2512.12307]] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding(https://arxiv.org/abs/2512.12307)
Keywords: generative
Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.

Title: Unified Control for Inference-Time Guidance of Denoising Diffusion Models

Authors: Maurya Goyal, Anuj Singh, Hadi Jamali-Rad
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12339
Pdf URL: https://arxiv.org/pdf/2512.12339
Copy Paste: [[2512.12339]] Unified Control for Inference-Time Guidance of Denoising Diffusion Models(https://arxiv.org/abs/2512.12339)
Keywords: diffusion
Abstract: Aligning diffusion model outputs with downstream objectives is essential for improving task-specific performance. Broadly, inference-time training-free approaches for aligning diffusion models can be categorized into two main strategies: sampling-based methods, which explore multiple candidate outputs and select those with higher reward signals, and gradient-guided methods, which use differentiable reward approximations to directly steer the generation process. In this work, we propose a universal algorithm, UniCoDe, which brings together the strengths of sampling and gradient-based guidance into a unified framework. UniCoDe integrates local gradient signals during sampling, thereby addressing the sampling inefficiency inherent in complex reward-based sampling approaches. By cohesively combining these two paradigms, UniCoDe enables more efficient sampling while offering better trade-offs between reward alignment and divergence from the diffusion unconditional prior. Empirical results demonstrate that UniCoDe remains competitive with state-of-the-art baselines across a range of tasks. The code is available at this https URL

Title: STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

Authors: Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12372
Pdf URL: https://arxiv.org/pdf/2512.12372
Copy Paste: [[2512.12372]] STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative(https://arxiv.org/abs/2512.12372)
Keywords: generative
Abstract: While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.

Title: V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping

Authors: Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim, Sangwon Jung, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12375
Pdf URL: https://arxiv.org/pdf/2512.12375
Copy Paste: [[2512.12375]] V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping(https://arxiv.org/abs/2512.12375)
Keywords: diffusion
Abstract: Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation. (2) A inference-time fine appearance injection stage refines visual fidelity by computing semantic correspondences from RoPE-free mid-layer query--key features. These correspondences guide the warping of appearance-rich value representations into semantically aligned regions of the generation process, with masking ensuring spatial reliability. V-Warper significantly improves appearance fidelity while preserving prompt alignment and motion dynamics, and it achieves these gains efficiently without large-scale video finetuning.

Title: The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining

Authors: Jesse Ponnock
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.12384
Pdf URL: https://arxiv.org/pdf/2512.12384
Copy Paste: [[2512.12384]] The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining(https://arxiv.org/abs/2512.12384)
Keywords: foundation model
Abstract: Domain-adaptive pretraining (DAPT) offers a practical path to specializing large language models for high-value domains without full retraining. We conduct an early-stage scaling-law analysis of continued pretraining on U.S. SEC filings, training 1B and 3B-parameter Llama-3.2 models on a 400M-token financial corpus with validation checkpoints at 50M, 100M, 200M, and 400M tokens. Results show consistent improvements in SEC-domain validation loss for both models, with the largest gains occurring within the first 200M tokens and diminishing returns thereafter. Power-law fits reveal shallow exponents, indicating that financial language is highly regular and efficiently learnable under continued pretraining. General-domain validation loss remains effectively unchanged across all token budgets, suggesting minimal drift and no signs of catastrophic forgetting. A data-efficiency frontier further shows that both models move toward improved specialization with negligible mixed-domain degradation. Together, these findings provide early empirical guidance for scaling financial foundation models, suggesting that meaningful domain adaptation can be achieved with comparatively modest token budgets and that larger model scales (7B-70B) remain tractable under projected data requirements.

Title: Speedrunning ImageNet Diffusion

Authors: Swayam Bhanded
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12386
Pdf URL: https://arxiv.org/pdf/2512.12386
Copy Paste: [[2512.12386]] Speedrunning ImageNet Diffusion(https://arxiv.org/abs/2512.12386)
Keywords: diffusion
Abstract: Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance - comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.

Title: Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment

Authors: Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12387
Pdf URL: https://arxiv.org/pdf/2512.12387
Copy Paste: [[2512.12387]] Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment(https://arxiv.org/abs/2512.12387)
Keywords: generative
Abstract: Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for the flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases from early structure formation to late-stage tuning. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to the optimization stagnation when reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking. Project webpage: this https URL.

Title: ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States

Authors: Haowen Wang, Xiaoping Yuan, Fugang Zhang, Rui Jian, Yuanwei Zhu, Xiuquan Qiao, Yakun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12395
Pdf URL: https://arxiv.org/pdf/2512.12395
Copy Paste: [[2512.12395]] ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States(https://arxiv.org/abs/2512.12395)
Keywords: diffusion, generative
Abstract: Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.

Title: DeepVekua: Geometric-Spectral Representation Learning for Physics-Informed Fields

Authors: Vladimer Khasia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12402
Pdf URL: https://arxiv.org/pdf/2512.12402
Copy Paste: [[2512.12402]] DeepVekua: Geometric-Spectral Representation Learning for Physics-Informed Fields(https://arxiv.org/abs/2512.12402)
Keywords: diffusion
Abstract: We present DeepVekua, a hybrid architecture that unifies geometric deep learning with spectral analysis to solve partial differential equations (PDEs) in sparse data regimes. By learning a diffeomorphic coordinate transformation that maps complex geometries to a latent harmonic space, our method outperforms state-of-the-art implicit representations on advection-diffusion systems. Unlike standard coordinate-based networks which struggle with spectral bias, DeepVekua separates the learning of geometry from the learning of physics, solving for optimal spectral weights in closed form. We demonstrate a 100x improvement over spectral baselines. The code is available at this https URL.

Title: Can Graphs Improve Tabular Foundation Models?

Authors: Franck Le, Keith Grueneberg, Erich Nahum, Vadim Sheinin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12405
Pdf URL: https://arxiv.org/pdf/2512.12405
Copy Paste: [[2512.12405]] Can Graphs Improve Tabular Foundation Models?(https://arxiv.org/abs/2512.12405)
Keywords: foundation model, in-context
Abstract: Tabular data are central to many real-world systems. While recent tabular transformers and in-context learners such as SAINT, TP-BERTa, TabPFN, TabICL, and MITRA incorporate limited inter-row reasoning, most approaches still lack an explicit mechanism to model relationships among instances, even though similar samples often share related outcomes. We investigate whether introducing \emph{simple graph priors} can enhance \emph{pretrained tabular transformers}. Concretely, we introduce {BOLERO}, a lightweight, static bipartite graph head that augments {RoBERTa-Tab} (a RoBERTa-style tabular backbone pretrained with masked-token prediction.) Each instance connects to feature/value anchors; a small GNN refines row representations, while the backbone remains frozen. We evaluate on 80 classification and 64 regression datasets from the TP-BERTa benchmark suites, comparing against strong baselines including XGBoost, CatBoost, TabPFN-v2, MITRA, TabICL, TP-BERTa, and RoBERTa-Tab. To ensure statistically sound conclusions, we follow best practices for multi-dataset evaluation: pairwise Wilcoxon signed-rank tests on per-dataset score differences and effect sizes (median improvement with confidence intervals), rather than mean-rank post-hoc tests that depend on the competitor pool. BOLERO achieves the highest number of statistically significant wins across both classification and regression, demonstrating that lightweight graph priors meaningfully improve pretrained tabular transformers.

Title: BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation

Authors: Hangwei Zhang, Armando Teles Fortes, Tianyi Wei, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12425
Pdf URL: https://arxiv.org/pdf/2512.12425
Copy Paste: [[2512.12425]] BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation(https://arxiv.org/abs/2512.12425)
Keywords: foundation model
Abstract: Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection in incomplete ways. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle on weakly textured, distant and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary supervision-free geometric cue. In Stage-1, a physically guided controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.

Title: Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

Authors: Abdul Matin, Rupasree Dey, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12445
Pdf URL: https://arxiv.org/pdf/2512.12445
Copy Paste: [[2512.12445]] Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction(https://arxiv.org/abs/2512.12445)
Keywords: self-supervised
Abstract: Integrating domain knowledge into deep learning has emerged as a promising direction for improving model interpretability, generalization, and data efficiency. In this work, we present a novel knowledge-guided ViT-based Masked Autoencoder that embeds scientific domain knowledge within the self-supervised reconstruction process. Instead of relying solely on data-driven optimization, our proposed approach incorporates the Linear Spectral Mixing Model (LSMM) as a physical constraint and physically-based Spectral Angle Mapper (SAM), ensuring that learned representations adhere to known structural relationships between observed signals and their latent components. The framework jointly optimizes LSMM and SAM loss with a conventional Huber loss objective, promoting both numerical accuracy and geometric consistency in the feature space. This knowledge-guided design enhances reconstruction fidelity, stabilizes training under limited supervision, and yields interpretable latent representations grounded in physical principles. The experimental findings indicate that the proposed model substantially enhances reconstruction quality and improves downstream task performance, highlighting the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning.

Title: Exploring the Design Space of Transition Matching

Authors: Uriel Singer, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12465
Pdf URL: https://arxiv.org/pdf/2512.12465
Copy Paste: [[2512.12465]] Exploring the Design Space of Transition Matching(https://arxiv.org/abs/2512.12465)
Keywords: diffusion, generative
Abstract: Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples, however it uses a second ``internal'' generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations) we evaluate the affect of the head module architecture and modeling during training as-well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with high frequency sampler provides best ranking across all metrics reaching state-of-the-art among all tested baselines, while Transformer head with sequence scaling and low frequency sampling is a runner up excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide most quality and efficiency gains, while at the same time indicate what design choices are not likely to provide further gains.

Title: The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting

Authors: James Luther, Donald Brown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12488
Pdf URL: https://arxiv.org/pdf/2512.12488
Copy Paste: [[2512.12488]] The American Ghost in the Machine: How language models align culturally and the effects of cultural prompting(https://arxiv.org/abs/2512.12488)
Keywords: generative
Abstract: Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models become an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, or using system prompts to shift the cultural alignment of a model to a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.

Title: Generative Spatiotemporal Data Augmentation

Authors: Jinfan Zhou, Lixin Luo, Sungmin Eum, Heesung Kwon, Jeong Joon Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12508
Pdf URL: https://arxiv.org/pdf/2512.12508
Copy Paste: [[2512.12508]] Generative Spatiotemporal Data Augmentation(https://arxiv.org/abs/2512.12508)
Keywords: diffusion, foundation model, generative
Abstract: We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method leverages off-the-shelf video diffusion models to generate realistic 3D spatial and temporal variations from a given image dataset. Incorporating these synthesized video clips as supplemental training data yields consistent performance gains in low-data settings, such as UAV-captured imagery where annotations are scarce. Beyond empirical improvements, we provide practical guidelines for (i) choosing an appropriate spatiotemporal generative setup, (ii) transferring annotations to synthetic frames, and (iii) addressing disocclusion - regions newly revealed and unlabeled in generated views. Experiments on COCO subsets and UAV-captured datasets show that, when applied judiciously, spatiotemporal augmentation broadens the data distribution along axes underrepresented by traditional and prior generative methods, offering an effective lever for improving model performance in data-scarce regimes.

Title: Animus3D: Text-driven 3D Animation via Motion Score Distillation

Authors: Qi Sun, Can Wang, Jiaxiang Shang, Wensen Feng, Jing Liao
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.12534
Pdf URL: https://arxiv.org/pdf/2512.12534
Copy Paste: [[2512.12534]] Animus3D: Text-driven 3D Animation via Motion Score Distillation(https://arxiv.org/abs/2512.12534)
Keywords: diffusion
Abstract: We present Animus3D, a text-driven 3D animation framework that generates motion field given a static 3D asset and text prompt. Previous methods mostly leverage the vanilla Score Distillation Sampling (SDS) objective to distill motion from pretrained text-to-video diffusion, leading to animations with minimal movement or noticeable jitter. To address this, our approach introduces a novel SDS alternative, Motion Score Distillation (MSD). Specifically, we introduce a LoRA-enhanced video diffusion model that defines a static source distribution rather than pure noise as in SDS, while another inversion-based noise estimation technique ensures appearance preservation when guiding motion. To further improve motion fidelity, we incorporate explicit temporal and spatial regularization terms that mitigate geometric distortions across time and space. Additionally, we propose a motion refinement module to upscale the temporal resolution and enhance fine-grained details, overcoming the fixed-resolution constraints of the underlying video model. Extensive experiments demonstrate that Animus3D successfully animates static 3D assets from diverse text prompts, generating significantly more substantial and detailed motion than state-of-the-art baselines while maintaining high visual integrity. Code will be released at this https URL.

Title: NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

Authors: Agniva Maiti, Manya Pandey, Murari Mandal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.12537
Pdf URL: https://arxiv.org/pdf/2512.12537
Copy Paste: [[2512.12537]] NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data(https://arxiv.org/abs/2512.12537)
Keywords: generative
Abstract: The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81\% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.

Title: Skillful Subseasonal-to-Seasonal Forecasting of Extreme Events with a Multi-Sphere Coupled Probabilistic Model

Authors: Bin Mu, Yuxuan Chen, Shijin Yuan, Bo Qin, Hao Guo
Subjects: cs.LG, cs.AI, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2512.12545
Pdf URL: https://arxiv.org/pdf/2512.12545
Copy Paste: [[2512.12545]] Skillful Subseasonal-to-Seasonal Forecasting of Extreme Events with a Multi-Sphere Coupled Probabilistic Model(https://arxiv.org/abs/2512.12545)
Keywords: diffusion
Abstract: Accurate subseasonal-to-seasonal (S2S) prediction of extreme events is critical for resource planning and disaster mitigation under accelerating climate change. However, such predictions remain challenging due to complex multi-sphere interactions and intrinsic atmospheric uncertainty. Here we present TianXing-S2S, a multi-sphere coupled probabilistic model for global S2S daily ensemble forecast. TianXing-S2S first encodes diverse multi-sphere predictors into a compact latent space, then employs a diffusion model to generate daily ensemble forecasts. A novel coupling module based on optimal transport (OT) is incorporated in the denoiser to optimize the interactions between atmospheric and multi-sphere boundary conditions. Across key atmospheric variables, TianXing-S2S outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF) S2S system and FuXi-S2S in 45-day daily-mean ensemble forecasts at 1.5 resolution. Our model achieves skillful subseasonal prediction of extreme events including heat waves and anomalous precipitation, identifying soil moisture as a critical precursor signal. Furthermore, we demonstrate that TianXing-S2S can generate stable rollout forecasts up to 180 days, establishing a robust framework for S2S research in a warming world.

Title: Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses

Authors: David Strnadel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12581
Pdf URL: https://arxiv.org/pdf/2512.12581
Copy Paste: [[2512.12581]] Differentiable Energy-Based Regularization in GANs: A Simulator-Based Exploration of VQE-Inspired Auxiliary Losses(https://arxiv.org/abs/2512.12581)
Keywords: generative
Abstract: This paper presents an exploratory, simulator-based proof of concept investigating whether differentiable energy terms derived from parameterized quantum circuits can serve as auxiliary regularization signals in Generative Adversarial Networks (GANs). We augment the Auxiliary Classifier GAN (ACGAN) generator objective with a Variational Quantum Eigensolver (VQE)-inspired energy term computed from class-specific Ising Hamiltonians using Qiskit's EstimatorQNN and TorchConnector. Important limitations: All experiments run on a noiseless statevector simulator with only 4 qubits, use a deliberately simple Hamiltonian parameterization, and lack ablation studies comparing against equivalent classical biases. The computational overhead (approximately 200x slower than classical ACGAN) reflects simulator artifacts rather than inherent quantum costs. On MNIST, we observe that the energy-regularized model (termed QACGAN) achieves high classification accuracy (99 to 100 percent) within 5 epochs compared to 87.8 percent for ACGAN, suggesting the auxiliary term influences class conditioning. However, sample quality metrics (FID) show high variance across runs (coefficient of variation approximately 25 percent at epoch 5), with values ranging from 19.92 to 35.96. Extended runs stabilize around FID 23 to 24, comparable to the ACGAN baseline. We explicitly do not claim quantum advantage, improved stability in any general sense, or scalability beyond this toy setting. The contribution is methodological: demonstrating that VQE-style energy computations can be integrated into GAN training loops via differentiable pathways. Whether such auxiliary signals provide benefits beyond equivalent classical regularizers remains an open question requiring systematic ablation studies, which we leave for future work.

Title: Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation

Authors: Karthikeya KV
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12595
Pdf URL: https://arxiv.org/pdf/2512.12595
Copy Paste: [[2512.12595]] Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation(https://arxiv.org/abs/2512.12595)
Keywords: diffusion, generative
Abstract: This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures to tackle challenges in high-resolution image synthesis and multimodal data interpretation. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. A bidirectional tokenization strategy is employed to seamlessly merge inputs from text, image, and video modalities, fostering a unified understanding across diverse data types. By embedding spatial-temporal features and leveraging a hybrid text-image sequence modeling approach, the framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations. The architecture is optimized with a noise-aware learning algorithm, addressing discrepancies in noisy data distributions and improving generative performance under varying input conditions. Rigorous evaluations on benchmark datasets demonstrate a 25% increase in image resolution clarity and a 20% reduction in computational requirements compared to diffusion-based methods. Furthermore, the model exhibits robust scalability and adaptability, showcasing its potential in applications like autonomous systems, creative content generation, and advanced video analysis. This work underscores the role of vision-centric LLMs in redefining capabilities in computer vision and multimodal artificial intelligence.

Title: No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching

Authors: Tingyan Wen, Haoyu Li, Yihuang Chen, Xing Zhou, Lifei Zhu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12604
Pdf URL: https://arxiv.org/pdf/2512.12604
Copy Paste: [[2512.12604]] No Cache Left Idle: Accelerating diffusion model via Extreme-slimming Caching(https://arxiv.org/abs/2512.12604)
Keywords: diffusion, generative
Abstract: Diffusion models achieve remarkable generative quality, but computational overhead scales with step count, model depth, and sequence length. Feature caching is effective since adjacent timesteps yield highly similar features. However, an inherent trade-off remains: aggressive timestep reuse offers large speedups but can easily cross the critical line, hurting fidelity, while block- or token-level reuse is safer but yields limited computational savings. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that, to our knowledge, is the first unified framework to exploit cacheable redundancy across timesteps, structure (blocks), and space (tokens). Rather than simply mixing levels, X-Slim introduces a dual-threshold controller that turns caching into a push-then-polish process: it first pushes reuse at the timestep level up to an early-warning line, then switches to lightweight block- and token-level refresh to polish the remaining redundancy, and triggers full inference once the critical line is crossed to reset accumulated error. At each level, context-aware indicators decide when and where to cache. Across diverse tasks, X-Slim advances the speed-quality frontier. On FLUX.1-dev and HunyuanVideo, it reduces latency by up to 4.97x and 3.52x with minimal perceptual loss. On DiT-XL/2, it reaches 3.13x acceleration and improves FID by 2.42 over prior methods.

Title: InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation

Authors: Sreehari Rajan, Kunal Bhosikar, Charu Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12664
Pdf URL: https://arxiv.org/pdf/2512.12664
Copy Paste: [[2512.12664]] InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation(https://arxiv.org/abs/2512.12664)
Keywords: diffusion
Abstract: Generating realistic human motions that naturally respond to both spoken language and physical objects is crucial for interactive digital experiences. Current methods, however, address speech-driven gestures or object interactions independently, limiting real-world applicability due to a lack of integrated, comprehensive datasets. To overcome this, we introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation. We achieve this by employing a multi-stage training process to learn a unified motion, speech, and prompt embedding space. To support this, we curate a rich human-object interaction dataset, formed by augmenting an existing text-to-motion dataset with detailed object interaction annotations. Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition, which is then dynamically combined during inference. To address the imbalance between heterogeneous conditioning signals, we propose an adaptive fusion strategy, which dynamically reweights the conditioning signals during diffusion sampling. InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis, outperforming gesture-focused diffusion methods, yielding highly realistic, object-aware full-body motions with enhanced realism, flexibility, and control.

Title: DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization

Authors: Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Guoqing Ma, Yidan Liang, Jingjiang Liu, Hao Chen, Shimin Di
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12669
Pdf URL: https://arxiv.org/pdf/2512.12669
Copy Paste: [[2512.12669]] DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization(https://arxiv.org/abs/2512.12669)
Keywords: diffusion, generative
Abstract: Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.

Title: On Approaches to Building Surrogate ODE Models for Diffusion Bridges

Authors: Maria Khilchuk, Vladimir Latypov, Pavel Kleshchev, Alexander Hvatov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12671
Pdf URL: https://arxiv.org/pdf/2512.12671
Copy Paste: [[2512.12671]] On Approaches to Building Surrogate ODE Models for Diffusion Bridges(https://arxiv.org/abs/2512.12671)
Keywords: diffusion, generative
Abstract: Diffusion and Schrödinger Bridge models have established state-of-the-art performance in generative modeling but are often hampered by significant computational costs and complex training procedures. While continuous-time bridges promise faster sampling, overparameterized neural networks describe their optimal dynamics, and the underlying stochastic differential equations can be difficult to integrate efficiently. This work introduces a novel paradigm that uses surrogate models to create simpler, faster, and more flexible approximations of these dynamics. We propose two specific algorithms: SINDy Flow Matching (SINDy-FM), which leverages sparse regression to identify interpretable, symbolic differential equations from data, and a Neural-ODE reformulation of the Schrödinger Bridge (DSBM-NeuralODE) for flexible continuous-time parameterization. Our experiments on Gaussian transport tasks and MNIST latent translation demonstrate that these surrogates achieve competitive performance while offering dramatic improvements in efficiency and interpretability. The symbolic SINDy-FM models, in particular, reduce parameter counts by several orders of magnitude and enable near-instantaneous inference, paving the way for a new class of tractable and high-performing bridge models for practical deployment.

Title: GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Authors: Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Siyi Peng, Bailan Feng, Xiang Bai, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12751
Pdf URL: https://arxiv.org/pdf/2512.12751
Copy Paste: [[2512.12751]] GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation(https://arxiv.org/abs/2512.12751)
Keywords: diffusion
Abstract: Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.

Title: Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior

Authors: Hao Wang, Ashish Bastola, Chaoyi Zhou, Wenhui Zhu, Xiwen Chen, Xuanzhao Dong, Siyu Huang, Abolfazl Razi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12774
Pdf URL: https://arxiv.org/pdf/2512.12774
Copy Paste: [[2512.12774]] Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior(https://arxiv.org/abs/2512.12774)
Keywords: generative
Abstract: As generative models become increasingly capable of producing high-fidelity visual content, the demand for efficient, interpretable, and editable image representations has grown substantially. Recent advances in 2D Gaussian Splatting (2DGS) have emerged as a promising solution, offering explicit control, high interpretability, and real-time rendering capabilities (>1000 FPS). However, high-quality 2DGS typically requires post-optimization. Existing methods adopt random or heuristics (e.g., gradient maps), which are often insensitive to image complexity and lead to slow convergence (>10s). More recent approaches introduce learnable networks to predict initial Gaussian configurations, but at the cost of increased computational and architectural complexity. To bridge this gap, we present Fast-2DGS, a lightweight framework for efficient Gaussian image representation. Specifically, we introduce Deep Gaussian Prior, implemented as a conditional network to capture the spatial distribution of Gaussian primitives under different complexities. In addition, we propose an attribute regression network to predict dense Gaussian properties. Experiments demonstrate that this disentangled architecture achieves high-quality reconstruction in a single forward pass, followed by minimal fine-tuning. More importantly, our approach significantly reduces computational cost without compromising visual quality, bringing 2DGS closer to industry-ready deployment.

Title: Learning Common and Salient Generative Factors Between Two Image Datasets

Authors: Yunlong He, Gwilherm Lesné, Ziqian Liu, Michaël Soumm, Pietro Gori
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12800
Pdf URL: https://arxiv.org/pdf/2512.12800
Copy Paste: [[2512.12800]] Learning Common and Salient Generative Factors Between Two Image Datasets(https://arxiv.org/abs/2512.12800)
Keywords: diffusion, generative
Abstract: Recent advancements in image synthesis have enabled high-quality image generation and manipulation. Most works focus on: 1) conditional manipulation, where an image is modified conditioned on a given attribute, or 2) disentangled representation learning, where each latent direction should represent a distinct semantic attribute. In this paper, we focus on a different and less studied research problem, called Contrastive Analysis (CA). Given two image datasets, we want to separate the common generative factors, shared across the two datasets, from the salient ones, specific to only one dataset. Compared to existing methods, which use attributes as supervised signals for editing (e.g., glasses, gender), the proposed method is weaker, since it only uses the dataset signal. We propose a novel framework for CA, that can be adapted to both GAN and Diffusion models, to learn both common and salient factors. By defining new and well-adapted learning strategies and losses, we ensure a relevant separation between common and salient factors, preserving a high-quality generation. We evaluate our approach on diverse datasets, covering human faces, animal images and medical scans. Our framework demonstrates superior separation ability and image quality synthesis compared to prior methods.

Title: On the continuity of flows

Authors: Congzhou M Sha
Subjects: cs.LG, cs.AI, physics.data-an
Abstract URL: https://arxiv.org/abs/2512.12821
Pdf URL: https://arxiv.org/pdf/2512.12821
Copy Paste: [[2512.12821]] On the continuity of flows(https://arxiv.org/abs/2512.12821)
Keywords: generative
Abstract: Flow matching has emerged as a powerful framework for generative modeling through continuous normalizing flows. We investigate a potential topological constraint: when the prior distribution and target distribution have mismatched topology (e.g., unimodal to multimodal), the optimal velocity field under standard flow matching objectives may exhibit spatial discontinuities. We suggest that this discontinuity arises from the requirement that continuous flows must bifurcate to map a single mode to multiple modes, forcing particles to make discrete routing decisions at intermediate times. Through theoretical analysis on bimodal Gaussian mixtures, we demonstrate that the optimal velocity field exhibits jump discontinuities along decision boundaries, with magnitude approaching infinity as time approaches the target distribution. Our analysis suggests that this phenomenon is not specific to $L^2$ loss, but rather may be a consequence of topological mismatch between distributions. We validate our theory empirically and discuss potential implications for flow matching on manifolds, connecting our findings to recent work on Riemannian flow matching and the challenge of learning discontinuous representations in neural networks.

Title: Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners

Authors: N.K.B.M.P.K.B. Narasinghe, Uthayasanker Thayasivam
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12824
Pdf URL: https://arxiv.org/pdf/2512.12824
Copy Paste: [[2512.12824]] Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners(https://arxiv.org/abs/2512.12824)
Keywords: foundation model, generative
Abstract: Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.

Title: Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.12858
Pdf URL: https://arxiv.org/pdf/2512.12858
Copy Paste: [[2512.12858]] Information-Consistent Language Model Recommendations through Group Relative Policy Optimization(https://arxiv.org/abs/2512.12858)
Keywords: generative
Abstract: Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios-such as HR onboarding, customer support, or policy disclosure-require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-trained model reduces variability more effectively than fine-tuning or decoding-based baselines. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity but as a correctable flaw in enterprise deployments.

Title: Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

Authors: Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12887
Pdf URL: https://arxiv.org/pdf/2512.12887
Copy Paste: [[2512.12887]] Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification(https://arxiv.org/abs/2512.12887)
Keywords: foundation model
Abstract: 3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

Title: Distillation of Discrete Diffusion by Exact Conditional Distribution Matching

Authors: Yansong Gao, Yu Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.12889
Pdf URL: https://arxiv.org/pdf/2512.12889
Copy Paste: [[2512.12889]] Distillation of Discrete Diffusion by Exact Conditional Distribution Matching(https://arxiv.org/abs/2512.12889)
Keywords: diffusion, generative
Abstract: Discrete diffusion models (DDMs) are a powerful class of generative models for categorical data, but they typically require many function evaluations for a single sample, making inference expensive. Existing acceleration methods either rely on approximate simulators, such as $\tau$-leaping, or on distillation schemes that train new student models and auxiliary networks with proxy objectives. We propose a simple and principled distillation alternative based on \emph{conditional distribution matching}. Our key observation is that the reverse conditional distribution of clean data given a noisy state, $p_{0\mid t}(x_0 \mid x_t)$, admits a Markov decomposition through intermediate times and can be recovered from marginal density ratios and the known forward CTMC kernel. We exploit this structure to define distillation objectives that directly match conditional distributions between a pre-trained teacher and a low-NFE student, both for one-step and few-step samplers.

Title: Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

Authors: Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2512.12932
Pdf URL: https://arxiv.org/pdf/2512.12932
Copy Paste: [[2512.12932]] Investigating Data Pruning for Pretraining Biological Foundation Models at Scale(https://arxiv.org/abs/2512.12932)
Keywords: foundation model
Abstract: Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.

Title: SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer

Authors: Luan Thanh Trinh, Kenji Doi, Atsuki Osanai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12963
Pdf URL: https://arxiv.org/pdf/2512.12963
Copy Paste: [[2512.12963]] SCAdapter: Content-Style Disentanglement for Diffusion Style Transfer(https://arxiv.org/abs/2512.12963)
Keywords: diffusion
Abstract: Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately address unwanted influence from original content styles and style reference content features. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation systematically extracts pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate SCAdapter significantly outperforms state-of-the-art methods in both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least $2\times$ faster inference than other diffusion-based approaches, making it both more effective and efficient for practical applications.

Title: Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes

Authors: Ziheng Qin, Yuheng Ji, Renshuai Tao, Yuxuan Tian, Yuyang Liu, Yipu Wang, Xiaolong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.12982
Pdf URL: https://arxiv.org/pdf/2512.12982
Copy Paste: [[2512.12982]] Scaling Up AI-Generated Image Detection via Generator-Aware Prototypes(https://arxiv.org/abs/2512.12982)
Keywords: diffusion
Abstract: The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that constrain representation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data this http URL resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of GAN and diffusion-based generators. Code is available at this https URL

Title: Few-Step Distillation for Text-to-Image Generation: A Practical Guide

Authors: Yifan Pu, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Fan Wang, Bohan Zhuang, Gao Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13006
Pdf URL: https://arxiv.org/pdf/2512.13006
Copy Paste: [[2512.13006]] Few-Step Distillation for Text-to-Image Generation: A Practical Guide(https://arxiv.org/abs/2512.13006)
Keywords: diffusion
Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on this http URL.

Title: Light Field Based 6DoF Tracking of Previously Unobserved Objects

Authors: Nikolai Goncharov, James L. Gray, Donald G. Dansereau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13007
Pdf URL: https://arxiv.org/pdf/2512.13007
Copy Paste: [[2512.13007]] Light Field Based 6DoF Tracking of Previously Unobserved Objects(https://arxiv.org/abs/2512.13007)
Keywords: foundation model
Abstract: Object tracking is an important step in robotics and reautonomous driving pipelines, which has to generalize to previously unseen and complex objects. Existing high-performing methods often rely on pre-captured object views to build explicit reference models, which restricts them to a fixed set of known objects. However, such reference models can struggle with visually complex appearance, reducing the quality of tracking. In this work, we introduce an object tracking method based on light field images that does not depend on a pre-trained model, while being robust to complex visual behavior, such as reflections. We extract semantic and geometric features from light field inputs using vision foundation models and convert them into view-dependent Gaussian splats. These splats serve as a unified object representation, supporting differentiable rendering and pose optimization. We further introduce a light field object tracking dataset containing challenging reflective objects with precise ground truth poses. Experiments demonstrate that our method is competitive with state-of-the-art model-based trackers in these difficult cases, paving the way toward universal object tracking in robotic systems. Code/data available at this https URL.

Title: JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

Authors: Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei, Chen Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13014
Pdf URL: https://arxiv.org/pdf/2512.13014
Copy Paste: [[2512.13014]] JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion(https://arxiv.org/abs/2512.13014)
Keywords: diffusion, generative
Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has garnered increasing attention recently for training high-performance semantic segmentation models. However, existing methods necessitate to either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problem. To migrate both problems with one stone, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. Firstly, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing these, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

Title: SneakPeek: Future-Guided Instructional Streaming Video Generation

Authors: Cheeun Hong, German Barquero, Fadime Sener, Markos Georgopoulos, Edgar Schönfeld, Stefan Popov, Yuming Du, Oscar Mañas, Albert Pumarola
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13019
Pdf URL: https://arxiv.org/pdf/2512.13019
Copy Paste: [[2512.13019]] SneakPeek: Future-Guided Instructional Streaming Video Generation(https://arxiv.org/abs/2512.13019)
Keywords: diffusion
Abstract: Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.

Title: Motus: A Unified Latent Action World Model

Authors: Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, Jun Zhu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.13030
Pdf URL: https://arxiv.org/pdf/2512.13030
Copy Paste: [[2512.13030]] Motus: A Unified Latent Action World Model(https://arxiv.org/abs/2512.13030)
Keywords: generative
Abstract: While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level "delta action" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.

Title: Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models

Authors: Hao Chen, Yiwei Wang, Songze Li
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2512.13039
Pdf URL: https://arxiv.org/pdf/2512.13039
Copy Paste: [[2512.13039]] Bi-Erasing: A Bidirectional Framework for Concept Removal in Diffusion Models(https://arxiv.org/abs/2512.13039)
Keywords: diffusion
Abstract: Concept erasure, which fine-tunes diffusion models to remove undesired or harmful visual concepts, has become a mainstream approach to mitigating unsafe or illegal image generation in text-to-image this http URL, existing removal methods typically adopt a unidirectional erasure strategy by either suppressing the target concept or reinforcing safe alternatives, making it difficult to achieve a balanced trade-off between concept removal and generation quality. To address this limitation, we propose a novel Bidirectional Image-Guided Concept Erasure (Bi-Erasing) framework that performs concept suppression and safety enhancement simultaneously. Specifically, based on the joint representation of text prompts and corresponding images, Bi-Erasing introduces two decoupled image branches: a negative branch responsible for suppressing harmful semantics and a positive branch providing visual guidance for safe alternatives. By jointly optimizing these complementary directions, our approach achieves a balance between erasure efficacy and generation usability. In addition, we apply mask-based filtering to the image branches to prevent interference from irrelevant content during the erasure process. Across extensive experiment evaluations, the proposed Bi-Erasing outperforms baseline methods in balancing concept removal effectiveness and visual fidelity.

Title: Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

Authors: Xuwei Tan, Yao Ma, Xueru Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.13040
Pdf URL: https://arxiv.org/pdf/2512.13040
Copy Paste: [[2512.13040]] Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection(https://arxiv.org/abs/2512.13040)
Keywords: in-context
Abstract: Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.

Title: Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing

Authors: Jaeyoon Kim, Yoonki Cho, Sung-Eui Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13055
Pdf URL: https://arxiv.org/pdf/2512.13055
Copy Paste: [[2512.13055]] Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing(https://arxiv.org/abs/2512.13055)
Keywords: foundation model
Abstract: Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments. The code is available at this https URL

Title: Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Authors: Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen, Dingkang Yang, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13072
Pdf URL: https://arxiv.org/pdf/2512.13072
Copy Paste: [[2512.13072]] Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models(https://arxiv.org/abs/2512.13072)
Keywords: foundation model
Abstract: Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbf{M}edical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.

Title: UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era

Authors: Ziqiang Zhu, Bowei Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13089
Pdf URL: https://arxiv.org/pdf/2512.13089
Copy Paste: [[2512.13089]] UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era(https://arxiv.org/abs/2512.13089)
Keywords: foundation model
Abstract: Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at this https URL.

Title: Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation

Authors: Wenjing Lu, Yi Hong, Yang Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13101
Pdf URL: https://arxiv.org/pdf/2512.13101
Copy Paste: [[2512.13101]] Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2512.13101)
Keywords: foundation model
Abstract: Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.

Title: Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

Authors: Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Liu, Jintao Cheng, Xieyuanli Chen, Xiaoyu Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13107
Pdf URL: https://arxiv.org/pdf/2512.13107
Copy Paste: [[2512.13107]] Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather(https://arxiv.org/abs/2512.13107)
Keywords: diffusion
Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.

Title: Intrinsic Image Fusion for Multi-View 3D Material Reconstruction

Authors: Peter Kocsis (1), Lukas Höllein (1), Matthias Nießner (1) ((1) Technical University of Munich)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13157
Pdf URL: https://arxiv.org/pdf/2512.13157
Copy Paste: [[2512.13157]] Intrinsic Image Fusion for Multi-View 3D Material Reconstruction(https://arxiv.org/abs/2512.13157)
Keywords: diffusion
Abstract: We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.

Title: A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis

Authors: Xianchao Guan, Zhiyuan Fan, Yifeng Wang, Fuqiang Chen, Yanjiang Zhou, Zengyang Che, Hongxue Meng, Xin Li, Yaowei Wang, Hongpeng Wang, Min Zhang, Heng Tao Shen, Zheng Zhang, Yongbing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13164
Pdf URL: https://arxiv.org/pdf/2512.13164
Copy Paste: [[2512.13164]] A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis(https://arxiv.org/abs/2512.13164)
Keywords: self-supervised, foundation model, generative
Abstract: The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.

Title: POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

Authors: Zhuo Chen, Chengqun Yang, Zhuo Su, Zheng Lv, Jingnan Gao, Xiaoyuan Zhang, Xiaokang Yang, Yichao Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13192
Pdf URL: https://arxiv.org/pdf/2512.13192
Copy Paste: [[2512.13192]] POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling(https://arxiv.org/abs/2512.13192)
Keywords: diffusion, generative
Abstract: Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining "chicken-and-egg" cycle for scalable and reproducible portrait illumination.

Title: CORE: Contrastive Masked Feature Reconstruction on Graphs

Authors: Jianyuan Bo, Yuan Fang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13235
Pdf URL: https://arxiv.org/pdf/2512.13235
Copy Paste: [[2512.13235]] CORE: Contrastive Masked Feature Reconstruction on Graphs(https://arxiv.org/abs/2512.13235)
Keywords: self-supervised, generative
Abstract: In the rapidly evolving field of self-supervised learning on graphs, generative and contrastive methodologies have emerged as two dominant approaches. Our study focuses on masked feature reconstruction (MFR), a generative technique where a model learns to restore the raw features of masked nodes in a self-supervised manner. We observe that both MFR and graph contrastive learning (GCL) aim to maximize agreement between similar elements. Building on this observation, we reveal a novel theoretical insight: under specific conditions, the objectives of MFR and node-level GCL converge, despite their distinct operational mechanisms. This theoretical connection suggests these approaches are complementary rather than fundamentally different, prompting us to explore their integration to enhance self-supervised learning on graphs. Our research presents Contrastive Masked Feature Reconstruction (CORE), a novel graph self-supervised learning framework that integrates contrastive learning into MFR. Specifically, we form positive pairs exclusively between the original and reconstructed features of masked nodes, encouraging the encoder to prioritize contextual information over the node's own features. Additionally, we leverage the masked nodes themselves as negative samples, combining MFR's reconstructive power with GCL's discriminative ability to better capture intrinsic graph structures. Empirically, our proposed framework CORE significantly outperforms MFR across node and graph classification tasks, demonstrating state-of-the-art results. In particular, CORE surpasses GraphMAE and GraphMAE2 by up to 2.80% and 3.72% on node classification tasks, and by up to 3.82% and 3.76% on graph classification tasks.

Title: STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

Authors: Foivos Paraperas Papantoniou, Stathis Galanakis, Rolandos Alexandros Potamias, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13247
Pdf URL: https://arxiv.org/pdf/2512.13247
Copy Paste: [[2512.13247]] STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits(https://arxiv.org/abs/2512.13247)
Keywords: diffusion
Abstract: This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.

Title: BézierFlow: Bézier Stochastic Interpolant Schedulers for Few-Step Generation

Authors: Yunhong Min, Juil Koo, Seungwoo Yoo, Minhyuk Sung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13255
Pdf URL: https://arxiv.org/pdf/2512.13255
Copy Paste: [[2512.13255]] BézierFlow: Bézier Stochastic Interpolant Schedulers for Few-Step Generation(https://arxiv.org/abs/2512.13255)
Keywords: diffusion
Abstract: We introduce BézierFlow, a lightweight training approach for few-step generation with pretrained diffusion and flow models. BézierFlow achieves a 2-3x performance improvement for sampling with $\leq$ 10 NFEs while requiring only 15 minutes of training. Recent lightweight training approaches have shown promise by learning optimal timesteps, but their scope remains restricted to ODE discretizations. To broaden this scope, we propose learning the optimal transformation of the sampling trajectory by parameterizing stochastic interpolant (SI) schedulers. The main challenge lies in designing a parameterization that satisfies critical desiderata, including boundary conditions, differentiability, and monotonicity of the SNR. To effectively meet these requirements, we represent scheduler functions as Bézier functions, where control points naturally enforce these properties. This reduces the problem to learning an ordered set of points in the time range, while the interpretation of the points changes from ODE timesteps to Bézier control points. Across a range of pretrained diffusion and flow models, BézierFlow consistently outperforms prior timestep-learning methods, demonstrating the effectiveness of expanding the search space from discrete timesteps to Bézier-based trajectory transformations.

Title: CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

Authors: Yan Li, Lin Liu, Xiaopeng Zhang, Wei Xue, Wenhan Luo, Yike Guo, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13276
Pdf URL: https://arxiv.org/pdf/2512.13276
Copy Paste: [[2512.13276]] CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing(https://arxiv.org/abs/2512.13276)
Keywords: diffusion
Abstract: Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods strug- gle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework CogniEdit, combining multi-modal reasoning with dense reward optimization that propagates gradients across con- secutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation

Title: CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

Authors: Bo Liu, Qiao Qin, Qinghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13285
Pdf URL: https://arxiv.org/pdf/2512.13285
Copy Paste: [[2512.13285]] CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images(https://arxiv.org/abs/2512.13285)
Keywords: generative
Abstract: The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.

Title: LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

Authors: Shu Yu, Chaochao Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13290
Pdf URL: https://arxiv.org/pdf/2512.13290
Copy Paste: [[2512.13290]] LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models(https://arxiv.org/abs/2512.13290)
Keywords: diffusion
Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at this https URL.

Title: ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement

Authors: Zhihang Liu, Xiaoyi Bao, Pandeng Li, Junjie Zhou, Zhaohe Liao, Yefei He, Kaixun Jiang, Chen-Wei Xie, Yun Zheng, Hongtao Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13303
Pdf URL: https://arxiv.org/pdf/2512.13303
Copy Paste: [[2512.13303]] ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement(https://arxiv.org/abs/2512.13303)
Keywords: diffusion
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.

Title: ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning

Authors: Mayank Gulati, Benedikt Groß, Gerhard Wunder
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13316
Pdf URL: https://arxiv.org/pdf/2512.13316
Copy Paste: [[2512.13316]] ALIGN-FL: Architecture-independent Learning through Invariant Generative component sharing in Federated Learning(https://arxiv.org/abs/2512.13316)
Keywords: generative
Abstract: We present ALIGN-FL, a novel approach to distributed learning that addresses the challenge of learning from highly disjoint data distributions through selective sharing of generative components. Instead of exchanging full model parameters, our framework enables privacy-preserving learning by transferring only generative capabilities across clients, while the server performs global training using synthetic samples. Through complementary privacy mechanisms: DP-SGD with adaptive clipping and Lipschitz regularized VAE decoders and a stateful architecture supporting heterogeneous clients, we experimentally validate our approach on MNIST and Fashion-MNIST datasets with cross-domain outliers. Our analysis demonstrates that both privacy mechanisms effectively map sensitive outliers to typical data points while maintaining utility in extreme Non-IID scenarios typical of cross-silo collaborations. Index Terms: Client-invariant Learning, Federated Learning (FL), Privacy-preserving Generative Models, Non-Independent and Identically Distributed (Non-IID), Heterogeneous Architectures

Title: FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Authors: Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13330
Pdf URL: https://arxiv.org/pdf/2512.13330
Copy Paste: [[2512.13330]] FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models(https://arxiv.org/abs/2512.13330)
Keywords: generative
Abstract: We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at this https URL. Supplementary resources are released in a separate repository at this https URL.

Title: Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs

Authors: Anran Qi, Changjian Li, Adrien Bousseau, Niloy J.Mitra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13392
Pdf URL: https://arxiv.org/pdf/2512.13392
Copy Paste: [[2512.13392]] Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs(https://arxiv.org/abs/2512.13392)
Keywords: diffusion, generative
Abstract: We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages against state-of-the-art alternatives towards images turned into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: this https URL

Title: RecTok: Reconstruction Distillation along Rectified Flow

Authors: Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13421
Pdf URL: https://arxiv.org/pdf/2512.13421
Copy Paste: [[2512.13421]] RecTok: Reconstruction Distillation along Rectified Flow(https://arxiv.org/abs/2512.13421)
Keywords: diffusion, foundation model
Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at this https URL.

Title: Test-Time Modification: Inverse Domain Transformation for Robust Perception

Authors: Arpit Jadon, Joshua Niemeijer, Yuki M. Asano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13454
Pdf URL: https://arxiv.org/pdf/2512.13454
Copy Paste: [[2512.13454]] Test-Time Modification: Inverse Domain Transformation for Robust Perception(https://arxiv.org/abs/2512.13454)
Keywords: diffusion, foundation model, generative
Abstract: Generative foundation models contain broad visual knowledge and can produce diverse image variations, making them particularly promising for advancing domain generalization tasks. While they can be used for training data augmentation, synthesizing comprehensive target-domain variations remains slow, expensive, and incomplete. We propose an alternative: using diffusion models at test time to map target images back to the source distribution where the downstream model was trained. This approach requires only a source domain description, preserves the task model, and eliminates large-scale synthetic data generation. We demonstrate consistent improvements across segmentation, detection, and classification tasks under challenging environmental shifts in real-to-real domain generalization scenarios with unknown target distributions. Our analysis spans multiple generative and downstream models, including an ensemble variant for enhanced robustness. The method achieves substantial relative gains: 137% on BDD100K-Night, 68% on ImageNet-R, and 62% on DarkZurich.

Title: On-Device Continual Learning for Unsupervised Visual Anomaly Detection in Dynamic Manufacturing

Authors: Haoyu Ren, Kay Koehle, Kirill Dorofeev, Darko Anicic
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13497
Pdf URL: https://arxiv.org/pdf/2512.13497
Copy Paste: [[2512.13497]] On-Device Continual Learning for Unsupervised Visual Anomaly Detection in Dynamic Manufacturing(https://arxiv.org/abs/2512.13497)
Keywords: anomaly
Abstract: In modern manufacturing, Visual Anomaly Detection (VAD) is essential for automated inspection and consistent product quality. Yet, increasingly dynamic and flexible production environments introduce key challenges: First, frequent product changes in small-batch and on-demand manufacturing require rapid model updates. Second, legacy edge hardware lacks the resources to train and run large AI models. Finally, both anomalous and normal training data are often scarce, particularly for newly introduced product variations. We investigate on-device continual learning for unsupervised VAD with localization, extending the PatchCore to incorporate online learning for real-world industrial scenarios. The proposed method leverages a lightweight feature extractor and an incremental coreset update mechanism based on k-center selection, enabling rapid, memory-efficient adaptation from limited data while eliminating costly cloud retraining. Evaluations on an industrial use case are conducted using a testbed designed to emulate flexible production with frequent variant changes in a controlled environment. Our method achieves a 12% AUROC improvement over the baseline, an 80% reduction in memory usage, and faster training compared to batch retraining. These results confirm that our method delivers accurate, resource-efficient, and adaptive VAD suitable for dynamic and smart manufacturing.

Title: Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Authors: Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Zeyu Sun, Wenjing Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13507
Pdf URL: https://arxiv.org/pdf/2512.13507
Copy Paste: [[2512.13507]] Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model(https://arxiv.org/abs/2512.13507)
Keywords: diffusion, foundation model
Abstract: Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.

Title: Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains

Authors: Marianne Rakic, Siyu Gai, Etienne Chollet, John V. Guttag, Adrian V. Dalca
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13534
Pdf URL: https://arxiv.org/pdf/2512.13534
Copy Paste: [[2512.13534]] Pancakes: Consistent Multi-Protocol Image Segmentation Across Biomedical Domains(https://arxiv.org/abs/2512.13534)
Keywords: foundation model
Abstract: A single biomedical image can be meaningfully segmented in multiple ways, depending on the desired application. For instance, a brain MRI can be segmented according to tissue types, vascular territories, broad anatomical regions, fine-grained anatomy, or pathology, etc. Existing automatic segmentation models typically either (1) support only a single protocol, the one they were trained on, or (2) require labor-intensive manual prompting to specify the desired segmentation. We introduce Pancakes, a framework that, given a new image from a previously unseen domain, automatically generates multi-label segmentation maps for multiple plausible protocols, while maintaining semantic consistency across related images. Pancakes introduces a new problem formulation that is not currently attainable by existing foundation models. In a series of experiments on seven held-out datasets, we demonstrate that our model can significantly outperform existing foundation models in producing several plausible whole-image segmentations, that are semantically coherent across images.

Title: PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

Authors: Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13552
Pdf URL: https://arxiv.org/pdf/2512.13552
Copy Paste: [[2512.13552]] PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation(https://arxiv.org/abs/2512.13552)
Keywords: generative
Abstract: This work introduces {\it PrahokBART}, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles space during text generation, which is crucial for the naturalness of texts in Khmer.

Title: 3D Human-Human Interaction Anomaly Detection

Authors: Shun Maeda, Chunzhi Gu, Koichiro Kamide, Katsuya Hotta, Shangce Gao, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13560
Pdf URL: https://arxiv.org/pdf/2512.13560
Copy Paste: [[2512.13560]] 3D Human-Human Interaction Anomaly Detection(https://arxiv.org/abs/2512.13560)
Keywords: anomaly
Abstract: Human-centric anomaly detection (AD) has been primarily studied to specify anomalous behaviors in a single person. However, as humans by nature tend to act in a collaborative manner, behavioral anomalies can also arise from human-human interactions. Detecting such anomalies using existing single-person AD models is prone to low accuracy, as these approaches are typically not designed to capture the complex and asymmetric dynamics of interactions. In this paper, we introduce a novel task, Human-Human Interaction Anomaly Detection (H2IAD), which aims to identify anomalous interactive behaviors within collaborative 3D human actions. To address H2IAD, we then propose Interaction Anomaly Detection Network (IADNet), which is formalized with a Temporal Attention Sharing Module (TASM). Specifically, in designing TASM, we share the encoded motion embeddings across both people such that collaborative motion correlations can be effectively synchronized. Moreover, we notice that in addition to temporal dynamics, human interactions are also characterized by spatial configurations between two people. We thus introduce a Distance-Based Relational Encoding Module (DREM) to better reflect social cues in H2IAD. The normalizing flow is eventually employed for anomaly scoring. Extensive experiments on human-human motion benchmarks demonstrate that IADNet outperforms existing Human-centric AD baselines in H2IAD.

Title: Memory in the Age of AI Agents

Authors: Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13564
Pdf URL: https://arxiv.org/pdf/2512.13564
Copy Paste: [[2512.13564]] Memory in the Age of AI Agents(https://arxiv.org/abs/2512.13564)
Keywords: foundation model
Abstract: Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.

Title: ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13586
Pdf URL: https://arxiv.org/pdf/2512.13586
Copy Paste: [[2512.13586]] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding(https://arxiv.org/abs/2512.13586)
Keywords: diffusion
Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

Title: Image Diffusion Preview with Consistency Solver

Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13592
Pdf URL: https://arxiv.org/pdf/2512.13592
Copy Paste: [[2512.13592]] Image Diffusion Preview with Consistency Solver(https://arxiv.org/abs/2512.13592)
Keywords: diffusion
Abstract: The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.

Title: Lighting in Motion: Spatiotemporal HDR Lighting Estimation

Authors: Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13597
Pdf URL: https://arxiv.org/pdf/2512.13597
Copy Paste: [[2512.13597]] Lighting in Motion: Spatiotemporal HDR Lighting Estimation(https://arxiv.org/abs/2512.13597)
Keywords: diffusion
Abstract: We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

Title: DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides

Authors: Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen, Lucas Liu, Roxanne Wadia, Peter A Humphrey, Fady Ghali, Alberto Contreras-Sanz, Peter Black, Jonathan Wright, Stephanie Harmon, Michael Haffner
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13600
Pdf URL: https://arxiv.org/pdf/2512.13600
Copy Paste: [[2512.13600]] DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides(https://arxiv.org/abs/2512.13600)
Keywords: self-supervised
Abstract: Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts - these cancer types were rarely used for pretraining or specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT), which are essential for diagnosing muscle-invasive bladder cancer (MIBC), but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is Available at this https URL.

Title: DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis

Authors: Felix J. Dorfner, Manon A. Dorster, Ryan Connolly, Oscar Gentilhomme, Edward Gibbs, Steven Graham, Seth Wander, Thomas Schultz, Manisha Bahl, Dania Daye, Albert E. Kim, Christopher P. Bridge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13608
Pdf URL: https://arxiv.org/pdf/2512.13608
Copy Paste: [[2512.13608]] DBT-DINO: Towards Foundation model based analysis of Digital Breast Tomosynthesis(https://arxiv.org/abs/2512.13608)
Keywords: self-supervised, foundation model
Abstract: Foundation models have shown promise in medical imaging but remain underexplored for three-dimensional imaging modalities. No foundation model currently exists for Digital Breast Tomosynthesis (DBT), despite its use for breast cancer screening. To develop and evaluate a foundation model for DBT (DBT-DINO) across multiple clinical tasks and assess the impact of domain-specific pre-training. Self-supervised pre-training was performed using the DINOv2 methodology on over 25 million 2D slices from 487,975 DBT volumes from 27,990 patients. Three downstream tasks were evaluated: (1) breast density classification using 5,000 screening exams; (2) 5-year risk of developing breast cancer using 106,417 screening exams; and (3) lesion detection using 393 annotated volumes. For breast density classification, DBT-DINO achieved an accuracy of 0.79 (95\% CI: 0.76--0.81), outperforming both the MetaAI DINOv2 baseline (0.73, 95\% CI: 0.70--0.76, p<.001) and DenseNet-121 (0.74, 95\% CI: 0.71--0.76, p<.001). For 5-year breast cancer risk prediction, DBT-DINO achieved an AUROC of 0.78 (95\% CI: 0.76--0.80) compared to DINOv2's 0.76 (95\% CI: 0.74--0.78, p=.57). For lesion detection, DINOv2 achieved a higher average sensitivity of 0.67 (95\% CI: 0.60--0.74) compared to DBT-DINO with 0.62 (95\% CI: 0.53--0.71, p=.60). DBT-DINO demonstrated better performance on cancerous lesions specifically with a detection rate of 78.8\% compared to Dinov2's 77.3\%. Using a dataset of unprecedented size, we developed DBT-DINO, the first foundation model for DBT. DBT-DINO demonstrated strong performance on breast density classification and cancer risk prediction. However, domain-specific pre-training showed variable benefits on the detection task, with ImageNet baseline outperforming DBT-DINO on general lesion detection, indicating that localized detection tasks require further methodological development.

Title: Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13609
Pdf URL: https://arxiv.org/pdf/2512.13609
Copy Paste: [[2512.13609]] Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models(https://arxiv.org/abs/2512.13609)
Keywords: generative
Abstract: We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.

Title: SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

Authors: Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Siqi Lu, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yalin Zheng, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13635
Pdf URL: https://arxiv.org/pdf/2512.13635
Copy Paste: [[2512.13635]] SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning(https://arxiv.org/abs/2512.13635)
Keywords: foundation model
Abstract: Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL

Title: Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency

Authors: Wenhan Chen, Sezer Karaoglu, Theo Gevers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13665
Pdf URL: https://arxiv.org/pdf/2512.13665
Copy Paste: [[2512.13665]] Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency(https://arxiv.org/abs/2512.13665)
Keywords: diffusion
Abstract: Recent advances in diffusion-based generation techniques enable AI models to produce highly realistic videos, heightening the need for reliable detection mechanisms. However, existing detection methods provide only limited exploration of the 3D geometric patterns present in generated videos. In this paper, we use vanishing points as an explicit representation of 3D geometry patterns, revealing fundamental discrepancies in geometric consistency between real and AI-generated videos. We introduce Grab-3D, a geometry-aware transformer framework for detecting AI-generated videos based on 3D geometric temporal consistency. To enable reliable evaluation, we construct an AI-generated video dataset of static scenes, allowing stable 3D geometric feature extraction. We propose a geometry-aware transformer equipped with geometric positional encoding, temporal-geometric attention, and an EMA-based geometric classifier head to explicitly inject 3D geometric awareness into temporal modeling. Experiments demonstrate that Grab-3D significantly outperforms state-of-the-art detectors, achieving robust cross-domain generalization to unseen generators.

Title: AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

Authors: Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13671
Pdf URL: https://arxiv.org/pdf/2512.13671
Copy Paste: [[2512.13671]] AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection(https://arxiv.org/abs/2512.13671)
Keywords: anomaly
Abstract: Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.

Title: Feedforward 3D Editing via Text-Steerable Image-to-3D

Authors: Ziqi Ma, Hongqiao Chen, Yisong Yue, Georgia Gkioxari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13678
Pdf URL: https://arxiv.org/pdf/2512.13678
Copy Paste: [[2512.13678]] Feedforward 3D Editing via Text-Steerable Image-to-3D(https://arxiv.org/abs/2512.13678)
Keywords: generative
Abstract: Recent progress in image-to-3D has opened up immense possibilities for design, AR/VR, and robotics. However, to use AI-generated 3D assets in real applications, a critical requirement is the capability to edit them easily. We present a feedforward method, Steer3D, to add text steerability to image-to-3D models, which enables editing of generated 3D assets with language. Our approach is inspired by ControlNet, which we adapt to image-to-3D generation to enable text steering directly in a forward pass. We build a scalable data engine for automatic data generation, and develop a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO). Compared to competing methods, Steer3D more faithfully follows the language instruction and maintains better consistency with the original 3D asset, while being 2.4x to 28.5x faster. Steer3D demonstrates that it is possible to add a new modality (text) to steer the generation of pretrained image-to-3D generative models with 100k data. Project website: this https URL

Title: I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13683
Pdf URL: https://arxiv.org/pdf/2512.13683
Copy Paste: [[2512.13683]] I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners(https://arxiv.org/abs/2512.13683)
Keywords: foundation model
Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: this https URL

Title: Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech

Authors: Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Lilian Hubner, Bárbara Malcorra, César Rennó-Costa, Marco Idiart, Maria-Cruz Villa-Uriol, Aline Villavicencio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.13685
Pdf URL: https://arxiv.org/pdf/2512.13685
Copy Paste: [[2512.13685]] Beyond surface form: A pipeline for semantic analysis in Alzheimer's Disease detection from spontaneous speech(https://arxiv.org/abs/2512.13685)
Keywords: generative
Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative condition that adversely affects cognitive abilities. Language-related changes can be automatically identified through the analysis of outputs from linguistic assessment tasks, such as picture description. Language models show promise as a basis for screening tools for AD, but their limited interpretability poses a challenge in distinguishing true linguistic markers of cognitive decline from surface-level textual patterns. To address this issue, we examine how surface form variation affects classification performance, with the goal of assessing the ability of language models to represent underlying semantic indicators. We introduce a novel approach where texts surface forms are transformed by altering syntax and vocabulary while preserving semantic content. The transformations significantly modify the structure and lexical content, as indicated by low BLEU and chrF scores, yet retain the underlying semantics, as reflected in high semantic similarity scores, isolating the effect of semantic information, and finding models perform similarly to if they were using the original text, with only small deviations in macro-F1. We also investigate whether language from picture descriptions retains enough detail to reconstruct the original image using generative models. We found that image-based transformations add substantial noise reducing classification accuracy. Our methodology provides a novel way of looking at what features influence model predictions, and allows the removal of possible spurious correlations. We find that just using semantic information, language model based classifiers can still detect AD. This work shows that difficult to detect semantic impairment can be identified, addressing an overlooked feature of linguistic deterioration, and opening new pathways for early detection systems.

Title: Towards Scalable Pre-training of Visual Tokenizers for Generation

Authors: Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13687
Pdf URL: https://arxiv.org/pdf/2512.13687
Copy Paste: [[2512.13687]] Towards Scalable Pre-training of Visual Tokenizers for Generation(https://arxiv.org/abs/2512.13687)
Keywords: self-supervised, generative
Abstract: The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundation flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the ``pre-training scaling problem`` and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pretraining of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pretraining VTP achieves 65.8\% FID improvement in downstream generation, while conventional autoencoder stagnates very early at 1/10 FLOPS. Our pre-trained models are available at this https URL.

Title: DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders

Authors: Susung Hong, Chongjian Ge, Zhifei Zhang, Jui-Hsien Wang
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.13690
Pdf URL: https://arxiv.org/pdf/2512.13690
Copy Paste: [[2512.13690]] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders(https://arxiv.org/abs/2512.13690)
Keywords: diffusion, generative
Abstract: Video diffusion models have revolutionized generative video synthesis, but they are imprecise, slow, and can be opaque during generation -- keeping users in the dark for a prolonged period. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our model can generate multi-modal preview representations that include RGB and scene intrinsics at more than 4$\times$ real-time speed (less than 1 second for a 4-second video) that convey consistent appearance and motion to the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.