2025-12-25

Title: Dominating vs. Dominated: Generative Collapse in Diffusion Models

Authors: Hayeon Jeong, Jong-Seok Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20666
Pdf URL: https://arxiv.org/pdf/2512.20666
Copy Paste: [[2512.20666]] Dominating vs. Dominated: Generative Collapse in Diffusion Models(https://arxiv.org/abs/2512.20666)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models have drawn significant attention for their ability to generate diverse and high-fidelity images. However, when generating from multi-concept prompts, one concept token often dominates the generation, suppressing the others-a phenomenon we term the Dominant-vs-Dominated (DvD) imbalance. To systematically analyze this imbalance, we introduce DominanceBench and examine its causes from both data and architectural perspectives. Through various experiments, we show that the limited instance diversity in training data exacerbates the inter-concept interference. Analysis of cross-attention dynamics further reveals that dominant tokens rapidly saturate attention, progressively suppressing others across diffusion timesteps. In addition, head ablation studies show that the DvD behavior arises from distributed attention mechanisms across multiple heads. Our findings provide key insights into generative collapse, advancing toward more reliable and controllable text-to-image generation.

Title: SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Authors: Alexandros Christoforos, Chadbourne Davis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20724
Pdf URL: https://arxiv.org/pdf/2512.20724
Copy Paste: [[2512.20724]] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention(https://arxiv.org/abs/2512.20724)
Keywords: diffusion
Abstract: Diffusion based approaches to long form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state of the art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long form applications such as scientific writing, large scale code generation, and multi turn long context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long text generation.

Title: VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Authors: Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20735
Pdf URL: https://arxiv.org/pdf/2512.20735
Copy Paste: [[2512.20735]] VL4Gaze: Unleashing Vision-Language Models for Gaze Following(https://arxiv.org/abs/2512.20735)
Keywords: in-context
Abstract: Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

Title: TS-Arena Technical Report -- A Pre-registered Live Forecasting Platform

Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Henrik Albers, Oliver Müller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20761
Pdf URL: https://arxiv.org/pdf/2512.20761
Copy Paste: [[2512.20761]] TS-Arena Technical Report -- A Pre-registered Live Forecasting Platform(https://arxiv.org/abs/2512.20761)
Keywords: foundation model
Abstract: While Time Series Foundation Models (TSFMs) offer transformative capabilities for forecasting, they simultaneously risk triggering a fundamental evaluation crisis. This crisis is driven by information leakage due to overlapping training and test sets across different models, as well as the illegitimate transfer of global patterns to test data. While the ability to learn shared temporal dynamics represents a primary strength of these models, their evaluation on historical archives often permits the exploitation of observed global shocks, which violates the independence required for valid benchmarking. We introduce TS-Arena, a platform that restores the operational integrity of forecasting by treating the genuinely unknown future as the definitive test environment. By implementing a pre-registration mechanism on live data streams, the platform ensures that evaluation targets remain physically non-existent during inference, thereby enforcing a strict global temporal split. This methodology establishes a moving temporal frontier that prevents historical contamination and provides an authentic assessment of model generalization. Initially applied within the energy sector, TS-Arena provides a sustainable infrastructure for comparing foundation models under real-world constraints. A prototype of the platform is available at this https URL.

Title: Improving Matrix Exponential for Generative AI Flows: A Taylor-Based Approach Beyond Paterson--Stockmeyer

Authors: Jorge Sastre, Daniel Faronbi, José Miguel Alonso, Peter Traver, Javier Ibáñez, Nuria Lloret
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2512.20777
Pdf URL: https://arxiv.org/pdf/2512.20777
Copy Paste: [[2512.20777]] Improving Matrix Exponential for Generative AI Flows: A Taylor-Based Approach Beyond Paterson--Stockmeyer(https://arxiv.org/abs/2512.20777)
Keywords: generative
Abstract: The matrix exponential is a fundamental operator in scientific computing and system simulation, with applications ranging from control theory and quantum mechanics to modern generative machine learning. While Padé approximants combined with scaling and squaring have long served as the standard, recent Taylor-based methods, which utilize polynomial evaluation schemes that surpass the classical Paterson--Stockmeyer technique, offer superior accuracy and reduced computational complexity. This paper presents an optimized Taylor-based algorithm for the matrix exponential, specifically designed for the high-throughput requirements of generative AI flows. We provide a rigorous error analysis and develop a dynamic selection strategy for the Taylor order and scaling factor to minimize computational effort under a prescribed error tolerance. Extensive numerical experiments demonstrate that our approach provides significant acceleration and maintains high numerical stability compared to existing state-of-the-art implementations. These results establish the proposed method as a highly efficient tool for large-scale generative modeling.

Title: Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

Authors: Pierre Abillama, Changwoo Lee, Juechu Dong, David Blaauw, Dennis Sylvester, Hun-Seok Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20861
Pdf URL: https://arxiv.org/pdf/2512.20861
Copy Paste: [[2512.20861]] Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs(https://arxiv.org/abs/2512.20861)
Keywords: foundation model
Abstract: Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) methods often incur sharp accuracy drops, BLR approaches such as Monarch and BLAST can better capture the underlying structure, thus preserving accuracy while reducing computations and memory footprints. In this work, we use roofline analysis to show that, although BLR methods achieve theoretical savings and practical speedups for single-token inference, multi-token inference often becomes memory-bound in practice, increasing latency despite compiler-level optimizations in PyTorch. To address this, we introduce custom Triton kernels with partial fusion and memory layout optimizations for both Monarch and BLAST. On memory-constrained NVIDIA GPUs such as Jetson Orin Nano and A40, our kernels deliver up to $3.76\times$ speedups and $3\times$ model size compression over PyTorch dense baselines using CUDA backend and compiler-level optimizations, while supporting various models including Llama-7/1B, GPT2-S, DiT-XL/2, and ViT-B. Our code is available at this https URL .

Title: Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification

Authors: Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Zhelin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20892
Pdf URL: https://arxiv.org/pdf/2512.20892
Copy Paste: [[2512.20892]] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification(https://arxiv.org/abs/2512.20892)
Keywords: foundation model
Abstract: Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9\% and 60.5\% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at this https URL.

Title: DiEC: Diffusion Embedded Clustering

Authors: Haidong Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.20905
Pdf URL: https://arxiv.org/pdf/2512.20905
Copy Paste: [[2512.20905]] DiEC: Diffusion Embedded Clustering(https://arxiv.org/abs/2512.20905)
Keywords: diffusion, generative
Abstract: Deep clustering hinges on learning representations that are inherently clusterable. However, using a single encoder to produce a fixed embedding ignores the representation trajectory formed by a pretrained diffusion model across network hierarchies and noise timesteps, where clusterability varies substantially. We propose DiEC (Diffusion Embedded Clustering), which performs unsupervised clustering by directly reading internal activations from a pretrained diffusion U-Net. DiEC formulates representation selection as a two-dimensional search over layer x timestep, and exploits a weak-coupling property to decompose it into two stages. Specifically, we first fix the U-Net bottleneck layer as the Clustering-friendly Middle Layer (CML), and then use Optimal Timestep Search (OTS) to identify the clustering-optimal timestep (t*). During training, we extract bottleneck features at the fixed t* and obtain clustering representations via a lightweight residual mapping. We optimize a DEC-style KL self-training objective, augmented with adaptive graph regularization and entropy regularization to strengthen cluster structures. In parallel, we introduce a denoising-consistency branch at random timesteps to stabilize the representations and preserve generative consistency. Experiments show that DiEC achieves competitive clustering performance on multiple standard benchmarks.

Title: Self-supervised Multiplex Consensus Mamba for General Image Fusion

Authors: Yingying Wang, Rongjin Zhuang, Hui Zheng, Xuanhua He, Ke Cao, Xiaotong Tu, Xinghao Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20921
Pdf URL: https://arxiv.org/pdf/2512.20921
Copy Paste: [[2512.20921]] Self-supervised Multiplex Consensus Mamba for General Image Fusion(https://arxiv.org/abs/2512.20921)
Keywords: self-supervised
Abstract: Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

Title: Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

Authors: Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, Shu Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20937
Pdf URL: https://arxiv.org/pdf/2512.20937
Copy Paste: [[2512.20937]] Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection(https://arxiv.org/abs/2512.20937)
Keywords: generative
Abstract: The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.

Title: Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study

Authors: Zhongren Dong, Haotian Guo, Weixiang Xu, Huan Zhao, Zixing Zhang
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2512.20948
Pdf URL: https://arxiv.org/pdf/2512.20948
Copy Paste: [[2512.20948]] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study(https://arxiv.org/abs/2512.20948)
Keywords: foundation model
Abstract: Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.

Title: Generalization of Diffusion Models Arises with a Balanced Representation Space

Authors: Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.20963
Pdf URL: https://arxiv.org/pdf/2512.20963
Copy Paste: [[2512.20963]] Generalization of Diffusion Models Arises with a Balanced Representation Space(https://arxiv.org/abs/2512.20963)
Keywords: diffusion, generative
Abstract: Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized "spiky" representations, whereas (ii) generalization arises when the model captures local data statistics, producing "balanced" representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.

Title: Neutralization of IMU-Based GPS Spoofing Detection using external IMU sensor and feedback methodology

Authors: Ji Hyuk Jung, Ji Won Yoon
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.20964
Pdf URL: https://arxiv.org/pdf/2512.20964
Copy Paste: [[2512.20964]] Neutralization of IMU-Based GPS Spoofing Detection using external IMU sensor and feedback methodology(https://arxiv.org/abs/2512.20964)
Keywords: anomaly
Abstract: Autonomous Vehicles (AVs) refer to systems capable of perceiving their states and moving without human intervention. Among the factors required for autonomous decision-making in mobility, positional awareness of the vehicle itself is the most critical. Accordingly, extensive research has been conducted on defense mechanisms against GPS spoofing attacks, which threaten AVs by disrupting position recognition. Among these, detection methods based on internal IMU sensors are regarded as some of the most effective. In this paper, we propose a spoofing attack system designed to neutralize IMU sensor-based detection. First, we present an attack modeling approach for bypassing such detection. Then, based on EKF sensor fusion, we experimentally analyze both the impact of GPS spoofing values on the internal target system and how our proposed methodology reduces anomaly detection within the target system. To this end, this paper proposes an attack model that performs GPS spoofing by stealing internal dynamic state information using an external IMU sensor, and the experimental results demonstrate that attack values can be injected without being detected.

Title: X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data

Authors: Xinquan Yang, Jinheng Xie, Yawen Huang, Yuexiang Li, Huimin Huang, Hao Zheng, Xian Wu, Yefeng Zheng, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20980
Pdf URL: https://arxiv.org/pdf/2512.20980
Copy Paste: [[2512.20980]] X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data(https://arxiv.org/abs/2512.20980)
Keywords: diffusion, generative
Abstract: Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tailed lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.

Title: PUFM++: Point Cloud Upsampling via Enhanced Flow Matching

Authors: Zhi-Song Liu, Chenhang He, Roland Maier, Andreas Rupp
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.20988
Pdf URL: https://arxiv.org/pdf/2512.20988
Copy Paste: [[2512.20988]] PUFM++: Point Cloud Upsampling via Enhanced Flow Matching(https://arxiv.org/abs/2512.20988)
Keywords: generative
Abstract: Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to approximate the terminal marginal distribution better. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network~(RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at this https URL.

Title: Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Authors: Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21004
Pdf URL: https://arxiv.org/pdf/2512.21004
Copy Paste: [[2512.21004]] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations(https://arxiv.org/abs/2512.21004)
Keywords: foundation model, generative
Abstract: Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

Title: FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Authors: Mingshu Cai, Yixuan Li, Osamu Yoshie, Yuya Ieiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21015
Pdf URL: https://arxiv.org/pdf/2512.21015
Copy Paste: [[2512.21015]] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing(https://arxiv.org/abs/2512.21015)
Keywords: diffusion, generative
Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.

Title: Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

Authors: Mingshu Cai, Osamu Yoshie, Yuya Ieiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21032
Pdf URL: https://arxiv.org/pdf/2512.21032
Copy Paste: [[2512.21032]] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model(https://arxiv.org/abs/2512.21032)
Keywords: diffusion, generative
Abstract: Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

Title: Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

Authors: Yiwen Shan, Haiyu Zhao, Peng Hu, Xi Peng, Yuanbiao Gou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21038
Pdf URL: https://arxiv.org/pdf/2512.21038
Copy Paste: [[2512.21038]] Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising(https://arxiv.org/abs/2512.21038)
Keywords: self-supervised
Abstract: Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.

Title: DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

Authors: Kaustubh Kundu, Hrishav Bakul Barua, Lucy Robertson-Bell, Zhixi Cai, Kalin Stefanov
Subjects: cs.CV, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21054
Pdf URL: https://arxiv.org/pdf/2512.21054
Copy Paste: [[2512.21054]] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors(https://arxiv.org/abs/2512.21054)
Keywords: generative
Abstract: The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: this https URL.

Title: Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Authors: Minghao Han, YiChen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang, Xuecheng Wu, Dingkang Yang, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21058
Pdf URL: https://arxiv.org/pdf/2512.21058
Copy Paste: [[2512.21058]] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control(https://arxiv.org/abs/2512.21058)
Keywords: generative
Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.

Title: Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

Authors: Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21064
Pdf URL: https://arxiv.org/pdf/2512.21064
Copy Paste: [[2512.21064]] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition(https://arxiv.org/abs/2512.21064)
Keywords: self-supervised
Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.

Title: FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting

Authors: Chao Gong, Dong Li, Yingwei Pan, Jingjing Chen, Ting Yao, Tao Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21104
Pdf URL: https://arxiv.org/pdf/2512.21104
Copy Paste: [[2512.21104]] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting(https://arxiv.org/abs/2512.21104)
Keywords: diffusion
Abstract: Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

Title: STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting

Authors: Shi Quan Foo, Chi-Ho Wong, Zhihan Gao, Dit-Yan Yeung, Ka-Hing Wong, Wai-Kin Wong
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.21118
Pdf URL: https://arxiv.org/pdf/2512.21118
Copy Paste: [[2512.21118]] STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting(https://arxiv.org/abs/2512.21118)
Keywords: diffusion, generative
Abstract: Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the complex and stochastic nature of this task still poses challenges to existing approaches. Specifically, deterministic models tend to produce blurry predictions while generative models often struggle with poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, a diffusion-based model that learns the latent representation from end to end alongside both the Variational Autoencoder and the conditioning network. STLDM decomposes this task into two stages: a deterministic forecasting stage handled by the conditioning network, and an enhancement stage performed by the latent diffusion model. Experimental results on multiple radar datasets demonstrate that STLDM achieves superior performance compared to the state of the art, while also improving inference efficiency. The code is available in this https URL.

Title: A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation

Authors: Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21174
Pdf URL: https://arxiv.org/pdf/2512.21174
Copy Paste: [[2512.21174]] A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation(https://arxiv.org/abs/2512.21174)
Keywords: generative
Abstract: Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints-typically through instance-level or distribution-level loss functions-to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain's distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain. Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.

Title: UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.21185
Pdf URL: https://arxiv.org/pdf/2512.21185
Copy Paste: [[2512.21185]] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement(https://arxiv.org/abs/2512.21185)
Keywords: diffusion
Abstract: In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.

Title: SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

Authors: Mahi Luthra, Jiayi Shen, Maxime Poli, Angelo Ortiz, Yosuke Higuchi, Youssef Benchekroun, Martin Gleize, Charles-Eric Saint-James, Dongyan Lin, Phillip Rust, Angel Villar, Surya Parimi, Vanessa Stark, Rashel Moritz, Juan Pino, Yann LeCun, Emmanuel Dupoux
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.21204
Pdf URL: https://arxiv.org/pdf/2512.21204
Copy Paste: [[2512.21204]] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation(https://arxiv.org/abs/2512.21204)
Keywords: self-supervised
Abstract: Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at this https URL.

Title: AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI

Authors: Changwei Wu, Yifei Chen, Yuxin Du, Mingxuan Liu, Jinying Zong, Beining Wu, Jie Dong, Feiwei Qin, Yunkang Cao, Qiyuan Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21264
Pdf URL: https://arxiv.org/pdf/2512.21264
Copy Paste: [[2512.21264]] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI(https://arxiv.org/abs/2512.21264)
Keywords: anomaly
Abstract: Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at this https URL.

Title: ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21268
Pdf URL: https://arxiv.org/pdf/2512.21268
Copy Paste: [[2512.21268]] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision(https://arxiv.org/abs/2512.21268)
Keywords: diffusion
Abstract: Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

Title: GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21276
Pdf URL: https://arxiv.org/pdf/2512.21276
Copy Paste: [[2512.21276]] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation(https://arxiv.org/abs/2512.21276)
Keywords: diffusion, generative
Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.

Title: Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21284
Pdf URL: https://arxiv.org/pdf/2512.21284
Copy Paste: [[2512.21284]] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential(https://arxiv.org/abs/2512.21284)
Keywords: foundation model
Abstract: Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore the emerging SNN as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose \textit{SpikeSurgSeg}, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

Title: Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering

Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H.M. Abou-El-Enien
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.21301
Pdf URL: https://arxiv.org/pdf/2512.21301
Copy Paste: [[2512.21301]] Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering(https://arxiv.org/abs/2512.21301)
Keywords: generative
Abstract: Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study utilized Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. Then developed a novel, reaction-first evolutionary metaheuristic algorithm as well as multi-objective optimization programming that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient tailored leads, offering a scalable blueprint for precision oncology in AML and beyond

Title: TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis Maria Vakalopoulou, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21331
Pdf URL: https://arxiv.org/pdf/2512.21331
Copy Paste: [[2512.21331]] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning(https://arxiv.org/abs/2512.21331)
Keywords: foundation model
Abstract: The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for ''any'' application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from ''any'' tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

Title: Fast SAM2 with Text-Driven Token Pruning

Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21333
Pdf URL: https://arxiv.org/pdf/2512.21333
Copy Paste: [[2512.21333]] Fast SAM2 with Text-Driven Token Pruning(https://arxiv.org/abs/2512.21333)
Keywords: foundation model
Abstract: Segment Anything Model 2 (SAM2), a vision foundation model has significantly advanced in prompt-driven video object segmentation, yet their practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. The SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.

Title: Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.21336
Pdf URL: https://arxiv.org/pdf/2512.21336
Copy Paste: [[2512.21336]] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty(https://arxiv.org/abs/2512.21336)
Keywords: diffusion, generative
Abstract: Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.

Title: HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.21338
Pdf URL: https://arxiv.org/pdf/2512.21338
Copy Paste: [[2512.21338]] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming(https://arxiv.org/abs/2512.21338)
Keywords: diffusion
Abstract: High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.