2025-12-15

Title: MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax

Authors: Zhanpeng Chen, Weihao Gao, Shunyu Wang, Yanan Zhu, Hong Meng, Yuexian Zou
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.10991
Pdf URL: https://arxiv.org/pdf/2512.10991
Copy Paste: [[2512.10991]] MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax(https://arxiv.org/abs/2512.10991)
Keywords: diffusion, foundation model
Abstract: Generating precise 3D molecular geometries is crucial for drug discovery and material science. While prior efforts leverage 1D representations like SELFIES to ensure molecular validity, they fail to fully exploit the rich chemical knowledge entangled within 1D models, leading to a disconnect between 1D syntactic generation and 3D geometric realization. To bridge this gap, we propose MolSculpt, a novel framework that "sculpts" 3D molecular geometries from chemical syntax. MolSculpt is built upon a frozen 1D molecular foundation model and a 3D molecular diffusion model. We introduce a set of learnable queries to extract inherent chemical knowledge from the foundation model, and a trainable projector then injects this cross-modal information into the conditioning space of the diffusion model to guide the 3D geometry generation. In this way, our model deeply integrates 1D latent chemical knowledge into the 3D generation process through end-to-end optimization. Experiments demonstrate that MolSculpt achieves state-of-the-art (SOTA) performance in \textit{de novo} 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on both the GEOM-DRUGS and QM9 datasets. Code is available at this https URL.

Title: PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

Authors: Pawel Batorski, Paul Swoboda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11013
Pdf URL: https://arxiv.org/pdf/2512.11013
Copy Paste: [[2512.11013]] PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data(https://arxiv.org/abs/2512.11013)
Keywords: in-context
Abstract: LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. For faster execution, we use aggressive subsampling and a replay buffer for faster evaluations. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data efficient prompt engineering. Our code is available at this https URL.

Title: SoccerMaster: A Vision Foundation Model for Soccer Understanding

Authors: Haolin Yang, Jiayuan Rao, Haoning Wu, Weidi Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11016
Pdf URL: https://arxiv.org/pdf/2512.11016
Copy Paste: [[2512.11016]] SoccerMaster: A Vision Foundation Model for Soccer Understanding(https://arxiv.org/abs/2512.11016)
Keywords: foundation model
Abstract: Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

Title: VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Authors: Felix O'Mahony, Roberto Cipolla, Ayush Tewari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11061
Pdf URL: https://arxiv.org/pdf/2512.11061
Copy Paste: [[2512.11061]] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation(https://arxiv.org/abs/2512.11061)
Keywords: generative
Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.

Title: Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description

Authors: Nazanin Mahjourian, Vinh Nguyen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11098
Pdf URL: https://arxiv.org/pdf/2512.11098
Copy Paste: [[2512.11098]] Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description(https://arxiv.org/abs/2512.11098)
Keywords: foundation model
Abstract: Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.

Title: Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization

Authors: Brennan Flannery, Thomas DeSilvio, Jane Nguyen, Satish E. Viswanath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11104
Pdf URL: https://arxiv.org/pdf/2512.11104
Copy Paste: [[2512.11104]] Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization(https://arxiv.org/abs/2512.11104)
Keywords: foundation model
Abstract: Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic H&E whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs. Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.

Title: In-Context Multi-Objective Optimization

Authors: Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11114
Pdf URL: https://arxiv.org/pdf/2512.11114
Copy Paste: [[2512.11114]] In-Context Multi-Objective Optimization(https://arxiv.org/abs/2512.11114)
Keywords: in-context
Abstract: Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

Title: Learning from a Generative Oracle: Domain Adaptation for Restoration

Authors: Yuyang Hu, Mojtaba Sahraee-Ardakan, Arpit Bansal, Kangfu Mei, Christian Qi, Peyman Milanfar, Mauricio Delbracio
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11121
Pdf URL: https://arxiv.org/pdf/2512.11121
Copy Paste: [[2512.11121]] Learning from a Generative Oracle: Domain Adaptation for Restoration(https://arxiv.org/abs/2512.11121)
Keywords: generative
Abstract: Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.

Title: Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11130
Pdf URL: https://arxiv.org/pdf/2512.11130
Copy Paste: [[2512.11130]] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching(https://arxiv.org/abs/2512.11130)
Keywords: foundation model
Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: this https URL

Title: Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models

Authors: Divya Kothandaraman, Jaclyn Pytlarz
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.11194
Pdf URL: https://arxiv.org/pdf/2512.11194
Copy Paste: [[2512.11194]] Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models(https://arxiv.org/abs/2512.11194)
Keywords: diffusion, generative
Abstract: Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature's embedding space, thereby zeroing out its influence on the model's weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.

Title: CADKnitter: Compositional CAD Generation from Text and Geometry Guidance

Authors: Tri Le, Khang Nguyen, Baoru Huang, Tung D. Ta, Anh Nguyen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11199
Pdf URL: https://arxiv.org/pdf/2512.11199
Copy Paste: [[2512.11199]] CADKnitter: Compositional CAD Generation from Text and Geometry Guidance(https://arxiv.org/abs/2512.11199)
Keywords: diffusion
Abstract: Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has undergone a transformative impact, shifting not only from visual fidelity to functional utility but also enabling editable CAD designs. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, so-called KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Intensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.

Title: AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Authors: Zhengyang Yu, Akio Hayakawa, Masato Ishii, Qingtao Yu, Takashi Shibuya, Jing Zhang, Yuki Mitsufuji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11203
Pdf URL: https://arxiv.org/pdf/2512.11203
Copy Paste: [[2512.11203]] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path(https://arxiv.org/abs/2512.11203)
Keywords: diffusion
Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner-a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

Title: VFMF: World Modeling by Forecasting Vision Foundation Model Features

Authors: Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11225
Pdf URL: https://arxiv.org/pdf/2512.11225
Copy Paste: [[2512.11225]] VFMF: World Modeling by Forecasting Vision Foundation Model Features(https://arxiv.org/abs/2512.11225)
Keywords: diffusion, foundation model, generative
Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

Title: REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Authors: Haotian Wang, Yuzhe Weng, Xinyi Yu, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2512.11229
Pdf URL: https://arxiv.org/pdf/2512.11229
Copy Paste: [[2512.11229]] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation(https://arxiv.org/abs/2512.11229)
Keywords: diffusion
Abstract: Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles to key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

Title: WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

Authors: Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.11237
Pdf URL: https://arxiv.org/pdf/2512.11237
Copy Paste: [[2512.11237]] WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering(https://arxiv.org/abs/2512.11237)
Keywords: diffusion
Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior arts in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released \href{this https URL}{\textcolor{magenta}{here}}.

Title: PersonaLive! Expressive Portrait Image Animation for Live Streaming

Authors: Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11253
Pdf URL: https://arxiv.org/pdf/2512.11253
Copy Paste: [[2512.11253]] PersonaLive! Expressive Portrait Image Animation for Live Streaming(https://arxiv.org/abs/2512.11253)
Keywords: diffusion
Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

Title: A Simple Generalisation of the Implicit Dynamics of In-Context Learning

Authors: Francesco Innocenti, El Mehdi Achour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11255
Pdf URL: https://arxiv.org/pdf/2512.11255
Copy Paste: [[2512.11255]] A Simple Generalisation of the Implicit Dynamics of In-Context Learning(https://arxiv.org/abs/2512.11255)
Keywords: in-context
Abstract: In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models.

Title: FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Authors: Xiangyang Luo, Qingyu Li, Xiaokun Liu, Wenyu Qin, Miao Yang, Meng Wang, Pengfei Wan, Di Zhang, Kun Gai, Shao-Lun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11274
Pdf URL: https://arxiv.org/pdf/2512.11274
Copy Paste: [[2512.11274]] FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion(https://arxiv.org/abs/2512.11274)
Keywords: diffusion
Abstract: Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce \textbf{FilmWeaver}, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: this https URL

Title: RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection

Authors: Rongcheng Wu, Hao Zhu, Shiying Zhang, Mingzhe Wang, Zhidong Li, Hui Li, Jianlong Zhou, Jiangtao Cui, Fang Chen, Pingyang Sun, Qiyu Liao, Ye Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11284
Pdf URL: https://arxiv.org/pdf/2512.11284
Copy Paste: [[2512.11284]] RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection(https://arxiv.org/abs/2512.11284)
Keywords: diffusion, anomaly
Abstract: Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage this reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods, and achieves performance on par with recent diffusion models with only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.

Title: QGEC : Quantum Golay Code Error Correction

Authors: Hideo Mukai, Hoshitaro Ohnishi
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2512.11307
Pdf URL: https://arxiv.org/pdf/2512.11307
Copy Paste: [[2512.11307]] QGEC : Quantum Golay Code Error Correction(https://arxiv.org/abs/2512.11307)
Keywords: generative
Abstract: Quantum computers have the possibility of a much reduced calculation load compared with classical computers in specific problems. Quantum error correction (QEC) is vital for handling qubits, which are vulnerable to external noise. In QEC, actual errors are predicted from the results of syndrome measurements by stabilizer generators, in place of making direct measurements of the data qubits. Here, we propose Quantum Golay code Error Correction (QGEC), a QEC method using Golay code, which is an efficient coding method in classical information theory. We investigated our method's ability in decoding calculations with the Transformer. We evaluated the accuracy of the decoder in a code space defined by the generative polynomials with three different weights sets and three noise models with different correlations of bit-flip error and phase-flip error. Furthermore, under a noise model following a discrete uniform distribution, we compared the decoding performance of Transformer decoders with identical architectures trained respectively on Golay and toric codes. The results showed that the noise model with the smaller correlation gave better accuracy, while the weights of the generative polynomials had little effect on the accuracy of the decoder. In addition, they showed that Golay code requiring 23 data qubits and having a code distance of 7 achieved higher decoding accuracy than toric code which requiring 50 data qubits and having a code distance of 5. This suggests that implementing quantum error correction using a Transformer may enable the Golay code to realize fault-tolerant quantum computation more efficiently.

Title: Benchmarking the Generality of Vision-Language-Action Models

Authors: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11315
Pdf URL: https://arxiv.org/pdf/2512.11315
Copy Paste: [[2512.11315]] Benchmarking the Generality of Vision-Language-Action Models(https://arxiv.org/abs/2512.11315)
Keywords: foundation model
Abstract: Generalist multimodal agents are expected to unify perception, language, and control - operating robustly across diverse real world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross domain generality of vision language models (VLMs) and vision language action models (VLAs) across six foundational capability regimes. Visual grounding, spatial reasoning, tool use, physical commonsense, multi agent coordination, and continuous robot control. Evaluating GPT 5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross domain task shifts despite strong performance within their training this http URL failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain this http URL findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation this http URL v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist this http URL, data, and leaderboards are publicly available.

Title: Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits

Authors: Minwoo Park, Junwoo Chang, Jongeun Choi, Roberto Horowitz
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11345
Pdf URL: https://arxiv.org/pdf/2512.11345
Copy Paste: [[2512.11345]] Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits(https://arxiv.org/abs/2512.11345)
Keywords: diffusion, generative
Abstract: Equivariant diffusion policies (EDPs) combine the generative expressivity of diffusion models with the strong generalization and sample efficiency afforded by geometric symmetries. While steering these policies with reinforcement learning (RL) offers a promising mechanism for fine-tuning beyond demonstration data, directly applying standard (non-equivariant) RL can be sample-inefficient and unstable, as it ignores the symmetries that EDPs are designed to exploit. In this paper, we theoretically establish that the diffusion process of an EDP is equivariant, which in turn induces a group-invariant latent-noise MDP that is well-suited for equivariant diffusion steering. Building on this theory, we introduce a principled symmetry-aware steering framework and compare standard, equivariant, and approximately equivariant RL strategies through comprehensive experiments across tasks with varying degrees of symmetry. While we identify the practical boundaries of strict equivariance under symmetry breaking, we show that exploiting symmetry during the steering process yields substantial benefits-enhancing sample efficiency, preventing value divergence, and achieving strong policy improvements even when EDPs are trained from extremely limited demonstrations.

Title: Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection

Authors: Qishan Wang, Haofeng Wang, Shuyong Gao, Jia Guo, Li Xiong, Jiaqi Li, Dengxuan Bai, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11401
Pdf URL: https://arxiv.org/pdf/2512.11401
Copy Paste: [[2512.11401]] Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection(https://arxiv.org/abs/2512.11401)
Keywords: anomaly
Abstract: Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which transforms the reconstruction to repairation. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder's output. Second, we implement feature-level random masking to ensure that the representations from decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.

Title: Sliced ReLU attention: Quasi-linear contextual expressivity via sorting

Authors: Siwan Boufadène (LIGM), François-Xavier Vialard (LIGM)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11411
Pdf URL: https://arxiv.org/pdf/2512.11411
Copy Paste: [[2512.11411]] Sliced ReLU attention: Quasi-linear contextual expressivity via sorting(https://arxiv.org/abs/2512.11411)
Keywords: in-context
Abstract: We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and ReLU-based alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in O(n log(n)) through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small-scale experiments.

Title: JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Authors: Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11423
Pdf URL: https://arxiv.org/pdf/2512.11423
Copy Paste: [[2512.11423]] JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion(https://arxiv.org/abs/2512.11423)
Keywords: diffusion
Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

Title: Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Authors: Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11464
Pdf URL: https://arxiv.org/pdf/2512.11464
Copy Paste: [[2512.11464]] Exploring MLLM-Diffusion Information Transfer with MetaCanvas(https://arxiv.org/abs/2512.11464)
Keywords: diffusion, in-context
Abstract: Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

Title: DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation

Authors: Mohamed Abdelsamad, Michael Ulrich, Bin Yang, Miao Zhang, Yakov Miron, Abhinav Valada
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11465
Pdf URL: https://arxiv.org/pdf/2512.11465
Copy Paste: [[2512.11465]] DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation(https://arxiv.org/abs/2512.11465)
Keywords: self-supervised
Abstract: Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.

Title: CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Authors: Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11480
Pdf URL: https://arxiv.org/pdf/2512.11480
Copy Paste: [[2512.11480]] CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop(https://arxiv.org/abs/2512.11480)
Keywords: diffusion, foundation model
Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.

Title: Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning

Authors: Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11485
Pdf URL: https://arxiv.org/pdf/2512.11485
Copy Paste: [[2512.11485]] Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning(https://arxiv.org/abs/2512.11485)
Keywords: in-context
Abstract: Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1) - proving it's a strong training-free alternative for complex reasoning.

Title: VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Authors: Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2512.11490
Pdf URL: https://arxiv.org/pdf/2512.11490
Copy Paste: [[2512.11490]] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing(https://arxiv.org/abs/2512.11490)
Keywords: generative
Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.

Title: SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design

Authors: Mianjie Zheng, Xinquan Yang, Along He, Xuguang Li, Feilie Zhong, Xuefen Liu, Kun Tang, Zhicheng Zhang, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11507
Pdf URL: https://arxiv.org/pdf/2512.11507
Copy Paste: [[2512.11507]] SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design(https://arxiv.org/abs/2512.11507)
Keywords: self-supervised
Abstract: Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised assisted automatic abutment design framework (SS$A^3$D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SS$A^3$D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SS$A^3$D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.

Title: On Geometric Understanding and Learned Data Priors in VGGT

Authors: Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11508
Pdf URL: https://arxiv.org/pdf/2512.11508
Copy Paste: [[2512.11508]] On Geometric Understanding and Learned Data Priors in VGGT(https://arxiv.org/abs/2512.11508)
Keywords: foundation model
Abstract: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.

Title: NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics

Authors: Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Guangliang Liu, Yuxuan Liang, Xiaomeng Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11525
Pdf URL: https://arxiv.org/pdf/2512.11525
Copy Paste: [[2512.11525]] NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics(https://arxiv.org/abs/2512.11525)
Keywords: diffusion
Abstract: High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms key physical parameters (e.g., diffusion coefficients) into learnable parameters, enabling the model to autonomously optimize its physical core via end-to-end training. Concurrently, a deep neural network learns to correct for subgrid-scale processes and discretization errors not captured by the physics model. Both components work in synergy, with their outputs integrated by a unified ODE solver. Experiments demonstrate that NeuralOGCM maintains long-term stability and physical consistency, significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy. Our work paves a new path for building fast, stable, and physically-plausible models for scientific computing.

Title: xGR: Efficient Generative Recommendation Serving at Scale

Authors: Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Xinyu Liu, Ke Zhang, Depei Qian, Hailong Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11529
Pdf URL: https://arxiv.org/pdf/2512.11529
Copy Paste: [[2512.11529]] xGR: Efficient Generative Recommendation Serving at Scale(https://arxiv.org/abs/2512.11529)
Keywords: generative
Abstract: Recommendation system delivers substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompt while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under highconcurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x throughput compared to the state-of-the-art baseline under strict latency constraints.

Title: A Multi-Criteria Automated MLOps Pipeline for Cost-Effective Cloud-Based Classifier Retraining in Response to Data Distribution Shifts

Authors: Emmanuel K. Katalay, David O. Dimandja, Jordan F. Masakuna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11541
Pdf URL: https://arxiv.org/pdf/2512.11541
Copy Paste: [[2512.11541]] A Multi-Criteria Automated MLOps Pipeline for Cost-Effective Cloud-Based Classifier Retraining in Response to Data Distribution Shifts(https://arxiv.org/abs/2512.11541)
Keywords: anomaly
Abstract: The performance of machine learning (ML) models often deteriorates when the underlying data distribution changes over time, a phenomenon known as data distribution drift. When this happens, ML models need to be retrained and redeployed. ML Operations (MLOps) is often manual, i.e., humans trigger the process of model retraining and redeployment. In this work, we present an automated MLOps pipeline designed to address neural network classifier retraining in response to significant data distribution changes. Our MLOps pipeline employs multi-criteria statistical techniques to detect distribution shifts and triggers model updates only when necessary, ensuring computational efficiency and resource optimization. We demonstrate the effectiveness of our framework through experiments on several benchmark anomaly detection data sets, showing significant improvements in model accuracy and robustness compared to traditional retraining strategies. Our work provides a foundation for deploying more reliable and adaptive ML systems in dynamic real-world settings, where data distribution changes are common.

Title: Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Authors: Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11542
Pdf URL: https://arxiv.org/pdf/2512.11542
Copy Paste: [[2512.11542]] Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models(https://arxiv.org/abs/2512.11542)
Keywords: diffusion
Abstract: Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$\alpha$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$\alpha$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of the T2I model.

Title: SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2

Authors: Zhendi Gong, Xin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11548
Pdf URL: https://arxiv.org/pdf/2512.11548
Copy Paste: [[2512.11548]] SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2(https://arxiv.org/abs/2512.11548)
Keywords: foundation model
Abstract: Despite the success of deep learning based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly rely on large scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical applications. Semi-supervised learning (SSL) has been emerged as an appealing strategy in training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework SSL-MedSAM2, which contains a training-free few-shot learning branch TFFS-MedSAM2 based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch FSL-nnUNet based on nnUNet for pseudo label refinement. The results on MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate an outstanding performance of SSL-MedSAM2 among other methods. The average dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via this https URL.

Title: 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

Authors: Zhiguo Lu, Jianwen Lou, Mingjun Ma, Hairong Jin, Youyi Zheng, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11557
Pdf URL: https://arxiv.org/pdf/2512.11557
Copy Paste: [[2512.11557]] 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation(https://arxiv.org/abs/2512.11557)
Keywords: foundation model
Abstract: 3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.

Title: Fully Inductive Node Representation Learning via Graph View Transformation

Authors: Dooho Lee, Myeong Kong, Minho Jeong, Jaemin Yoo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11561
Pdf URL: https://arxiv.org/pdf/2512.11561
Copy Paste: [[2512.11561]] Fully Inductive Node Representation Learning via Graph View Transformation(https://arxiv.org/abs/2512.11561)
Keywords: foundation model
Abstract: Generalizing a pretrained model to unseen datasets without retraining is an essential step toward a foundation model. However, achieving such cross-dataset, fully inductive inference is difficult in graph-structured data where feature spaces vary widely in both dimensionality and semantics. Any transformation in the feature space can easily violate the inductive applicability to unseen datasets, strictly limiting the design space of a graph model. In this work, we introduce the view space, a novel representational axis in which arbitrary graphs can be naturally encoded in a unified manner. We then propose Graph View Transformation (GVT), a node- and feature-permutation-equivariant mapping in the view space. GVT serves as the building block for Recurrent GVT, a fully inductive model for node representation learning. Pretrained on OGBN-Arxiv and evaluated on 27 node-classification benchmarks, Recurrent GVT outperforms GraphAny, the prior fully inductive graph model, by +8.93% and surpasses 12 individually tuned GNNs by at least +3.30%. These results establish the view space as a principled and effective ground for fully inductive node representation learning.

Title: Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

Authors: Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11574
Pdf URL: https://arxiv.org/pdf/2512.11574
Copy Paste: [[2512.11574]] Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis(https://arxiv.org/abs/2512.11574)
Keywords: foundation model, in-context
Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images from objects in specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at this https URL .

Title: In-Context Learning for Seismic Data Processing

Authors: Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11575
Pdf URL: https://arxiv.org/pdf/2512.11575
Copy Paste: [[2512.11575]] In-Context Learning for Seismic Data Processing(https://arxiv.org/abs/2512.11575)
Keywords: in-context
Abstract: Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges, such as noisy data, and manual parameter tuning, among others. Recently deep learning approaches have proposed alternative solutions to some of these problems. However, important challenges of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and lack of user-control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.

Title: Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Authors: Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter
Subjects: cs.LG, cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.11582
Pdf URL: https://arxiv.org/pdf/2512.11582
Copy Paste: [[2512.11582]] Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model(https://arxiv.org/abs/2512.11582)
Keywords: self-supervised, foundation model
Abstract: The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

Title: Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling

Authors: Maik Dannecker, Steven Jia, Nil Stolt-Ansó, Nadine Girard, Guillaume Auzias, François Rousseau, Daniel Rueckert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11624
Pdf URL: https://arxiv.org/pdf/2512.11624
Copy Paste: [[2512.11624]] Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling(https://arxiv.org/abs/2512.11624)
Keywords: self-supervised
Abstract: Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a \textit{closed-form analytical solution} for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$--10$\times$ speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at {this https URL}.

Title: FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint

Authors: Jiapeng Tang, Kai Li, Chengxiang Yin, Liuhao Ge, Fei Jiang, Jiu Xu, Matthias Nießner, Christian Häne, Timur Bagautdinov, Egor Zakharov, Peihong Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11645
Pdf URL: https://arxiv.org/pdf/2512.11645
Copy Paste: [[2512.11645]] FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint(https://arxiv.org/abs/2512.11645)
Keywords: diffusion
Abstract: We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.

Title: Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Authors: Luca Cazzola, Ahed Alboody
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11654
Pdf URL: https://arxiv.org/pdf/2512.11654
Copy Paste: [[2512.11654]] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation(https://arxiv.org/abs/2512.11654)
Keywords: diffusion, generative
Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (this https URL).

Title: Bridging Streaming Continual Learning via In-Context Large Tabular Models

Authors: Afonso Lourenço, João Gama, Eric P. Xing, Goreti Marreiros
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11668
Pdf URL: https://arxiv.org/pdf/2512.11668
Copy Paste: [[2512.11668]] Bridging Streaming Continual Learning via In-Context Large Tabular Models(https://arxiv.org/abs/2512.11668)
Keywords: in-context
Abstract: In streaming scenarios, models must learn continuously, adapting to concept drifts without erasing previously acquired knowledge. However, existing research communities address these challenges in isolation. Continual Learning (CL) focuses on long-term retention and mitigating catastrophic forgetting, often without strict real-time constraints. Stream Learning (SL) emphasizes rapid, efficient adaptation to high-frequency data streams, but typically neglects forgetting. Recent efforts have tried to combine these paradigms, yet no clear algorithmic overlap exists. We argue that large in-context tabular models (LTMs) provide a natural bridge for Streaming Continual Learning (SCL). In our view, unbounded streams should be summarized on-the-fly into compact sketches that can be consumed by LTMs. This recovers the classical SL motivation of compressing massive streams with fixed-size guarantees, while simultaneously aligning with the experience-replay desiderata of CL. To clarify this bridge, we show how the SL and CL communities implicitly adopt a divide-to-conquer strategy to manage the tension between plasticity (performing well on the current distribution) and stability (retaining past knowledge), while also imposing a minimal complexity constraint that motivates diversification (avoiding redundancy in what is stored) and retrieval (re-prioritizing past information when needed). Within this perspective, we propose structuring SCL with LTMs around two core principles of data selection for in-context learning: (1) distribution matching, which balances plasticity and stability, and (2) distribution compression, which controls memory size through diversification and retrieval mechanisms.

Title: EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Authors: Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11715
Pdf URL: https://arxiv.org/pdf/2512.11715
Copy Paste: [[2512.11715]] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing(https://arxiv.org/abs/2512.11715)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Title: Referring Change Detection in Remote Sensing Imagery

Authors: Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11719
Pdf URL: https://arxiv.org/pdf/2512.11719
Copy Paste: [[2512.11719]] Referring Change Detection in Remote Sensing Imagery(https://arxiv.org/abs/2512.11719)
Keywords: diffusion
Abstract: Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) \textbf{RCDNet}, a cross-modal fusion network designed for referring change detection, and (II) \textbf{RCDGen}, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: this https URL.

Title: SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Authors: Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11749
Pdf URL: https://arxiv.org/pdf/2512.11749
Copy Paste: [[2512.11749]] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder(https://arxiv.org/abs/2512.11749)
Keywords: diffusion, self-supervised, foundation model, generative
Abstract: Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

Title: Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting

Authors: Mohammad Dehghanmanshadi, Wallapak Tavanapong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11763
Pdf URL: https://arxiv.org/pdf/2512.11763
Copy Paste: [[2512.11763]] Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting(https://arxiv.org/abs/2512.11763)
Keywords: diffusion
Abstract: Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37\% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52\% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: this https URL.

Title: Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Authors: Etienne Boursier, Claire Boyer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11784
Pdf URL: https://arxiv.org/pdf/2512.11784
Copy Paste: [[2512.11784]] Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective(https://arxiv.org/abs/2512.11784)
Keywords: in-context
Abstract: Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.

Title: Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11792
Pdf URL: https://arxiv.org/pdf/2512.11792
Copy Paste: [[2512.11792]] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation(https://arxiv.org/abs/2512.11792)
Keywords: diffusion
Abstract: Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60\% on VBench, 21-22\% lower FVD, and 71.4\% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51\%, surpassing REPA (92.91\%) by 2.60\%, and reduce FVD to 360.57, a 21.20\% and 22.46\% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at this https URL .