2026-01-09

Title: STDD:Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models

Authors: Xinhao Sun, Maoliang Li, Zihao Zheng, Jiayu Chen, Hezhao Xu, Yun Liang, Xiang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04205
Pdf URL: https://arxiv.org/pdf/2601.04205
Copy Paste: [[2601.04205]] STDD:Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models(https://arxiv.org/abs/2601.04205)
Keywords: diffusion
Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low- priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spa- tial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical re- sults show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.

Title: Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

Authors: Seokhwan Ko, Donghyeon Lee, Jaewoo Chun, Hyungsoo Han, Junghwan Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04209
Pdf URL: https://arxiv.org/pdf/2601.04209
Copy Paste: [[2601.04209]] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments(https://arxiv.org/abs/2601.04209)
Keywords: generative
Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.

Title: LEGATO: Good Identity Unlearning Is Continuous

Authors: Qiang Chen, Chun-Wun Cheng, Xiu Su, Hongyan Xu, Xi Lin, Shan You, Angelica I. Aviles-Rivero, Yi Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04282
Pdf URL: https://arxiv.org/pdf/2601.04282
Copy Paste: [[2601.04282]] LEGATO: Good Identity Unlearning Is Continuous(https://arxiv.org/abs/2601.04282)
Keywords: generative
Abstract: Machine unlearning has become a crucial role in enabling generative models trained on large datasets to remove sensitive, private, or copyright-protected data. However, existing machine unlearning methods face three challenges in learning to forget identity of generative models: 1) inefficient, where identity erasure requires fine-tuning all the model's parameters; 2) limited controllability, where forgetting intensity cannot be controlled and explainability is lacking; 3) catastrophic collapse, where the model's retention capability undergoes drastic degradation as forgetting progresses. Forgetting has typically been handled through discrete and unstable updates, often requiring full-model fine-tuning and leading to catastrophic collapse. In this work, we argue that identity forgetting should be modeled as a continuous trajectory, and introduce LEGATO - Learn to ForgEt Identity in GenerAtive Models via Trajectory-consistent Neural Ordinary Differential Equations. LEGATO augments pre-trained generators with fine-tunable lightweight Neural ODE adapters, enabling smooth, controllable forgetting while keeping the original model weights frozen. This formulation allows forgetting intensity to be precisely modulated via ODE step size, offering interpretability and robustness. To further ensure stability, we introduce trajectory consistency constraints that explicitly prevent catastrophic collapse during unlearning. Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show that LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse and reduces fine-tuned parameters.

Title: Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

Authors: Chenye Meng, Zejian Li, Zhongni Liu, Yize Li, Changle Xie, Kaixin Jia, Ling Yang, Huanghuang Deng, Shiying Ding, Shengyuan Zhang, Jiayi Li, Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04300
Pdf URL: https://arxiv.org/pdf/2601.04300
Copy Paste: [[2601.04300]] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes(https://arxiv.org/abs/2601.04300)
Keywords: diffusion
Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge to an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of painting with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.

Title: Quantifying the Effect of Test Set Contamination on Generative Evaluations

Authors: Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Abhay Puri, Niloofar Mireshghallah, Sanmi Koyejo
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2601.04301
Pdf URL: https://arxiv.org/pdf/2601.04301
Copy Paste: [[2601.04301]] Quantifying the Effect of Test Set Contamination on Generative Evaluations(https://arxiv.org/abs/2601.04301)
Keywords: generative
Abstract: As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.

Title: Embedding Textual Information in Images Using Quinary Pixel Combinations

Authors: A V Uday Kiran Kandala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04302
Pdf URL: https://arxiv.org/pdf/2601.04302
Copy Paste: [[2601.04302]] Embedding Textual Information in Images Using Quinary Pixel Combinations(https://arxiv.org/abs/2601.04302)
Keywords: generative
Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB & MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform domain based methods, Quantization methods, Edge and Region based methods and more recently through deep learning methods and generative AI techniques for hiding textual information in spatial domain of images. Most of them are dependent on pixel intensity flipping over multiple pixels, such as LSB and combination of LSB based methodologies, and on transform coefficients, often resulting in the form of noise. Encoding and Decoding are deterministic in most of the existing approaches and are computationally heavy in case of higher models such as deep learning and gen AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled different pixel intensity variations in each of the R, G, and B channels formulate up to one hundred and twenty five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics such as MSE, MAE, SNR, PSNR, SSIM, Histogram Comparison and Heatmap analysis, were evaluated for both original and encoded images resulting in no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB and MSB based approaches that typically require multiple pixels or multi-step processes, as well as transform and learning based methods that incur higher computational overhead.

Title: ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Authors: Mohsen Ghafoorian, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04342
Pdf URL: https://arxiv.org/pdf/2601.04342
Copy Paste: [[2601.04342]] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers(https://arxiv.org/abs/2601.04342)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at this https URL.

Title: PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04359
Pdf URL: https://arxiv.org/pdf/2601.04359
Copy Paste: [[2601.04359]] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache(https://arxiv.org/abs/2601.04359)
Keywords: generative
Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, the final four frames - the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip - PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.

Title: Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection

Authors: Maxim Clouser, Kia Khezeli, John Kalantari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04381
Pdf URL: https://arxiv.org/pdf/2601.04381
Copy Paste: [[2601.04381]] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection(https://arxiv.org/abs/2601.04381)
Keywords: foundation model
Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.

Title: From Preoperative CT to Postmastoidectomy Mesh Construction:1Mastoidectomy Shape Prediction for Cochlear Implant Surgery

Authors: Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack Noble
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04405
Pdf URL: https://arxiv.org/pdf/2601.04405
Copy Paste: [[2601.04405]] From Preoperative CT to Postmastoidectomy Mesh Construction:1Mastoidectomy Shape Prediction for Cochlear Implant Surgery(https://arxiv.org/abs/2601.04405)
Keywords: self-supervised
Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work that integrating self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.

Title: CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction

Authors: Donghang Lyu, Marius Staring, Hildo Lamb, Mariya Doneva
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04428
Pdf URL: https://arxiv.org/pdf/2601.04428
Copy Paste: [[2601.04428]] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction(https://arxiv.org/abs/2601.04428)
Keywords: foundation model
Abstract: In recent years, deep learning has attracted increasing at- tention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clini- cal applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or nar- row subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that lever- ages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently out- performs baseline methods across a wide range of settings, highlighting its effectiveness and promise.

Title: UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04453
Pdf URL: https://arxiv.org/pdf/2601.04453
Copy Paste: [[2601.04453]] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving(https://arxiv.org/abs/2601.04453)
Keywords: generative
Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at this https URL .

Title: Meta-probabilistic Modeling

Authors: Kevin Zhang, Yixin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04462
Pdf URL: https://arxiv.org/pdf/2601.04462
Copy Paste: [[2601.04462]] Meta-probabilistic Modeling(https://arxiv.org/abs/2601.04462)
Keywords: generative
Abstract: While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.

Title: Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

Authors: Ignacio Sastre, Aiala Rosá
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04465
Pdf URL: https://arxiv.org/pdf/2601.04465
Copy Paste: [[2601.04465]] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions(https://arxiv.org/abs/2601.04465)
Keywords: in-context
Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.

Title: Surface-based Molecular Design with Multi-modal Flow Matching

Authors: Fang Wu, Zhengyuan Zhou, Shuting Jin, Xiangxiang Zeng, Jure Leskovec, Jinbo Xu
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2601.04506
Pdf URL: https://arxiv.org/pdf/2601.04506
Copy Paste: [[2601.04506]] Surface-based Molecular Design with Multi-modal Flow Matching(https://arxiv.org/abs/2601.04506)
Keywords: generative
Abstract: Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptides generation paradigm, called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.

Title: Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation

Authors: Xiaowei Mao, Huihu Ding, Yan Lin, Tingrui Wu, Shengnan Guo, Dazhuo Qiu, Feiling Fang, Jilin Hu, Huaiyu Wan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04572
Pdf URL: https://arxiv.org/pdf/2601.04572
Copy Paste: [[2601.04572]] Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation(https://arxiv.org/abs/2601.04572)
Keywords: diffusion, generative
Abstract: Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have demonstrated competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than closely following the conditional observations, resulting in suboptimal imputation performance. To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on the posterior likelihood approximations. The guidance scale is increased when generated values diverge from observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to observations varies across nodes and denoising steps, a global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes based on their attention scores, leveraging spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy.

Title: 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks

Authors: Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04588
Pdf URL: https://arxiv.org/pdf/2601.04588
Copy Paste: [[2601.04588]] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks(https://arxiv.org/abs/2601.04588)
Keywords: generative
Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the this http URL findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.

Title: On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions

Authors: Zhiyuan He, Binghan Chen, Tianxiang Xiong, Ziyang Sun, Mozhao Zhu, Xi Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04600
Pdf URL: https://arxiv.org/pdf/2601.04600
Copy Paste: [[2601.04600]] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions(https://arxiv.org/abs/2601.04600)
Keywords: in-context
Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.

Title: HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution

Authors: Yang Zou, Xingyue Zhu, Kaiqi Han, Jun Ma, Xingyuan Li, Zhiying Jiang, Jinyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04682
Pdf URL: https://arxiv.org/pdf/2601.04682
Copy Paste: [[2601.04682]] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution(https://arxiv.org/abs/2601.04682)
Keywords: diffusion
Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 X 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL

Title: Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?

Authors: Mir Rayat Imtiaz Hossain, Leo Feng, Leonid Sigal, Mohamed Osama Ahmed
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.04690
Pdf URL: https://arxiv.org/pdf/2601.04690
Copy Paste: [[2601.04690]] Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?(https://arxiv.org/abs/2601.04690)
Keywords: generative
Abstract: Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods often rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings. These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.

Title: See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

Authors: Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan, Animesh Mukherjee
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.04692
Pdf URL: https://arxiv.org/pdf/2601.04692
Copy Paste: [[2601.04692]] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation(https://arxiv.org/abs/2601.04692)
Keywords: generative
Abstract: In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic contents.

Title: Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Authors: Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04720
Pdf URL: https://arxiv.org/pdf/2601.04720
Copy Paste: [[2601.04720]] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking(https://arxiv.org/abs/2601.04720)
Keywords: foundation model
Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.

Title: Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition

Authors: Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04752
Pdf URL: https://arxiv.org/pdf/2601.04752
Copy Paste: [[2601.04752]] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition(https://arxiv.org/abs/2601.04752)
Keywords: foundation model
Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.

Title: CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

Authors: Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler, Yezhou Yang, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Florian Schiffers
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2601.04778
Pdf URL: https://arxiv.org/pdf/2601.04778
Copy Paste: [[2601.04778]] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models(https://arxiv.org/abs/2601.04778)
Keywords: diffusion
Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.

Title: Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers

Authors: Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04791
Pdf URL: https://arxiv.org/pdf/2601.04791
Copy Paste: [[2601.04791]] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers(https://arxiv.org/abs/2601.04791)
Keywords: diffusion, generative
Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.

Title: PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

Authors: Denis Korzhenkov, Adil Karjauv, Animesh Karnewar, Mohsen Ghafoorian, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04792
Pdf URL: https://arxiv.org/pdf/2601.04792
Copy Paste: [[2601.04792]] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference(https://arxiv.org/abs/2601.04792)
Keywords: diffusion
Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at this https URL.

Title: Detector-Augmented SAMURAI for Long-Duration Drone Tracking

Authors: Tamara R. Lenhard, Andreas Weinmann, Hichem Snoussi, Tobias Koch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04798
Pdf URL: https://arxiv.org/pdf/2601.04798
Copy Paste: [[2601.04798]] Detector-Augmented SAMURAI for Long-Duration Drone Tracking(https://arxiv.org/abs/2601.04798)
Keywords: foundation model
Abstract: Robust long-term tracking of drone is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.

Title: Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

Authors: Oshri Naparstek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04854
Pdf URL: https://arxiv.org/pdf/2601.04854
Copy Paste: [[2601.04854]] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics(https://arxiv.org/abs/2601.04854)
Keywords: diffusion
Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that \emph{mature} over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

Title: DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation

Authors: Ayush Pande
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04860
Pdf URL: https://arxiv.org/pdf/2601.04860
Copy Paste: [[2601.04860]] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation(https://arxiv.org/abs/2601.04860)
Keywords: foundation model
Abstract: Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.

Title: OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Authors: Minseong Kweon, Jinsun Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.04984
Pdf URL: https://arxiv.org/pdf/2601.04984
Copy Paste: [[2601.04984]] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction(https://arxiv.org/abs/2601.04984)
Keywords: self-supervised
Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.

Title: Patch-based Representation and Learning for Efficient Deformation Modeling

Authors: Ruochen Chen, Thuy Tran, Shaifali Parashar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05035
Pdf URL: https://arxiv.org/pdf/2601.05035
Copy Paste: [[2601.05035]] Patch-based Representation and Learning for Efficient Deformation Modeling(https://arxiv.org/abs/2601.05035)
Keywords: self-supervised
Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.

Title: DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Authors: Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.05052
Pdf URL: https://arxiv.org/pdf/2601.05052
Copy Paste: [[2601.05052]] DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights(https://arxiv.org/abs/2601.05052)
Keywords: diffusion, generative
Abstract: Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.

Title: SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment

Authors: Ziyang Chen, Zhenxuan Huang, Yile Wang, Weiqin Wang, Lu Yin, Hui Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05075
Pdf URL: https://arxiv.org/pdf/2601.05075
Copy Paste: [[2601.05075]] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment(https://arxiv.org/abs/2601.05075)
Keywords: generative
Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.

Title: UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition

Authors: Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic, Felix Heide
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05105
Pdf URL: https://arxiv.org/pdf/2601.05105
Copy Paste: [[2601.05105]] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition(https://arxiv.org/abs/2601.05105)
Keywords: foundation model
Abstract: Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside with a novel iterative update rule that enforces joint geometric-semantic consistency, and vice-versa detecting moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.

Title: Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05124
Pdf URL: https://arxiv.org/pdf/2601.05124
Copy Paste: [[2601.05124]] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing(https://arxiv.org/abs/2601.05124)
Keywords: in-context
Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

Title: VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Authors: Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05138
Pdf URL: https://arxiv.org/pdf/2601.05138
Copy Paste: [[2601.05138]] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control(https://arxiv.org/abs/2601.05138)
Keywords: diffusion
Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

Title: Atlas 2 - Foundation models for clinical deployment

Authors: Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05148
Pdf URL: https://arxiv.org/pdf/2601.05148
Copy Paste: [[2601.05148]] Atlas 2 - Foundation models for clinical deployment(https://arxiv.org/abs/2601.05148)
Keywords: foundation model
Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions Charité - Universtätsmedizin Berlin, LMU Munich, and Mayo Clinic.

Title: FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Authors: Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05212
Pdf URL: https://arxiv.org/pdf/2601.05212
Copy Paste: [[2601.05212]] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching(https://arxiv.org/abs/2601.05212)
Keywords: diffusion, generative
Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.

Title: Plenoptic Video Generation

Authors: Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05239
Pdf URL: https://arxiv.org/pdf/2601.05239
Copy Paste: [[2601.05239]] Plenoptic Video Generation(https://arxiv.org/abs/2601.05239)
Keywords: generative
Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: this https URL

Title: RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.05241
Pdf URL: https://arxiv.org/pdf/2601.05241
Copy Paste: [[2601.05241]] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation(https://arxiv.org/abs/2601.05241)
Keywords: diffusion
Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

Title: Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05246
Pdf URL: https://arxiv.org/pdf/2601.05246
Copy Paste: [[2601.05246]] Pixel-Perfect Visual Geometry Estimation(https://arxiv.org/abs/2601.05246)
Keywords: diffusion, foundation model, generative
Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

Title: Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

Authors: Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.05251
Pdf URL: https://arxiv.org/pdf/2601.05251
Copy Paste: [[2601.05251]] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video(https://arxiv.org/abs/2601.05251)
Keywords: diffusion
Abstract: We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.