2026-01-21

Title: Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study

Authors: Arnav S. Sonavane
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11612
Pdf URL: https://arxiv.org/pdf/2601.11612
Copy Paste: [[2601.11612]] Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study(https://arxiv.org/abs/2601.11612)
Keywords: self-supervised
Abstract: We investigate the impact of domain-specific self-supervised pre-training on agricultural disease classification using hierarchical vision transformers. Our key finding is that SimCLR pre-training on just 3,000 unlabeled agricultural images provides a +4.57% accuracy improvement--exceeding the +3.70% gain from hierarchical architecture design. Critically, we show this SSL benefit is architecture-agnostic: applying the same pre-training to Swin-Base yields +4.08%, to ViT-Base +4.20%, confirming practitioners should prioritize domain data collection over architectural choices. Using HierarchicalViT (HVT), a Swin-style hierarchical transformer, we evaluate on three datasets: Cotton Leaf Disease (7 classes, 90.24%), PlantVillage (38 classes, 96.3%), and PlantDoc (27 classes, 87.1%). At matched parameter counts, HVT-Base (78M) achieves 88.91% vs. Swin-Base (88M) at 87.23%, a +1.68% improvement. For deployment reliability, we report calibration analysis showing HVT achieves 3.56% ECE (1.52% after temperature scaling). Code: this https URL

Title: Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning

Authors: Jason Qiu
Subjects: cs.CV, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2601.11614
Pdf URL: https://arxiv.org/pdf/2601.11614
Copy Paste: [[2601.11614]] Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning(https://arxiv.org/abs/2601.11614)
Keywords: diffusion, generative
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder in which pathological changes begin many years before the onset of clinical symptoms, making early detection essential for timely intervention. T1-weighted (T1w) Magnetic Resonance Imaging (MRI) is routinely used in clinical practice to identify macroscopic brain alterations, but these changes typically emerge relatively late in the disease course. Diffusion MRI (dMRI), in contrast, is sensitive to earlier microstructural abnormalities by probing water diffusion in brain tissue. dMRI metrics, including fractional anisotropy (FA) and mean diffusivity (MD), provide complementary information about white matter integrity and neurodegeneration. However, dMRI acquisitions are time-consuming and susceptible to motion artifacts, limiting their routine use in clinical populations. To bridge this gap, I propose a 3D TransUNet image synthesis framework that predicts FA and MD maps directly from T1w MRI. My model generates high-fidelity maps, achieving a structural similarity index (SSIM) exceeding 0.93 and a strong Pearson correlation (>0.94) with ground-truth dMRI. When integrated into a multi-modal diagnostic model, these synthetic features boost AD classification accuracy by 5% (78.75%->83.75%) and, most importantly, improve mild cognitive impairment (MCI) detection by 12.5%. This study demonstrates that high-quality diffusion microstructural information can be inferred from routinely acquired T1w MRI, effectively transferring the benefits of multi-modality imaging to settings where diffusion data are unavailable. By reducing scan time while preserving complementary structural and microstructural information, the proposed approach has the potential to improve the accessibility, efficiency, and accuracy of AD diagnosis in clinical practice.

Title: A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow

Authors: Haonan Wei, Linyuan Wang, Nuolin Sun, Zhizhong Zheng, Lei Li, Bin Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11630
Pdf URL: https://arxiv.org/pdf/2601.11630
Copy Paste: [[2601.11630]] A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow(https://arxiv.org/abs/2601.11630)
Keywords: diffusion
Abstract: Currently, Flow matching methods aim to compress the iterative generation process of diffusion models into a few or even a single step, with MeanFlow and FreeFlow being representative achievements of one-step generation based on Ordinary Differential Equations (ODEs). We observe that the 28-layer Transformer architecture of FreeFlow can be characterized as an Euler discretization scheme for an ODE along the depth axis, where the layer index serves as the discrete time step. Therefore, we distill the number of layers of the FreeFlow model, following the same derivation logic as FreeFlow, and propose SLT (Single-Layer Transformer), which uses a single shared DiT block to approximate the depth-wise feature evolution of the 28-layer teacher. During training, it matches the teacher's intermediate features at several depth patches, fuses those patch-level representations, and simultaneously aligns the teacher's final velocity prediction. Through distillation training, we compress the 28 independent Transformer Blocks of the teacher model DiT-XL/2 into a single Transformer Block, reducing the parameter count from 675M to 4.3M. Furthermore, leveraging its minimal parameters and rapid sampling speed, SLT can screen more candidate points in the noise space within the same timeframe, thereby selecting higher-quality initial points for the teacher model FreeFlow and ultimately enhancing the quality of generated images. Experimental results demonstrate that within a time budget comparable to two random samplings of the teacher model, our method performs over 100 noise screenings and produces a high-quality sample through the teacher model using the selected points. Quality fluctuations caused by low-quality initial noise under a limited number of FreeFlow sampling calls are effectively avoided, substantially improving the stability and average generation quality of one-step generation.

Title: Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos

Authors: Anil Egin, Andrea Tangherloni, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11635
Pdf URL: https://arxiv.org/pdf/2601.11635
Copy Paste: [[2601.11635]] Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos(https://arxiv.org/abs/2601.11635)
Keywords: diffusion, generative
Abstract: Face video anonymization is aimed at privacy preservation while allowing for the analysis of videos in a number of computer vision downstream tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate deidentified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of AnonNET in obfuscating identity while retaining visual realism and temporal consistency. The code of AnonNet will be publicly released.

Title: Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics

Authors: Aradhya Dixit
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11637
Pdf URL: https://arxiv.org/pdf/2601.11637
Copy Paste: [[2601.11637]] Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics(https://arxiv.org/abs/2601.11637)
Keywords: foundation model
Abstract: Recent progress in multimodal foundation models has enabled Vision-Language Agents (VLAs) to decompose complex visual tasks into executable tool-based plans. While recent benchmarks have begun to evaluate iterative self-correction, its quantitative limits and dominant reasoning bottlenecks remain poorly characterized. This work introduces a Diagnostic Micro-Benchmark. Our analysis decouples Task Success Rate (TSR = 62 percent) from Correction Success Rate (CSR = 25 to 33 percent), revealing that initial competence does not predict repair ability. We explicitly quantify the diminishing returns of correction, which saturates after three retries. Our Failure Taxonomy reveals a frequent factor is Semantic Drift (about 28 percent of failures), a loss of contextual state. By isolating this reasoning bottleneck, this benchmark defines a reproducible framework toward stateful, trustworthy multimodal agents.

Title: Global Optimization By Gradient from Hierarchical Score-Matching Spaces

Authors: Ming Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.11639
Pdf URL: https://arxiv.org/pdf/2601.11639
Copy Paste: [[2601.11639]] Global Optimization By Gradient from Hierarchical Score-Matching Spaces(https://arxiv.org/abs/2601.11639)
Keywords: diffusion, generative
Abstract: Gradient descent is the most commonly used optimization method, but limited to local optimality, and confined to the field of continuous differentiable problems with simple convex constraints. This work solve these limitations and restrictions by unifying all optimization problems with various complex constraints as a general hierarchical optimization objective without constraints, which is optimized by gradient obtained through score matching. By this way, global optimization by deterministic method using strict gradient is achieved for the first time, and verified through simple-constructed and complex-practical experiments. Even more importantly, it reveals the profound connection between global optimization and diffusion based generative modeling.

Title: Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

Authors: Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11641
Pdf URL: https://arxiv.org/pdf/2601.11641
Copy Paste: [[2601.11641]] Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers(https://arxiv.org/abs/2601.11641)
Keywords: diffusion
Abstract: While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a \underline{\textbf{M}}ixtrue-\underline{\textbf{O}}f-\underline{\textbf{D}}istribution \textbf{DiT} (\textbf{MOD-DiT}), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a {distributed mixing approach} to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

Title: Predicting When to Trust Vision-Language Models for Spatial Reasoning

Authors: Muhammad Imran, Yugyung Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11644
Pdf URL: https://arxiv.org/pdf/2601.11644
Copy Paste: [[2601.11644]] Predicting When to Trust Vision-Language Models for Spatial Reasoning(https://arxiv.org/abs/2601.11644)
Keywords: generative
Abstract: Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.

Title: Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification

Authors: Miriam Doh, Aditya Gulati, Corina Canali, Nuria Oliver
Subjects: cs.CV, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.11651
Pdf URL: https://arxiv.org/pdf/2601.11651
Copy Paste: [[2601.11651]] Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification(https://arxiv.org/abs/2601.11651)
Keywords: diffusion, generative
Abstract: This paper examines algorithmic lookism-the systematic preferential treatment based on physical appearance-in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women's faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men's; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems-not from empirical truths or the authors' views.

Title: UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AEC+FM

Authors: Amir Farzin Nikkhah, Dong Chen, Bradford Campbell, Somayeh Asadi, Arsalan Heydarian
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2601.11665
Pdf URL: https://arxiv.org/pdf/2601.11665
Copy Paste: [[2601.11665]] UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AEC+FM(https://arxiv.org/abs/2601.11665)
Keywords: anomaly
Abstract: Unmanned Aerial Vehicles (UAVs) are transforming infrastructure inspections in the Architecture, Engineering, Construction, and Facility Management (AEC+FM) domain. By synthesizing insights from over 150 studies, this review paper highlights UAV-based methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making support. Key innovations include path optimization, thermal integration, and advanced machine learning (ML) models such as YOLO and Faster R-CNN for anomaly detection. UAVs have demonstrated value in structural health monitoring (SHM), disaster response, urban infrastructure management, energy efficiency evaluations, and cultural heritage preservation. Despite these advancements, challenges in real-time processing, multimodal data fusion, and generalizability remain. A proposed workflow framework, informed by literature and a case study, integrates RGB imagery, LiDAR, and thermal sensing with transformer-based architectures to improve accuracy and reliability in detecting structural defects, thermal anomalies, and geometric inconsistencies. The proposed framework ensures precise and actionable insights by fusing multimodal data and dynamically adapting path planning for complex environments, presented as a comprehensive step-by-step guide to address these challenges effectively. This paper concludes with future research directions emphasizing lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections.

Title: Generating metamers of human scene understanding

Authors: Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras, Gregory J. Zelinsky
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11675
Pdf URL: https://arxiv.org/pdf/2601.11675
Copy Paste: [[2601.11675]] Generating metamers of human scene understanding(https://arxiv.org/abs/2601.11675)
Keywords: diffusion
Abstract: Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a "same" or "different" response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.

Title: Attesting Model Lineage by Consisted Knowledge Evolution with Fine-Tuning Trajectory

Authors: Zhuoyi Shang, Jiasen Li, Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Weiping Wang
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2601.11683
Pdf URL: https://arxiv.org/pdf/2601.11683
Copy Paste: [[2601.11683]] Attesting Model Lineage by Consisted Knowledge Evolution with Fine-Tuning Trajectory(https://arxiv.org/abs/2601.11683)
Keywords: diffusion
Abstract: The fine-tuning technique in deep learning gives rise to an emerging lineage relationship among models. This lineage provides a promising perspective for addressing security concerns such as unauthorized model redistribution and false claim of model provenance, which are particularly pressing in \textcolor{blue}{open-weight model} libraries where robust lineage verification mechanisms are often lacking. Existing approaches to model lineage detection primarily rely on static architectural similarities, which are insufficient to capture the dynamic evolution of knowledge that underlies true lineage relationships. Drawing inspiration from the genetic mechanism of human evolution, we tackle the problem of model lineage attestation by verifying the joint trajectory of knowledge evolution and parameter modification. To this end, we propose a novel model lineage attestation framework. In our framework, model editing is first leveraged to quantify parameter-level changes introduced by fine-tuning. Subsequently, we introduce a novel knowledge vectorization mechanism that refines the evolved knowledge within the edited models into compact representations by the assistance of probe samples. The probing strategies are adapted to different types of model families. These embeddings serve as the foundation for verifying the arithmetic consistency of knowledge relationships across models, thereby enabling robust attestation of model lineage. Extensive experimental evaluations demonstrate the effectiveness and resilience of our approach in a variety of adversarial scenarios in the real world. Our method consistently achieves reliable lineage verification across a broad spectrum of model types, including classifiers, diffusion models, and large language models.

Title: Telling Human and Machine Handwriting Apart

Authors: Luis A. Leiva, Moises Diaz, Nuwan T. Attygalle, Miguel A. Ferrer, Rejean Plamondon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11700
Pdf URL: https://arxiv.org/pdf/2601.11700
Copy Paste: [[2601.11700]] Telling Human and Machine Handwriting Apart(https://arxiv.org/abs/2601.11700)
Keywords: diffusion, generative
Abstract: Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier achieves such an excellent performance when trained on just 10 percent of the data, as evaluated on the remaining 90% of the data as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.

Title: jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation

Authors: Ho Fung Tsoi, Dylan Rankin
Subjects: cs.LG, hep-ex
Abstract URL: https://arxiv.org/abs/2601.11719
Pdf URL: https://arxiv.org/pdf/2601.11719
Copy Paste: [[2601.11719]] jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation(https://arxiv.org/abs/2601.11719)
Keywords: self-supervised, anomaly
Abstract: Self-supervised learning is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.

Title: SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models

Authors: Turhan Can Kargin, Wojciech Jasiński, Adam Pardyl, Bartosz Zieliński, Marcin Przewięźlikowski
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11729
Pdf URL: https://arxiv.org/pdf/2601.11729
Copy Paste: [[2601.11729]] SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models(https://arxiv.org/abs/2601.11729)
Keywords: foundation model
Abstract: Visual Foundation Models (VFMs), such as DINO and CLIP, excel in semantic understanding of images but exhibit limited spatial reasoning capabilities, which limits their applicability to embodied systems. As a result, recent work incorporates some 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models truly have spatial awareness or overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify relative positions of objects in the image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa generates an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities between their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.

Title: LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text

Authors: George Mihaila, Suleyman Olcay Polat, Poli Nemkova, Himanshu Sharma, Namratha V. Urs, Mark V. Albert
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.11746
Pdf URL: https://arxiv.org/pdf/2601.11746
Copy Paste: [[2601.11746]] LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text(https://arxiv.org/abs/2601.11746)
Keywords: generative
Abstract: Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.

Title: studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting

Authors: Yimu Pan, Hongda Mao, Qingshuang Chen, Yelin Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11772
Pdf URL: https://arxiv.org/pdf/2601.11772
Copy Paste: [[2601.11772]] studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting(https://arxiv.org/abs/2601.11772)
Keywords: self-supervised
Abstract: Recent advance in feed-forward 3D Gaussian splatting has enable remarkable multi-view 3D scene reconstruction or single-view 3D object reconstruction but single-view 3D scene reconstruction remain under-explored due to inherited ambiguity in single-view. We present \textbf{studentSplat}, a single-view 3D Gaussian splatting method for scene reconstruction. To overcome the scale ambiguity and extrapolation problems inherent in novel-view supervision from a single input, we introduce two techniques: 1) a teacher-student architecture where a multi-view teacher model provides geometric supervision to the single-view student during training, addressing scale ambiguity and encourage geometric validity; and 2) an extrapolation network that completes missing scene context, enabling high-quality extrapolation. Extensive experiments show studentSplat achieves state-of-the-art single-view novel-view reconstruction quality and comparable performance to multi-view methods at the scene level. Furthermore, studentSplat demonstrates competitive performance as a self-supervised single-view depth estimation method, highlighting its potential for general single-view 3D understanding tasks.

Title: Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

Authors: Kaituo Zhang, Zhimeng Jiang, Na Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11776
Pdf URL: https://arxiv.org/pdf/2601.11776
Copy Paste: [[2601.11776]] Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models(https://arxiv.org/abs/2601.11776)
Keywords: generative
Abstract: Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention --factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect, correct toxic content, and refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector --an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.

Title: Shapelets-Enriched Selective Forecasting using Time Series Foundation Models

Authors: Shivani Tomar, Seshu Tirupathi, Elizabeth Daly, Ivana Dusparic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.11821
Pdf URL: https://arxiv.org/pdf/2601.11821
Copy Paste: [[2601.11821]] Shapelets-Enriched Selective Forecasting using Time Series Foundation Models(https://arxiv.org/abs/2601.11821)
Keywords: foundation model
Abstract: Time series foundation models have recently gained a lot of attention due to their ability to model complex time series data encompassing different domains including traffic, energy, and weather. Although they exhibit strong average zero-shot performance on forecasting tasks, their predictions on certain critical regions of the data are not always reliable, limiting their usability in real-world applications, especially when data exhibits unique trends. In this paper, we propose a selective forecasting framework to identify these critical segments of time series using shapelets. We learn shapelets using shift-invariant dictionary learning on the validation split of the target domain dataset. Utilizing distance-based similarity to these shapelets, we facilitate the user to selectively discard unreliable predictions and be informed of the model's realistic capabilities. Empirical results on diverse benchmark time series datasets demonstrate that our approach leveraging both zero-shot and full-shot fine-tuned models reduces the overall error by an average of 22.17% for zero-shot and 22.62% for full-shot fine-tuned model. Furthermore, our approach using zero-shot and full-shot fine-tuned models, also outperforms its random selection counterparts by up to 21.41% and 21.43% on one of the datasets.

Title: MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization

Authors: Andrea Rubbi, Amir Akbarnejad, Mohammad Vali Sanian, Aryan Yazdan Parast, Hesam Asadollahzadeh, Arian Amani, Naveed Akhtar, Sarah Cooper, Andrew Bassett, Pietro Liò, Lassi Paavolainen, Sattar Vakili, Mo Lotfollahi
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2601.11827
Pdf URL: https://arxiv.org/pdf/2601.11827
Copy Paste: [[2601.11827]] MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization(https://arxiv.org/abs/2601.11827)
Keywords: generative
Abstract: Achieving robust generalization under distribution shift remains a central challenge in conditional generative modeling, as existing conditional flow-based methods often struggle to extrapolate beyond the training conditions. We introduce MixFlow, a conditional flow-matching framework for descriptor-controlled generation that directly targets this limitation by jointly learning a descriptor-conditioned base distribution and a descriptor-conditioned flow field via shortest-path flow matching. By modeling the base distribution as a learnable, descriptor-dependent mixture, MixFlow enables smooth interpolation and extrapolation to unseen conditions, leading to substantially improved out-of-distribution generalization. We provide analytical insights into the behavior of the proposed framework and empirically demonstrate its effectiveness across multiple domains, including prediction of responses to unseen perturbations in single-cell transcriptomic data and high-content microscopy-based drug screening tasks. Across these diverse settings, MixFlow consistently outperforms standard conditional flow-matching baselines. Overall, MixFlow offers a simple yet powerful approach for achieving robust, generalizable, and controllable generative modeling across heterogeneous domains.

Title: TF-CoDiT: Conditional Time Series Synthesis with Diffusion Transformers for Treasury Futures

Authors: Yingxiao Zhang, Jiaxin Duan, Junfu Zhang, Ke Feng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11880
Pdf URL: https://arxiv.org/pdf/2601.11880
Copy Paste: [[2601.11880]] TF-CoDiT: Conditional Time Series Synthesis with Diffusion Transformers for Treasury Futures(https://arxiv.org/abs/2601.11880)
Keywords: diffusion
Abstract: Diffusion Transformers (DiT) have achieved milestones in synthesizing financial time-series data, such as stock prices and order flows. However, their performance in synthesizing treasury futures data is still underexplored. This work emphasizes the characteristics of treasury futures data, including its low volume, market dependencies, and the grouped correlations among multivariables. To overcome these challenges, we propose TF-CoDiT, the first DiT framework for language-controlled treasury futures synthesis. To facilitate low-data learning, TF-CoDiT adapts the standard DiT by transforming multi-channel 1-D time series into Discrete Wavelet Transform (DWT) coefficient matrices. A U-shape VAE is proposed to encode cross-channel dependencies hierarchically into a latent variable and bridge the latent and DWT spaces through decoding, thereby enabling latent diffusion generation. To derive prompts that cover essential conditions, we introduce the Financial Market Attribute Protocol (FinMAP) - a multi-level description system that standardizes daily$/$periodical market dynamics by recognizing 17$/$23 economic indicators from 7/8 perspectives. In our experiments, we gather four types of treasury futures data covering the period from 2015 to 2025, and define data synthesis tasks with durations ranging from one week to four months. Extensive evaluations demonstrate that TF-CoDiT can produce highly authentic data with errors at most 0.433 (MSE) and 0.453 (MAE) to the ground-truth. Further studies evidence the robustness of TF-CoDiT across contracts and temporal horizons.

Title: RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection

Authors: Yilmaz Korkmaz, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11898
Pdf URL: https://arxiv.org/pdf/2601.11898
Copy Paste: [[2601.11898]] RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection(https://arxiv.org/abs/2601.11898)
Keywords: diffusion
Abstract: Remote sensing change detection aims to localize and characterize scene changes between two time points and is central to applications such as environmental monitoring and disaster assessment. Meanwhile, visual autoregressive models (VARs) have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias. We introduce RemoteVAR, a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction. Extensive experiments on standard change detection benchmarks show that RemoteVAR delivers consistent and significant improvements over strong diffusion-based and transformer-based baselines, establishing a competitive autoregressive alternative for remote sensing change detection. Code will be available \href{this https URL}{\underline{here}}.

Title: A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection

Authors: Guiying Zhu, Bowen Yang, Yin Zhuang, Tong Zhang, Guanqun Wang, Zhihao Che, He Chen, Lianlin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11910
Pdf URL: https://arxiv.org/pdf/2601.11910
Copy Paste: [[2601.11910]] A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection(https://arxiv.org/abs/2601.11910)
Keywords: foundation model
Abstract: Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriads of large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Wherein, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, the extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to the-state-of-the-art methods without any training step.

Title: Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

Authors: Haonan An, Guang Hua, Wei Du, Hangcheng Cao, Yihang Tao, Guowen Xu, Susanto Rahardja, Yuguang Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.11952
Pdf URL: https://arxiv.org/pdf/2601.11952
Copy Paste: [[2601.11952]] Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal(https://arxiv.org/abs/2601.11952)
Keywords: generative
Abstract: Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on the encoders for the robustness to resist various attacks, the decoders have been largely overlooked, leading to attacks against the watermark. In this paper, we identify one such attack against the decoder, where query responses are utilized to obtain backpropagated gradients to train a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms, including DGS at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS. Leveraging the joint design of reorienting and rescaling of the gradients from watermark channel gradient leaking queries, the proposed DGSs effectively prevent the watermark remover from achieving training convergence to the desired low-loss value, while preserving image quality of the decoder output. We demonstrate the effectiveness of our proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with the state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.

Title: Hybrid IDS Using Signature-Based and Anomaly-Based Detection

Authors: Messaouda Boutassetta, Amina Makhlouf, Newfel Messaoudi, Abdelmadjid Benmachiche, Ines Boutabia
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2601.11998
Pdf URL: https://arxiv.org/pdf/2601.11998
Copy Paste: [[2601.11998]] Hybrid IDS Using Signature-Based and Anomaly-Based Detection(https://arxiv.org/abs/2601.11998)
Keywords: anomaly
Abstract: Intrusion detection systems (IDS) are essential for protecting computer systems and networks against a wide range of cyber threats that continue to evolve over time. IDS are commonly categorized into two main types, each with its own strengths and limitations, such as difficulty in detecting previously unseen attacks and the tendency to generate high false positive rates. This paper presents a comprehensive survey and a conceptual overview of Hybrid IDS, which integrate signature-based and anomaly-based detection techniques to enhance attack detection capabilities. The survey examines recent research on Hybrid IDS, classifies existing models into functional categories, and discusses their advantages, limitations, and application domains, including financial systems, air traffic control, and social networks. In addition, recent trends in Hybrid IDS research, such as machine learning-based approaches and cloud-based deployments, are reviewed. Finally, this work outlines potential future research directions aimed at developing more cost-effective Hybrid IDS solutions with improved ability to detect emerging and sophisticated cyberattacks.

Title: DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering

Authors: Guillermo Figueroa-Araneda, Iris Diana Jimenez, Florian Hofherr, Manny Ko, Hector Andrade-Loarca, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12020
Pdf URL: https://arxiv.org/pdf/2601.12020
Copy Paste: [[2601.12020]] DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering(https://arxiv.org/abs/2601.12020)
Keywords: diffusion
Abstract: Subsurface scattering (SSS) gives translucent materials -- such as wax, jade, marble, and skin -- their characteristic soft shadows, color bleeding, and diffuse glow. Modeling these effects in neural rendering remains challenging due to complex light transport and the need for densely captured multi-view, multi-light datasets (often more than 100 views and 112 OLATs). We present DIAMOND-SSS, a data-efficient framework for high-fidelity translucent reconstruction from extremely sparse supervision -- even as few as ten images. We fine-tune diffusion models for novel-view synthesis and relighting, conditioned on estimated geometry and trained on less than 7 percent of the dataset, producing photorealistic augmentations that can replace up to 95 percent of missing captures. To stabilize reconstruction under sparse or synthetic supervision, we introduce illumination-independent geometric priors: a multi-view silhouette consistency loss and a multi-view depth consistency loss. Across all sparsity regimes, DIAMOND-SSS achieves state-of-the-art quality in relightable Gaussian rendering, reducing real capture requirements by up to 90 percent compared to SSS-3DGS.

Title: Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs

Authors: Ziyi Zhao, Chongming Gao, Yang Zhang, Haoyan Liu, Weinan Gan, Huifeng Guo, Yong Liu, Fuli Feng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.12034
Pdf URL: https://arxiv.org/pdf/2601.12034
Copy Paste: [[2601.12034]] Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs(https://arxiv.org/abs/2601.12034)
Keywords: foundation model
Abstract: Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.

Title: Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification

Authors: Xiaomei Yang, Xizhan Gao, Antai Liu, Kang Wei, Fa Zhu, Guang Feng, Xiaofeng Qu, Sijie Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12062
Pdf URL: https://arxiv.org/pdf/2601.12062
Copy Paste: [[2601.12062]] Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2601.12062)
Keywords: diffusion
Abstract: The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes spatial-temporal feature learning (STFL) module, semantic diffusion (SD) module and cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over AOTA methods.

Title: Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Authors: Zijie Lou, Xiangwei Feng, Jiaxin Wang, Xiaochao Qu, Luoqi Liu, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12066
Pdf URL: https://arxiv.org/pdf/2601.12066
Copy Paste: [[2601.12066]] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation(https://arxiv.org/abs/2601.12066)
Keywords: diffusion, generative
Abstract: Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.

Title: ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification

Authors: VSS Tejaswi Abburi, Ananya Singhal, Saurabh J. Shigwan, Nitin Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12067
Pdf URL: https://arxiv.org/pdf/2601.12067
Copy Paste: [[2601.12067]] ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification(https://arxiv.org/abs/2601.12067)
Keywords: generative
Abstract: Early detection of neurodegenerative diseases such as Alzheimer's Disease (AD) and Frontotemporal Dementia (FTD) is essential for reducing the risk of progression to severe disease stages. As AD and FTD propagate along white-matter regions in a global, graph-dependent manner, graph-based neural networks are well suited to capture these patterns. Hence, we introduce ARMARecon, a unified graph learning framework that integrates Autoregressive Moving Average (ARMA) graph filtering with a reconstruction-driven objective to enhance feature representation and improve classification accuracy. ARMARecon effectively models both local and global connectivity by leveraging 20-bin Fractional Anisotropy (FA) histogram features extracted from white-matter regions, while mitigating over-smoothing. Overall, ARMARecon achieves superior performance compared to state-of-the-art methods on the multi-site dMRI datasets ADNI and NIFD.

Title: Conditional Random Fields for Interactive Refinement of Histopathological Predictions

Authors: Tiffanie Godelaine, Maxime Zanella, Karim El Khoury, Saïd Mahmoudi, Benoît Macq, Christophe De Vleeschouwer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12082
Pdf URL: https://arxiv.org/pdf/2601.12082
Copy Paste: [[2601.12082]] Conditional Random Fields for Interactive Refinement of Histopathological Predictions(https://arxiv.org/abs/2601.12082)
Keywords: foundation model
Abstract: Assisting pathologists in the analysis of histopathological images has high clinical value, as it supports cancer detection and staging. In this context, histology foundation models have recently emerged. Among them, Vision-Language Models (VLMs) provide strong yet imperfect zero-shot predictions. We propose to refine these predictions by adapting Conditional Random Fields (CRFs) to histopathological applications, requiring no additional model training. We present HistoCRF, a CRF-based framework, with a novel definition of the pairwise potential that promotes label diversity and leverages expert annotations. We consider three experiments: without annotations, with expert annotations, and with iterative human-in-the-loop annotations that progressively correct misclassified patches. Experiments on five patch-level classification datasets covering different organs and diseases demonstrate average accuracy gains of 16.0% without annotations and 27.5% with only 100 annotations, compared to zero-shot predictions. Moreover, integrating a human in the loop reaches a further gain of 32.6% with the same number of annotations. The code will be made available on this https URL.

Title: Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models

Authors: Siru Zhong, Junjie Qiu, Yangyu Wu, Yiqiu Liu, Yuanpeng He, Zhongwen Rao, Bin Yang, Chenjuan Guo, Hao Xu, Yuxuan Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.12083
Pdf URL: https://arxiv.org/pdf/2601.12083
Copy Paste: [[2601.12083]] Learning to Factorize and Adapt: A Versatile Approach Toward Universal Spatio-Temporal Foundation Models(https://arxiv.org/abs/2601.12083)
Keywords: foundation model
Abstract: Spatio-Temporal (ST) Foundation Models (STFMs) promise cross-dataset generalization, yet joint ST pretraining is computationally expensive and grapples with the heterogeneity of domain-specific spatial patterns. Substantially extending our preliminary conference version, we present FactoST-v2, an enhanced factorized framework redesigned for full weight transfer and arbitrary-length generalization. FactoST-v2 decouples universal temporal learning from domain-specific spatial adaptation. The first stage pretrains a minimalist encoder-only backbone using randomized sequence masking to capture invariant temporal dynamics, enabling probabilistic quantile prediction across variable horizons. The second stage employs a streamlined adapter to rapidly inject spatial awareness via meta adaptive learning and prompting. Comprehensive evaluations across diverse domains demonstrate that FactoST-v2 achieves state-of-the-art accuracy with linear efficiency - significantly outperforming existing foundation models in zero-shot and few-shot scenarios while rivaling domain-specific expert baselines. This factorized paradigm offers a practical, scalable path toward truly universal STFMs. Code is available at this https URL.

Title: Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks

Authors: Pengfei Zhu, Xavier Maldague
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12149
Pdf URL: https://arxiv.org/pdf/2601.12149
Copy Paste: [[2601.12149]] Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks(https://arxiv.org/abs/2601.12149)
Keywords: self-supervised
Abstract: Terahertz (THz) systems inherently introduce frequency-dependent degradation effects, resulting in low-frequency blurring and high-frequency noise in amplitude images. Conventional image processing techniques cannot simultaneously address both issues, and manual intervention is often required due to the unknown boundary between denoising and deblurring. To tackle this challenge, we propose a principal component analysis (PCA)-based THz self-supervised denoising and deblurring network (THz-SSDD). The network employs a Recorrupted-to-Recorrupted self-supervised learning strategy to capture the intrinsic features of noise by exploiting invariance under repeated corruption. PCA decomposition and reconstruction are then applied to restore images across both low and high frequencies. The performance of the THz-SSDD network was evaluated on four types of samples. Training requires only a small set of unlabeled noisy images, and testing across samples with different material properties and measurement modes demonstrates effective denoising and deblurring. Quantitative analysis further validates the network feasibility, showing improvements in image quality while preserving the physical characteristics of the original signals.

Title: Enhanced Diagnostic Performance via Large-Resolution Inference Optimization for Pathology Foundation Models

Authors: Mengxuan Hu, Zihan Guan, John Kang, Sheng Li, Zhongliang Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12150
Pdf URL: https://arxiv.org/pdf/2601.12150
Copy Paste: [[2601.12150]] Enhanced Diagnostic Performance via Large-Resolution Inference Optimization for Pathology Foundation Models(https://arxiv.org/abs/2601.12150)
Keywords: foundation model
Abstract: Despite their prominent performance on tasks such as ROI classification and segmentation, many pathology foundation models remain constrained by a specific input size e.g. 224 x 224, creating substantial inefficiencies when applied to whole-slide images (WSIs), which span thousands of resolutions. A naive strategy is to either enlarge inputs or downsample the WSIs. However, enlarging inputs results in prohibitive GPU memory consumption, while downsampling alters the microns-per-pixel resolution and obscures critical morphological details. To overcome these limitations, we propose an space- and time- efficient inference strategy that sparsifies attention using spatially aware neighboring blocks and filters out non-informative tokens through global attention scores. This design substantially reduces GPU memory and runtime during high-resolution WSI inference while preserving and even improving the downstream performance, enabling inference at higher resolutions under the same GPU budget. The experimental results show that our method can achieves up to an 7.67% improvement in the ROI classification and compatible results in segmentation.

Title: Wavelet-Driven Masked Multiscale Reconstruction for PPG Foundation Models

Authors: Megha Thukral, Cyrus Tanade, Simon A. Lee, Juhyeon Lee, Hao Zhou, Keum San Chun, Migyeong Gwak, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Mehrab Bin Morshed, Subramaniam Venkatraman, Sharanya Arcot Desai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12215
Pdf URL: https://arxiv.org/pdf/2601.12215
Copy Paste: [[2601.12215]] Wavelet-Driven Masked Multiscale Reconstruction for PPG Foundation Models(https://arxiv.org/abs/2601.12215)
Keywords: self-supervised, foundation model
Abstract: Wearable foundation models have the potential to transform digital health by learning transferable representations from large-scale biosignals collected in everyday settings. While recent progress has been made in large-scale pretraining, most approaches overlook the spectral structure of photoplethysmography (PPG) signals, wherein physiological rhythms unfold across multiple frequency bands. Motivated by the insight that many downstream health-related tasks depend on multi-resolution features spanning fine-grained waveform morphology to global rhythmic dynamics, we introduce Masked Multiscale Reconstruction (MMR) for PPG representation learning - a self-supervised pretraining framework that explicitly learns from hierarchical time-frequency scales of PPG data. The pretraining task is designed to reconstruct randomly masked out coefficients obtained from a wavelet-based multiresolution decomposition of PPG signals, forcing the transformer encoder to integrate information across temporal and spectral scales. We pretrain our model with MMR using ~17 million unlabeled 10-second PPG segments from ~32,000 smartwatch users. On 17 of 19 diverse health-related tasks, MMR trained on large-scale wearable PPG data improves over or matches state-of-the-art open-source PPG foundation models, time-series foundation models, and other self-supervised baselines. Extensive analysis of our learned embeddings and systematic ablations underscores the value of wavelet-based representations, showing that they capture robust and physiologically-grounded features. Together, these results highlight the potential of MMR as a step toward generalizable PPG foundation models.

Title: Learning Longitudinal Health Representations from EHR and Wearable Data

Authors: Yuanyun Zhang, Han Zhou, Li Feng, Yilin Hong, Shi Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.12227
Pdf URL: https://arxiv.org/pdf/2601.12227
Copy Paste: [[2601.12227]] Learning Longitudinal Health Representations from EHR and Wearable Data(https://arxiv.org/abs/2601.12227)
Keywords: foundation model
Abstract: Foundation models trained on electronic health records show strong performance on many clinical prediction tasks but are limited by sparse and irregular documentation. Wearable devices provide dense continuous physiological signals but lack semantic grounding. Existing methods usually model these data sources separately or combine them through late fusion. We propose a multimodal foundation model that jointly represents electronic health records and wearable data as a continuous time latent process. The model uses modality specific encoders and a shared temporal backbone pretrained with self supervised and cross modal objectives. This design produces representations that are temporally coherent and clinically grounded. Across forecasting physiological and risk modeling tasks the model outperforms strong electronic health record only and wearable only baselines especially at long horizons and under missing data. These results show that joint electronic health record and wearable pretraining yields more faithful representations of longitudinal health.

Title: Wavelet-Aware Anomaly Detection in Multi-Channel User Logs via Deviation Modulation and Resolution-Adaptive Attention

Authors: Kaichuan Kong, Dongjie Liu, Xiaobo Jin, Shijie Xu, Guanggang Geng
Subjects: cs.LG, cs.CR, stat.CO
Abstract URL: https://arxiv.org/abs/2601.12231
Pdf URL: https://arxiv.org/pdf/2601.12231
Copy Paste: [[2601.12231]] Wavelet-Aware Anomaly Detection in Multi-Channel User Logs via Deviation Modulation and Resolution-Adaptive Attention(https://arxiv.org/abs/2601.12231)
Keywords: anomaly
Abstract: Insider threat detection is a key challenge in enterprise security, relying on user activity logs that capture rich and complex behavioral patterns. These logs are often multi-channel, non-stationary, and anomalies are rare, making anomaly detection challenging. To address these issues, we propose a novel framework that integrates wavelet-aware modulation, multi-resolution wavelet decomposition, and resolution-adaptive attention for robust anomaly detection. Our approach first applies a deviation-aware modulation scheme to suppress routine behaviors while amplifying anomalous deviations. Next, discrete wavelet transform (DWT) decomposes the log signals into multi-resolution representations, capturing both long-term trends and short-term anomalies. Finally, a learnable attention mechanism dynamically reweights the most discriminative frequency bands for detection. On the CERT r4.2 benchmark, our approach consistently outperforms existing baselines in precision, recall, and F1 score across various time granularities and scenarios.

Title: DiffusionQC: Artifact Detection in Histopathology via Diffusion Model

Authors: Zhenzhen Wang, Zhongliang Zhou, Zhuoyu Wen, Jeong Hwan Kook, John B Wojcik, John Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12233
Pdf URL: https://arxiv.org/pdf/2601.12233
Copy Paste: [[2601.12233]] DiffusionQC: Artifact Detection in Histopathology via Diffusion Model(https://arxiv.org/abs/2601.12233)
Keywords: diffusion
Abstract: Digital pathology plays a vital role across modern medicine, offering critical insights for disease diagnosis, prognosis, and treatment. However, histopathology images often contain artifacts introduced during slide preparation and digitization. Detecting and excluding them is essential to ensure reliable downstream analysis. Traditional supervised models typically require large annotated datasets, which is resource-intensive and not generalizable to novel artifact types. To address this, we propose DiffusionQC, which detects artifacts as outliers among clean images using a diffusion model. It requires only a set of clean images for training rather than pixel-level artifact annotations and predefined artifact types. Furthermore, we introduce a contrastive learning module to explicitly enlarge the distribution separation between artifact and clean images, yielding an enhanced version of our method. Empirical results demonstrate superior performance to state-of-the-art and offer cross-stain generalization capacity, with significantly less data and annotations.

Title: Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Authors: Miao Li, Hanyang Jiang, Sikai Chen, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12247
Pdf URL: https://arxiv.org/pdf/2601.12247
Copy Paste: [[2601.12247]] Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models(https://arxiv.org/abs/2601.12247)
Keywords: diffusion
Abstract: Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

Title: Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy

Authors: Fadlullah Raji, John Murray-Bruce
Subjects: cs.CV, cs.AI, cs.CG, cs.GR
Abstract URL: https://arxiv.org/abs/2601.12257
Pdf URL: https://arxiv.org/pdf/2601.12257
Copy Paste: [[2601.12257]] Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy(https://arxiv.org/abs/2601.12257)
Keywords: diffusion
Abstract: Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging or to localizing a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into \textit{light-occluding} and \textit{non-light-occluding} components to yield a separable non-linear least squares (SNLLS) inverse problem. We develop two solutions: A gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Moreover, SSD is trained in simulation but generalizes well to unseen classes in simulation and real-world NLOS scenes. SSD also shows surprising robustness to noise and ambient illumination.

Title: Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Authors: Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12263
Pdf URL: https://arxiv.org/pdf/2601.12263
Copy Paste: [[2601.12263]] Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers(https://arxiv.org/abs/2601.12263)
Keywords: generative
Abstract: Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.

Title: AgenticPruner: MAC-Constrained Neural Network Compression via LLM-Driven Strategy Search

Authors: Shahrzad Esmat, Mahdi Banisharif, Ali Jannesari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12272
Pdf URL: https://arxiv.org/pdf/2601.12272
Copy Paste: [[2601.12272]] AgenticPruner: MAC-Constrained Neural Network Compression via LLM-Driven Strategy Search(https://arxiv.org/abs/2601.12272)
Keywords: in-context
Abstract: Neural network pruning remains essential for deploying deep learning models on resource-constrained devices, yet existing approaches primarily target parameter reduction without directly controlling computational cost. This yields unpredictable inference latency in deployment scenarios where strict Multiply-Accumulate (MAC) operation budgets must be met. We propose AgenticPruner, a framework utilizing large language models to achieve MAC-constrained optimization through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet that learns optimal strategies from historical attempts. Through in-context learning, the Analysis Agent improves convergence success rate from 48% to 71% compared to grid search. Building upon isomorphic pruning's graph-based structural grouping, our method adds context-aware adaptation by analyzing patterns across pruning iterations, enabling automatic convergence to target MAC budgets within user-defined tolerance bands. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures. On CNNs, our approach achieves MAC targeting while maintaining or improving accuracy: ResNet-50 reaches 1.77G MACs with 77.04% accuracy (+0.91% vs baseline); ResNet-101 achieves 4.22G MACs with 78.94% accuracy (+1.56% vs baseline). For ConvNeXt-Small, pruning to 8.17G MACs yields 1.41x GPU and 1.07x CPU speedup with 45% parameter reduction. On Vision Transformers, we demonstrate MAC-budget compliance within user-defined tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot), establishing feasibility for deployment scenarios requiring strict computational guarantees.

Title: SDiT: Semantic Region-Adaptive for Diffusion Transformers

Authors: Bowen Lin, Fanjiang Ye, Yihua Liu, Zhenghui Guo, Boyuan Zhang, Weijian Zheng, Yufan Xu, Tiancheng Xing, Yuke Wang, Chengming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12283
Pdf URL: https://arxiv.org/pdf/2601.12283
Copy Paste: [[2601.12283]] SDiT: Semantic Region-Adaptive for Diffusion Transformers(https://arxiv.org/abs/2601.12283)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. In this work, we observe that denoising dynamics are spatially non-uniform-background regions converge rapidly while edges and textured areas evolve much more actively. Building on this insight, we propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity. SDiT introduces a training-free framework combining (1) semantic-aware clustering via fast Quickshift-based segmentation, (2) complexity-driven regional scheduling to selectively update informative areas, and (3) boundary-aware refinement to maintain spatial coherence. Without any model retraining or architectural modification, SDiT achieves up to 3.0x acceleration while preserving nearly identical perceptual and semantic quality to full-attention inference.

Title: Conversational Context Classification: A Representation Engineering Approach

Authors: Jonathan Pan
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2601.12286
Pdf URL: https://arxiv.org/pdf/2601.12286
Copy Paste: [[2601.12286]] Conversational Context Classification: A Representation Engineering Approach(https://arxiv.org/abs/2601.12286)
Keywords: anomaly, in-context
Abstract: The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection struggles to directly apply within contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training OCSVM on in-context examples, we establish a robust boundary within the LLM's hidden state latent space. We evaluate out study with two open source LLMs - Llama and Qwen models in specific contextual domain. Our approach entailed identifying the optimal layers within the LLM's internal state subspaces that strongly associates with the context of interest. Our evaluation results showed promising results in identifying the subspace for a specific context. Aside from being useful in detecting in or out of context conversation threads, this research work contributes to the study of better interpreting LLMs.

Title: S^2F-Net:A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection

Authors: Xiangyu Hu, Yicheng Hong, Hongchuang Zheng, Wenjun Zeng, Bingyao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12313
Pdf URL: https://arxiv.org/pdf/2601.12313
Copy Paste: [[2601.12313]] S^2F-Net:A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection(https://arxiv.org/abs/2601.12313)
Keywords: generative
Abstract: The rapid development of generative models has imposed an urgent demand for detection schemes with strong generalization capabilities. However, existing detection methods generally suffer from overfitting to specific source models, leading to significant performance degradation when confronted with unseen generative architectures. To address these challenges, this paper proposes a cross-model detection framework called S 2 F-Net, whose core lies in exploring and leveraging the inherent spectral discrepancies between real and synthetic textures. Considering that upsampling operations leave unique and distinguishable frequency fingerprints in both texture-poor and texture-rich regions, we focus our research on the detection of frequency-domain artifacts, aiming to fundamentally improve the generalization performance of the model. Specifically, we introduce a learnable frequency attention module that adaptively weights and enhances discriminative frequency bands by synergizing spatial texture analysis and spectral this http URL the AIGCDetectBenchmark, which includes 17 categories of generative models, S 2 F-Net achieves a detection accuracy of 90.49%, significantly outperforming various existing baseline methods in cross-domain detection scenarios.

Title: Turbo-GoDec: Exploiting the Cluster Sparsity Prior for Hyperspectral Anomaly Detection

Authors: Jiahui Sheng, Xiaorun Li, Shuhan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12337
Pdf URL: https://arxiv.org/pdf/2601.12337
Copy Paste: [[2601.12337]] Turbo-GoDec: Exploiting the Cluster Sparsity Prior for Hyperspectral Anomaly Detection(https://arxiv.org/abs/2601.12337)
Keywords: anomaly
Abstract: As a key task in hyperspectral image processing, hyperspectral anomaly detection has garnered significant attention and undergone extensive research. Existing methods primarily relt on two prior assumption: low-rank background and sparse anomaly, along with additional spatial assumptions of the background. However, most methods only utilize the sparsity prior assumption for anomalies and rarely expand on this hypothesis. From observations of hyperspectral images, we find that anomalous pixels exhibit certain spatial distribution characteristics: they often manifest as small, clustered groups in space, which we refer to as cluster sparsity of anomalies. Then, we combined the cluster sparsity prior with the classical GoDec algorithm, incorporating the cluster sparsity prior into the S-step of GoDec. This resulted in a new hyperspectral anomaly detection method, which we called Turbo-GoDec. In this approach, we modeled the cluster sparsity prior of anomalies using a Markov random field and computed the marginal probabilities of anomalies through message passing on a factor graph. Locations with high anomalous probabilities were treated as the sparse component in the Turbo-GoDec. Experiments are conducted on three real hyperspectral image (HSI) datasets which demonstrate the superior performance of the proposed Turbo-GoDec method in detecting small-size anomalies comparing with the vanilla GoDec (LSMAD) and state-of-the-art anomaly detection methods. The code is available at this https URL.

Title: Time-Continuous Modeling for Temporal Affective Pattern Recognition in LLMs

Authors: Rezky Kam, Coddy N. Siswanto
Subjects: cs.LG, cs.AI, cs.ET, cs.HC, eess.SY
Abstract URL: https://arxiv.org/abs/2601.12341
Pdf URL: https://arxiv.org/pdf/2601.12341
Copy Paste: [[2601.12341]] Time-Continuous Modeling for Temporal Affective Pattern Recognition in LLMs(https://arxiv.org/abs/2601.12341)
Keywords: in-context
Abstract: This paper introduces a dataset and conceptual framework for LLMs to mimic real world emotional dynamics through time and in-context learning leveraging physics-informed neural network, opening a possibility for interpretable dialogue modeling.

Title: From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

Authors: Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2601.12358
Pdf URL: https://arxiv.org/pdf/2601.12358
Copy Paste: [[2601.12358]] From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles(https://arxiv.org/abs/2601.12358)
Keywords: in-context
Abstract: Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.

Title: DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data

Authors: Jiafei Zhang, Songliang Cao, Binghui Xu, Yanan Li, Weiwei Jia, Tingting Wu, Hao Lu, Weijuan Hu, Zhiguo Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12366
Pdf URL: https://arxiv.org/pdf/2601.12366
Copy Paste: [[2601.12366]] DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data(https://arxiv.org/abs/2601.12366)
Keywords: foundation model
Abstract: DepthCropSeg++: a foundation model for crop segmentation, capable of segmenting different crop species under open in-field environment. Crop segmentation is a fundamental task for modern agriculture, which closely relates to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real world generalization due to significant model capacity and largescale datasets. However, current crop segmentation models mostly learn from limited data due to expensive pixel-level labelling cost, often performing well only under specific crop types or controlled environment. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross-species and crossscene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon a state-of-the-art semantic segmentation architecture ViT-Adapter architecture, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two-stage selftraining pipeline. To systematically validate model performance, we conduct comprehensive experiments to justify the effectiveness and generalization capabilities across multiple crop datasets. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models like Segmentation Anything Model (SAM) by significant margins (+0.36% and +48.57% respectively). The model particularly excels in challenging scenarios including night-time environment (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.

Title: LR-DWM: Efficient Watermarking for Diffusion Language Models

Authors: Ofek Raban, Ethan Fetaya, Gal Chechik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12376
Pdf URL: https://arxiv.org/pdf/2601.12376
Copy Paste: [[2601.12376]] LR-DWM: Efficient Watermarking for Diffusion Language Models(https://arxiv.org/abs/2601.12376)
Keywords: diffusion
Abstract: Watermarking (WM) is a critical mechanism for detecting and attributing AI-generated content. Current WM methods for Large Language Models (LLMs) are predominantly tailored for autoregressive (AR) models: They rely on tokens being generated sequentially, and embed stable signals within the generated sequence based on the previously sampled text. Diffusion Language Models (DLMs) generate text via non-sequential iterative denoising, which requires significant modification to use WM methods designed for AR models. Recent work proposed to watermark DLMs by inverting the process when needed, but suffers significant computational or memory overhead. We introduce Left-Right Diffusion Watermarking (LR-DWM), a scheme that biases the generated token based on both left and right neighbors, when they are available. LR-DWM incurs minimal runtime and memory overhead, remaining close to the non-watermarked baseline DLM while enabling reliable statistical detection under standard evaluation settings. Our results demonstrate that DLMs can be watermarked efficiently, achieving high detectability with negligible computational and memory overhead.

Title: Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection

Authors: Jiahui Sheng, Yidan Shi, Shu Xiang, Xiaorun Li, Shuhan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12379
Pdf URL: https://arxiv.org/pdf/2601.12379
Copy Paste: [[2601.12379]] Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection(https://arxiv.org/abs/2601.12379)
Keywords: generative, anomaly
Abstract: Hyperspectral images (HSIs) are a type of image that contains abundant spectral information. As a type of real-world data, the high-dimensional spectra in hyperspectral images are actually determined by only a few factors, such as chemical composition and illumination. Thus, spectra in hyperspectral images are highly likely to satisfy the manifold hypothesis. Based on the hyperspectral manifold hypothesis, we propose a novel hyperspectral anomaly detection method (named ScoreAD) that leverages the time-dependent gradient field of the data distribution (i.e., the score), as learned by a score-based generative model (SGM). Our method first trains the SGM on the entire set of spectra from the hyperspectral image. At test time, each spectrum is passed through a perturbation kernel, and the resulting perturbed spectrum is fed into the trained SGM to obtain the estimated score. The manifold hypothesis of HSIs posits that background spectra reside on one or more low-dimensional manifolds. Conversely, anomalous spectra, owing to their unique spectral signatures, are considered outliers that do not conform to the background manifold. Based on this fundamental discrepancy in their manifold distributions, we leverage a generative SGM to achieve hyperspectral anomaly detection. Experiments on the four hyperspectral datasets demonstrate the effectiveness of the proposed method. The code is available at this https URL.

Title: A Hierarchical Benchmark of Foundation Models for Dermatology

Authors: Furkan Yuceyalcin, Abdurrahim Yilmaz, Burak Temelkuran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12382
Pdf URL: https://arxiv.org/pdf/2601.12382
Copy Paste: [[2601.12382]] A Hierarchical Benchmark of Foundation Models for Dermatology(https://arxiv.org/abs/2601.12382)
Keywords: foundation model
Abstract: Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model's ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we calculated frozen embeddings and trained lightweight adapter models using a five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a "granularity gap" in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.

Title: Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation

Authors: Dasith de Silva Edirimuni, Ajmal Saeed Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12391
Pdf URL: https://arxiv.org/pdf/2601.12391
Copy Paste: [[2601.12391]] Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation(https://arxiv.org/abs/2601.12391)
Keywords: diffusion
Abstract: Most 3D scene generation methods are limited to only generating object bounding box parameters while newer diffusion methods also generate class labels and latent features. Using object size or latent feature, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into the correct point cloud objects which agree with target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features, by employing a pioneering $\textit{class-partitioned codebook}$ where codevectors are labeled by class. To address the problem of $\textit{codebook collapse}$, we propose a $\textit{class-aware}$ running average update which reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE's class-aware inverse look-up then maps generated latents to codebook entries that are decoded to class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external objects database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reduction in Chamfer and Point2Mesh errors on complex living room scenes.

Title: Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation

Authors: Jinmei Liu, Haoru Li, Zhenhong Sun, Chaofeng Chen, Yatao Bian, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12401
Pdf URL: https://arxiv.org/pdf/2601.12401
Copy Paste: [[2601.12401]] Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation(https://arxiv.org/abs/2601.12401)
Keywords: diffusion, generative
Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for fine-tuning large-scale generative models, such as diffusion and flow models, to align with complex human preferences and user-specified tasks. A fundamental limitation remains \textit{the curse of diversity collapse}, where the objective formulation and optimization landscape inherently collapse the policy to a Dirac delta distribution. To address this challenge, we propose \textbf{DRIFT} (\textbf{D}ive\textbf{R}sity-\textbf{I}ncentivized Reinforcement \textbf{F}ine-\textbf{T}uning for Versatile Image Generation), an innovative framework that systematically incentivizes output diversity throughout the on-policy fine-tuning process, reconciling strong task alignment with high generation diversity to enhance versatility essential for applications that demand diverse candidate generations. We approach the problem across three representative perspectives: i) \textbf{sampling} a reward-concentrated subset that filters out reward outliers to prevent premature collapse; ii) \textbf{prompting} with stochastic variations to expand the conditioning space, and iii) \textbf{optimization} of the intra-group diversity with a potential-based reward shaping mechanism. Experimental results show that DRIFT achieves superior Pareto dominance regarding task alignment and generation diversity, yielding a $ 9.08\%\!\sim\! 43.46\%$ increase in diversity at equivalent alignment levels and a $ 59.65\% \!\sim\! 65.86\%$ increase in alignment at equivalent levels of diversity.

Title: Graph Attention Networks with Physical Constraints for Anomaly Detection

Authors: Mohammadhossein Homaei, Iman Khazrak, Ruben Molano, Andres Caro, Mar Avila
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2601.12426
Pdf URL: https://arxiv.org/pdf/2601.12426
Copy Paste: [[2601.12426]] Graph Attention Networks with Physical Constraints for Anomaly Detection(https://arxiv.org/abs/2601.12426)
Keywords: anomaly
Abstract: Water distribution systems (WDSs) face increasing cyber-physical risks, which make reliable anomaly detection essential. Many data-driven models ignore network topology and are hard to interpret, while model-based ones depend strongly on parameter accuracy. This work proposes a hydraulic-aware graph attention network using normalized conservation law violations as features. It combines mass and energy balance residuals with graph attention and bidirectional LSTM to learn spatio-temporal patterns. A multi-scale module aggregates detection scores from node to network level. On the BATADAL dataset, it reaches $F1=0.979$, showing $3.3$pp gain and high robustness under $15\%$ parameter noise.

Title: Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

Authors: Marcus Ma, Jordan Prescott, Emily Zhou, Tiantian Feng, Kleanthis Avramidis, Gabor Mihaly Toth, Shrikanth Narayanan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12534
Pdf URL: https://arxiv.org/pdf/2601.12534
Copy Paste: [[2601.12534]] Encoding Emotion Through Self-Supervised Eye Movement Reconstruction(https://arxiv.org/abs/2601.12534)
Keywords: self-supervised
Abstract: The relationship between emotional expression and eye movement is well-documented, with literature establishing gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation's Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction that can effectively leverage unlabeled video. We use this model's encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude self-supervised eye movement reconstruction is an effective method for encoding the affective signal they carry.

Title: Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning

Authors: Ahmed Attia, Alham Fikri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12535
Pdf URL: https://arxiv.org/pdf/2601.12535
Copy Paste: [[2601.12535]] Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning(https://arxiv.org/abs/2601.12535)
Keywords: self-supervised
Abstract: Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many potential methods for improving low-resource MT remain unexplored. We investigate a self-supervised reinforcement-learning-based fine-tuning for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving.

Title: Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

Authors: Elham Tajik, Conrad Borchers, Bahar Shahrokhian, Sebastian Simon, Ali Keramati, Sonika Pal, Sreecharan Sankaranarayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12618
Pdf URL: https://arxiv.org/pdf/2601.12618
Copy Paste: [[2601.12618]] Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems(https://arxiv.org/abs/2601.12618)
Keywords: generative
Abstract: Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi-agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub-functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter-rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning-trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.

Title: VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

Authors: Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, Yi Zhang, Zhi-Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12672
Pdf URL: https://arxiv.org/pdf/2601.12672
Copy Paste: [[2601.12672]] VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness(https://arxiv.org/abs/2601.12672)
Keywords: generative
Abstract: The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions including safety-critical scenario generation and closed-loop learning often rely on rule-based heuristics, resampling methods and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.

Title: S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

Authors: Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, Sergey Tulyakov, Yanzhi Wang, Anil Kag, Yanyu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12719
Pdf URL: https://arxiv.org/pdf/2601.12719
Copy Paste: [[2601.12719]] S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation(https://arxiv.org/abs/2601.12719)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.

Title: DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition

Authors: Hanyu Zhu, Zhihao Zhan, Yuhang Ming, Liang Li, Dibo Hou, Javier Civera, Wanzeng Kong
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2601.12729
Pdf URL: https://arxiv.org/pdf/2601.12729
Copy Paste: [[2601.12729]] DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition(https://arxiv.org/abs/2601.12729)
Keywords: foundation model
Abstract: One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query--residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.

Title: A Graph Prompt Fine-Tuning Method for WSN Spatio-Temporal Correlation Anomaly Detection

Authors: Miao Ye, Jing Cui, Yuan huang, Qian He, Yong Wang, Jiwen Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12745
Pdf URL: https://arxiv.org/pdf/2601.12745
Copy Paste: [[2601.12745]] A Graph Prompt Fine-Tuning Method for WSN Spatio-Temporal Correlation Anomaly Detection(https://arxiv.org/abs/2601.12745)
Keywords: self-supervised, anomaly
Abstract: Anomaly detection of multi-temporal modal data in Wireless Sensor Network (WSN) can provide an important guarantee for reliable network operation. Existing anomaly detection methods in multi-temporal modal data scenarios have the problems of insufficient extraction of spatio-temporal correlation features, high cost of anomaly sample category annotation, and imbalance of anomaly samples. In this paper, a graph neural network anomaly detection backbone network incorporating spatio-temporal correlation features and a multi-task self-supervised training strategy of "pre-training - graph prompting - fine-tuning" are designed for the characteristics of WSN graph structure data. First, the anomaly detection backbone network is designed by improving the Mamba model based on a multi-scale strategy and inter-modal fusion method, and combining it with a variational graph convolution module, which is capable of fully extracting spatio-temporal correlation features in the multi-node, multi-temporal modal scenarios of WSNs. Secondly, we design a three-subtask learning "pre-training" method with no-negative comparative learning, prediction, and reconstruction to learn generic features of WSN data samples from unlabeled data, and design a "graph prompting-fine-tuning" mechanism to guide the pre-trained self-supervised learning. The model is fine-tuned through the "graph prompting-fine-tuning" mechanism to guide the pre-trained self-supervised learning model to complete the parameter fine-tuning, thereby reducing the training cost and enhancing the detection generalization performance. The F1 metrics obtained from experiments on the public dataset and the actual collected dataset are up to 91.30% and 92.31%, respectively, which provides better detection performance and generalization ability than existing methods designed by the method.

Title: SSPFormer: Self-Supervised Pretrained Transformer for MRI Images

Authors: Jingkai Li, Xiaoze Tian, Yuhang Shen, Jia Wang, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12747
Pdf URL: https://arxiv.org/pdf/2601.12747
Copy Paste: [[2601.12747]] SSPFormer: Self-Supervised Pretrained Transformer for MRI Images(https://arxiv.org/abs/2601.12747)
Keywords: self-supervised
Abstract: The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.

Title: Moaw: Unleashing Motion Awareness for Video Diffusion Models

Authors: Tianqi Zhang, Ziyi Wang, Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Zhengyang Huang, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12761
Pdf URL: https://arxiv.org/pdf/2601.12761
Copy Paste: [[2601.12761]] Moaw: Unleashing Motion Awareness for Video Diffusion Models(https://arxiv.org/abs/2601.12761)
Keywords: diffusion, generative
Abstract: Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.

Title: Towards Unbiased Source-Free Object Detection via Vision Foundation Models

Authors: Zhi Cai, Yingjie Gao, Yanan Zhang, Xinzhu Ma, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12765
Pdf URL: https://arxiv.org/pdf/2601.12765
Copy Paste: [[2601.12765]] Towards Unbiased Source-Free Object Detection via Vision Foundation Models(https://arxiv.org/abs/2601.12765)
Keywords: foundation model
Abstract: Source-Free Object Detection (SFOD) has garnered much attention in recent years by eliminating the need of source-domain data in cross-domain tasks, but existing SFOD methods suffer from the Source Bias problem, i.e. the adapted model remains skewed towards the source domain, leading to poor generalization and error accumulation during self-training. To overcome this challenge, we propose Debiased Source-free Object Detection (DSOD), a novel VFM-assisted SFOD framework that can effectively mitigate source bias with the help of powerful VFMs. Specifically, we propose Unified Feature Injection (UFI) module that integrates VFM features into the CNN backbone through Simple-Scale Extension (SSE) and Domain-aware Adaptive Weighting (DAAW). Then, we propose Semantic-aware Feature Regularization (SAFR) that constrains feature learning to prevent overfitting to source domain characteristics. Furthermore, we propose a VFM-free variant, termed DSOD-distill for computation-restricted scenarios through a novel Dual-Teacher distillation scheme. Extensive experiments on multiple benchmarks demonstrate that DSOD outperforms state-of-the-art SFOD methods, achieving 48.1% AP on Normal-to-Foggy weather adaptation, 39.3% AP on Cross-scene adaptation, and 61.4% AP on Synthetic-to-Real adaptation.

Title: Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image

Authors: Shuling Zhao, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12770
Pdf URL: https://arxiv.org/pdf/2601.12770
Copy Paste: [[2601.12770]] Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image(https://arxiv.org/abs/2601.12770)
Keywords: generative
Abstract: Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. In this work, we propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass, enabling real-time animation and simultaneous 360$^\circ$ rendering views. To facilitate efficient animation control, we model 3D head avatars with Gaussian primitives embedded on the surface of a parametric face model within the UV space. To obtain knowledge of full-head geometry and textures, we leverage rich 3D full-head priors within a pretrained 3D generative adversarial network (GAN) for global full-head feature extraction and multi-view supervision. To increase the fidelity of the 3D reconstruction of the input image, we take advantage of the symmetric nature of the UV space and human faces to fuse local fine-grained input image features with the global full-head textures. Extensive experiments demonstrate the effectiveness of our method, achieving high-quality 3D full-head modeling as well as real-time animation, thereby improving the realism of 3D talking avatars.

Title: Distilling Time Series Foundation Models for Efficient Forecasting

Authors: Yuqi Li, Kuiye Ding, Chuanguang Yang, Szu-Yu Chen, Yingli Tian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.12785
Pdf URL: https://arxiv.org/pdf/2601.12785
Copy Paste: [[2601.12785]] Distilling Time Series Foundation Models for Efficient Forecasting(https://arxiv.org/abs/2601.12785)
Keywords: foundation model
Abstract: Time Series foundation models (TSFMs) deliver strong forecasting performance through large-scale pretraining, but their large parameter sizes make deployment costly. While knowledge distillation offers a natural and effective approach for model compression, techniques developed for general machine learning tasks are not directly applicable to time series forecasting due to the unique characteristics. To address this, we present DistilTS, the first distillation framework specifically designed for TSFMs. DistilTS addresses two key challenges: (1) task difficulty discrepancy, specific to forecasting, where uniform weighting makes optimization dominated by easier short-term horizons, while long-term horizons receive weaker supervision; and (2) architecture discrepancy, a general challenge in distillation, for which we design an alignment mechanism in the time series forecasting. To overcome these issues, DistilTS introduces horizon-weighted objectives to balance learning across horizons, and a temporal alignment strategy that reduces architectural mismatch, enabling compact models. Experiments on multiple benchmarks demonstrate that DistilTS achieves forecasting performance comparable to full-sized TSFMs, while reducing parameters by up to 1/150 and accelerating inference by up to 6000x. Code is available at: this https URL.

Title: A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling

Authors: Wei Chen, Liang Wu, Shuyi Lu, Yuanyuan Sun, Wenkai Bi, Zilong Yuan, Yaoyao He, Feng Wang, Junchi Ma, Shuyong Liu, Zhaoping Cheng, Xiaoyan Hu, Jianfeng Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12820
Pdf URL: https://arxiv.org/pdf/2601.12820
Copy Paste: [[2601.12820]] A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling(https://arxiv.org/abs/2601.12820)
Keywords: foundation model
Abstract: Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.

Title: Knowledge-Integrated Representation Learning for Crypto Anomaly Detection under Extreme Label Scarcity; Relational Domain-Logic Integration with Retrieval-Grounded Context and Path-Level Explanations

Authors: Gyuyeon Na, Minjung Park, Soyoun Kim, Jungbin Shin, Sangmi Chai
Subjects: cs.LG, q-fin.RM
Abstract URL: https://arxiv.org/abs/2601.12839
Pdf URL: https://arxiv.org/pdf/2601.12839
Copy Paste: [[2601.12839]] Knowledge-Integrated Representation Learning for Crypto Anomaly Detection under Extreme Label Scarcity; Relational Domain-Logic Integration with Retrieval-Grounded Context and Path-Level Explanations(https://arxiv.org/abs/2601.12839)
Keywords: anomaly
Abstract: Detecting anomalous trajectories in decentralized crypto networks is fundamentally challenged by extreme label scarcity and the adaptive evasion strategies of illicit actors. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they struggle to internalize multi hop, logic driven motifs such as fund dispersal and layering that characterize sophisticated money laundering, limiting their forensic accountability under regulations like the FATF Travel Rule. To address this limitation, we propose Relational Domain Logic Integration (RDLI), a framework that embeds expert derived heuristics as differentiable, logic aware latent signals within representation learning. Unlike static rule based approaches, RDLI enables the detection of complex transactional flows that evade standard message passing. To further account for market volatility, we incorporate a Retrieval Grounded Context (RGC) module that conditions anomaly scoring on regulatory and macroeconomic context, mitigating false positives caused by benign regime shifts. Under extreme label scarcity (0.01%), RDLI outperforms state of the art GNN baselines by 28.9% in F1 score. A micro expert user study further confirms that RDLI path level explanations significantly improve trustworthiness, perceived usefulness, and clarity compared to existing methods, highlighting the importance of integrating domain logic with contextual grounding for both accuracy and explainability.

Title: Generating Cyclic Conformers with Flow Matching in Cremer-Pople Coordinates

Authors: Luca Schaufelberger, Aline Hartgers, Kjell Jorner
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2601.12859
Pdf URL: https://arxiv.org/pdf/2601.12859
Copy Paste: [[2601.12859]] Generating Cyclic Conformers with Flow Matching in Cremer-Pople Coordinates(https://arxiv.org/abs/2601.12859)
Keywords: generative
Abstract: Cyclic molecules are ubiquitous across applications in chemistry and biology. Their restricted conformational flexibility provides structural pre-organization that is key to their function in drug discovery and catalysis. However, reliably sampling the conformer ensembles of ring systems remains challenging. Here, we introduce PuckerFlow, a generative machine learning model that performs flow matching on the Cremer-Pople space, a low-dimensional internal coordinate system capturing the relevant degrees of freedom of rings. Our approach enables generation of valid closed rings by design and demonstrates strong performance in generating conformers that are both diverse and precise. We show that PuckerFlow outperforms other conformer generation methods on nearly all quantitative metrics and illustrate the potential of PuckerFlow for ring systems relevant to chemical applications, particularly in catalysis and drug discovery. This work enables efficient and reliable conformer generation of cyclic structures, paving the way towards modeling structure-property relationships and the property-guided generation of rings across a wide range of applications in chemistry and biology.

Title: PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection

Authors: Sharmila S P
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12866
Pdf URL: https://arxiv.org/pdf/2601.12866
Copy Paste: [[2601.12866]] PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection(https://arxiv.org/abs/2601.12866)
Keywords: anomaly
Abstract: The increasing prevalence of malicious Portable Document Format (PDF) files necessitates robust and comprehensive feature extraction techniques for effective detection and analysis. This work presents a unified framework that integrates graph-based, structural, and metadata-driven analysis to generate a rich feature representation for each PDF document. The system extracts text from PDF pages and constructs undirected graphs based on pairwise word relationships, enabling the computation of graph-theoretic features such as node count, edge density, and clustering coefficient. Simultaneously, the framework parses embedded metadata to quantify character distributions, entropy patterns, and inconsistencies across fields such as author, title, and producer. Temporal features are derived from creation and modification timestamps to capture behavioral signatures, while structural elements including, object streams, fonts, and embedded images, are quantified to reflect document complexity. Boolean flags for potentially malicious PDF constructs (e.g., JavaScript, launch actions) are also extracted. Together, these features form a high-dimensional vector representation (170 dimensions) that is well-suited for downstream tasks such as malware classification, anomaly detection, and forensic analysis. The proposed approach is scalable, extensible, and designed to support real-world PDF threat intelligence workflows.6

Title: TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents

Authors: Chan Naseeb, Adeel Ashraf Cheema, Hassan Sami, Tayyab Afzal, Muhammad Omair, Usman Habib
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12895
Pdf URL: https://arxiv.org/pdf/2601.12895
Copy Paste: [[2601.12895]] TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents(https://arxiv.org/abs/2601.12895)
Keywords: generative
Abstract: The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31\% accuracy, 90.78\% AUC for classification, and 57.24\% mean Dice score for localization. The proposed method achieves an F1-score of 88.61\% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.

Title: GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation

Authors: Riccardo Catalini, Davide Di Nucci, Guido Borghi, Davide Davoli, Lorenzo Garattoni, Giampiero Francesca, Yuki Kawana, Roberto Vezzani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12948
Pdf URL: https://arxiv.org/pdf/2601.12948
Copy Paste: [[2601.12948]] GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation(https://arxiv.org/abs/2601.12948)
Keywords: diffusion
Abstract: We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at this https URL.

Title: StyMam: A Mamba-Based Generator for Artistic Style Transfer

Authors: Zhou Hong, Rongsheng Hu, Yicheng Di, Xiaolong Xu, Ning Dong, Yihua Shao, Run Ling, Yun Wang, Juqin Wang, Zhanjie Zhang, Ao Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12954
Pdf URL: https://arxiv.org/pdf/2601.12954
Copy Paste: [[2601.12954]] StyMam: A Mamba-Based Generator for Artistic Style Transfer(https://arxiv.org/abs/2601.12954)
Keywords: diffusion, generative
Abstract: Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GAN and propose a mamba-based generator, termed as StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.

Title: Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

Authors: John Waithaka, Gustave Bwirayesu, Moise Busogi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.12964
Pdf URL: https://arxiv.org/pdf/2601.12964
Copy Paste: [[2601.12964]] Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation(https://arxiv.org/abs/2601.12964)
Keywords: self-supervised
Abstract: Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.

Title: Deterministic Dynamics of Sampling Processes in Score-Based Diffusion Models with Multiplicative Noise Conditioning

Authors: Doheon Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.12965
Pdf URL: https://arxiv.org/pdf/2601.12965
Copy Paste: [[2601.12965]] Deterministic Dynamics of Sampling Processes in Score-Based Diffusion Models with Multiplicative Noise Conditioning(https://arxiv.org/abs/2601.12965)
Keywords: diffusion
Abstract: Score-based diffusion models generate new samples by learning the score function associated with a diffusion process. While the effectiveness of these models can be theoretically explained using differential equations related to the sampling process, previous work by Song and Ermon (2020) demonstrated that neural networks using multiplicative noise conditioning can still generate satisfactory samples. In this setup, the model is expressed as the product of two functions: one depending on the spatial variable and the other on the noise magnitude. This structure limits the model's ability to represent a more general relationship between the spatial variable and the noise, indicating that it cannot fully learn the correct score. Despite this limitation, the models perform well in practice. In this work, we provide a theoretical explanation for this phenomenon by studying the deterministic dynamics of the associated differential equations, offering insight into how the model operates.

Title: The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

Authors: Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.12979
Pdf URL: https://arxiv.org/pdf/2601.12979
Copy Paste: [[2601.12979]] The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check(https://arxiv.org/abs/2601.12979)
Keywords: diffusion
Abstract: The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematically failure. (1) In Embodied settings, dLLMs suffer repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g. strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

Title: Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers

Authors: Sulaiman Khan, Md. Rafiul Biswas, Zubair Shah
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.12981
Pdf URL: https://arxiv.org/pdf/2601.12981
Copy Paste: [[2601.12981]] Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers(https://arxiv.org/abs/2601.12981)
Keywords: generative
Abstract: This study introduces a novel approach for early Type 2 Diabetes Mellitus (T2DM) risk prediction using a tabular transformer (TabTrans) architecture to analyze longitudinal patient data. By processing patients` longitudinal health records and bone-related tabular data, our model captures complex, long-range dependencies in disease progression that conventional methods often overlook. We validated our TabTrans model on a retrospective Qatar BioBank (QBB) cohort of 1,382 subjects, comprising 725 men (146 diabetic, 579 healthy) and 657 women (133 diabetic, 524 healthy). The study integrated electronic health records (EHR) with dual-energy X-ray absorptiometry (DXA) data. To address class imbalance, we employed SMOTE and SMOTE-ENN resampling techniques. The proposed model`s performance is evaluated against conventional machine learning (ML) and generative AI models, including Claude 3.5 Sonnet (Anthropic`s constitutional AI), GPT-4 (OpenAI`s generative pre-trained transformer), and Gemini Pro (Google`s multimodal language model). Our TabTrans model demonstrated superior predictive performance, achieving ROC AUC $\geq$ 79.7 % for T2DM prediction compared to both generative AI models and conventional ML approaches. Feature interpretation analysis identified key risk indicators, with visceral adipose tissue (VAT) mass and volume, ward bone mineral density (BMD) and bone mineral content (BMC), T and Z-scores, and L1-L4 scores emerging as the most important predictors associated with diabetes development in Qatari adults. These findings demonstrate the significant potential of TabTrans for analyzing complex tabular healthcare data, providing a powerful tool for proactive T2DM management and personalized clinical interventions in the Qatari population. Index Terms: tabular transformers, multimodal data, DXA data, diabetes, T2DM, feature interpretation, tabular data

Title: PrivFly: A Privacy-Preserving Self-Supervised Framework for Rare Attack Detection in IoFT

Authors: Safaa Menssouri, El Mehdi Amhoud
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2601.13003
Pdf URL: https://arxiv.org/pdf/2601.13003
Copy Paste: [[2601.13003]] PrivFly: A Privacy-Preserving Self-Supervised Framework for Rare Attack Detection in IoFT(https://arxiv.org/abs/2601.13003)
Keywords: self-supervised
Abstract: The Internet of Flying Things (IoFT) plays a vital role in modern applications such as aerial surveillance and smart mobility. However, it remains highly vulnerable to cyberattacks that threaten the confidentiality, integrity, and availability of sensitive data. Developing effective intrusion detection systems (IDS) for IoFT networks faces key challenges, including data imbalance, privacy concerns, and the limited capability of traditional models to detect rare but potentially damaging cyber threats. In this work, we propose PrivFly, a privacy-preserving IDS framework that integrates self-supervised representation learning and differential privacy (DP) to enhance detection performance in imbalanced IoFT network traffic. We propose a masked feature reconstruction module for self-supervised pretraining, improving feature representations and boosting rare-class detection. Differential privacy is applied during training to protect sensitive information without significantly compromising model performance. In addition, we conduct a SHapley additive explanations (SHAP)-based analysis to evaluate the impact of DP on feature importance and model behavior. Experimental results on the ECU-IoFT dataset show that PrivFly achieves up to 98% accuracy and 99% F1-score, effectively balancing privacy and detection performance for secure IoFT systems.

Title: PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain

Authors: Sung Ju Lee, Nam Ik Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13128
Pdf URL: https://arxiv.org/pdf/2601.13128
Copy Paste: [[2601.13128]] PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain(https://arxiv.org/abs/2601.13128)
Keywords: diffusion
Abstract: The proliferation of hyper-realistic images from Latent Diffusion Models (LDMs) demands robust watermarking, yet existing post-hoc methods are prohibitively slow due to iterative optimization or inversion processes. We introduce PhaseMark, a single-shot, optimization-free framework that directly modulates the phase in the VAE latent frequency domain. This approach makes PhaseMark thousands of times faster than optimization-based techniques while achieving state-of-the-art resilience against severe attacks, including regeneration, without degrading image quality. We analyze four modulation variants, revealing a clear performance-quality trade-off. PhaseMark demonstrates a new paradigm where efficient, resilient watermarking is achieved by exploiting intrinsic latent properties.

Title: CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

Authors: Mingshuang Luo, Ruibing Hou, Bo Chao, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13133
Pdf URL: https://arxiv.org/pdf/2601.13133
Copy Paste: [[2601.13133]] CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks(https://arxiv.org/abs/2601.13133)
Keywords: self-supervised
Abstract: Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.

Title: TVWorld: Foundations for Remote-Control TV Agents

Authors: Zhantao Ma, Quanfeng Lu, Shuai Zhong, Dahai Yu, Ping Luo, Michael K. Ng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.13142
Pdf URL: https://arxiv.org/pdf/2601.13142
Copy Paste: [[2601.13142]] TVWorld: Foundations for Remote-Control TV Agents(https://arxiv.org/abs/2601.13142)
Keywords: foundation model
Abstract: Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.

Title: From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models

Authors: Pedro M. Gordaliza, Jaume Banus, Benoît Gérin, Maxence Wynen, Nataliia Molchanova, Jonas Richiardi, Meritxell Bach Cuadra
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13166
Pdf URL: https://arxiv.org/pdf/2601.13166
Copy Paste: [[2601.13166]] From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models(https://arxiv.org/abs/2601.13166)
Keywords: foundation model
Abstract: Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U-Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1-2 orders of magnitude faster and were 10 times smaller than competing transformer-based approaches. Models are available here: this https URL.

Title: LAViG-FLOW: Latent Autoregressive Video Generation for Fluid Flow Simulations

Authors: Vittoria De Pellegrini, Tariq Alkhalifah
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2601.13190
Pdf URL: https://arxiv.org/pdf/2601.13190
Copy Paste: [[2601.13190]] LAViG-FLOW: Latent Autoregressive Video Generation for Fluid Flow Simulations(https://arxiv.org/abs/2601.13190)
Keywords: diffusion
Abstract: Modeling and forecasting subsurface multiphase fluid flow fields underpin applications ranging from geological CO2 sequestration (GCS) operations to geothermal production. This is essential for ensuring both operational performance and long-term safety. While high fidelity multiphase simulators are widely used for this purpose, they become prohibitively expensive once many forward runs are required for inversion purposes and quantify uncertainty. To tackle this challenge we propose LAViG-FLOW, a latent autoregressive video generation diffusion framework that explicitly learns the coupled evolution of saturation and pressure fields. Each state variable is compressed by a dedicated 2D autoencoder, and a Video Diffusion Transformer (VDiT) models their coupled distribution across time. We first train the model on a given time horizon to learn their coupled relationship and then fine-tune it autoregressively so it can extrapolate beyond the observed time window. Evaluated on an open-source CO2 sequestration dataset, LAViG-FLOW generates saturation and pressure fields that stay consistent across time while running orders of magnitude faster than traditional numerical solvers.

Title: Diffusion-Driven Synthetic Tabular Data Generation for Enhanced DoS/DDoS Attack Classification

Authors: Aravind B, Anirud R.S., Sai Surya Teja N, Bala Subrahmanya Sriranga Navaneeth A, Karthika R, Mohankumar N
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13197
Pdf URL: https://arxiv.org/pdf/2601.13197
Copy Paste: [[2601.13197]] Diffusion-Driven Synthetic Tabular Data Generation for Enhanced DoS/DDoS Attack Classification(https://arxiv.org/abs/2601.13197)
Keywords: diffusion
Abstract: Class imbalance refers to a situation where certain classes in a dataset have significantly fewer samples than oth- ers, leading to biased model performance. Class imbalance in network intrusion detection using Tabular Denoising Diffusion Probability Models (TabDDPM) for data augmentation is ad- dressed in this paper. Our approach synthesizes high-fidelity minority-class samples from the CIC-IDS2017 dataset through iterative denoising processes. For the minority classes that have smaller samples, synthetic samples were generated and merged with the original dataset. The augmented training data enables an ANN classifier to achieve near-perfect recall on previously underrepresented attack classes. These results establish diffusion models as an effective solution for tabular data imbalance in security domains, with potential applications in fraud detection and medical diagnostics.

Title: Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation

Authors: Tianqi Du, Lizhe Fang, Weijie Yang, Chenheng Zhang, Zeming Wei, Yifei Wang, Yisen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13228
Pdf URL: https://arxiv.org/pdf/2601.13228
Copy Paste: [[2601.13228]] Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation(https://arxiv.org/abs/2601.13228)
Keywords: diffusion
Abstract: Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation-predicting one part of a sequence from another within a single-step dependency-limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.

Title: Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

Authors: Junyi Zhang, Yiming Wang, Yunhong Lu, Qichao Wang, Wenzhe Qian, Xiaoyin Xu, David Gu, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13371
Pdf URL: https://arxiv.org/pdf/2601.13371
Copy Paste: [[2601.13371]] Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations(https://arxiv.org/abs/2601.13371)
Keywords: diffusion, generative
Abstract: A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.

Title: Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study

Authors: A. Nieto Juscafresa, Á. Mazcuñán Herreros, J. Sullivan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13416
Pdf URL: https://arxiv.org/pdf/2601.13416
Copy Paste: [[2601.13416]] Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study(https://arxiv.org/abs/2601.13416)
Keywords: diffusion, self-supervised, generative
Abstract: Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.

Title: SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement

Authors: Yujian Xiong, Xuanzhao Dong, Wenhui Zhu, Xin Li, Oana Dumitrascu, Yalin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13417
Pdf URL: https://arxiv.org/pdf/2601.13417
Copy Paste: [[2601.13417]] SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement(https://arxiv.org/abs/2601.13417)
Keywords: diffusion
Abstract: Retinal fundus photography is indispensable for ophthalmic screening and diagnosis, yet image quality is often degraded by noise, artifacts, and uneven illumination. Recent GAN- and diffusion-based enhancement methods improve perceptual quality by aligning degraded images with high-quality distributions, but our analysis shows that this focus can distort intra-class geometry: clinically related samples become dispersed, disease-class boundaries blur, and downstream tasks such as grading or lesion detection are harmed. The Gromov Wasserstein (GW) discrepancy offers a principled solution by aligning distributions through internal pairwise distances, naturally preserving intra-class structure, but its high computational cost restricts practical use. To overcome this, we propose SGW-GAN, the first framework to incorporate Sliced GW (SGW) into retinal image enhancement. SGW approximates GW via random projections, retaining relational fidelity while greatly reducing cost. Experiments on public datasets show that SGW-GAN produces visually compelling enhancements, achieves superior diabetic retinopathy grading, and reports the lowest GW discrepancy across disease labels, demonstrating both efficiency and clinical fidelity for unpaired medical image enhancement.

Title: Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation

Authors: Mohit Kakda, Mirudula Shri Muthukumaran, Uttapreksha Patel, Lawrence Swaminathan Xavier Prince
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13440
Pdf URL: https://arxiv.org/pdf/2601.13440
Copy Paste: [[2601.13440]] Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation(https://arxiv.org/abs/2601.13440)
Keywords: anomaly
Abstract: Vision-Language Models (VLMs), particularly CLIP, have revolutionized anomaly detection by enabling zero-shot and few-shot defect identification without extensive labeled datasets. By learning aligned representations of images and text, VLMs facilitate anomaly classification and segmentation through natural language descriptions of normal and abnormal states, eliminating traditional requirements for task-specific training or defect examples. This project presents a comprehensive analysis of VLM-based approaches for anomaly classification (AC) and anomaly segmentation (AS). We systematically investigate key architectural paradigms including sliding window-based dense feature extraction (WinCLIP), multi-stage feature alignment with learnable projections (AprilLab framework), and compositional prompt ensemble strategies. Our analysis evaluates these methods across critical dimensions: feature extraction mechanisms, text-visual alignment strategies, prompt engineering techniques, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Through rigorous experimentation on benchmarks such as MVTec AD and VisA, we compare classification accuracy, segmentation precision, and inference efficiency. The primary contribution is a foundational understanding of how and why VLMs succeed in anomaly detection, synthesizing practical insights for method selection and identifying current limitations. This work aims to facilitate informed adoption of VLM-based methods in industrial quality control and guide future research directions.

Title: BladeSDF : Unconditional and Conditional Generative Modeling of Representative Blade Geometries Using Signed Distance Functions

Authors: Ashish S. Nair, Sandipp Krishnan Ravi, Itzel Salgado, Changjie Sun, Sayan Ghosh, Liping Wang
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2601.13445
Pdf URL: https://arxiv.org/pdf/2601.13445
Copy Paste: [[2601.13445]] BladeSDF : Unconditional and Conditional Generative Modeling of Representative Blade Geometries Using Signed Distance Functions(https://arxiv.org/abs/2601.13445)
Keywords: generative
Abstract: Generative AI has emerged as a transformative paradigm in engineering design, enabling automated synthesis and reconstruction of complex 3D geometries while preserving feasibility and performance relevance. This paper introduces a domain-specific implicit generative framework for turbine blade geometry using DeepSDF, addressing critical gaps in performance-aware modeling and manufacturable design generation. The proposed method leverages a continuous signed distance function (SDF) representation to reconstruct and generate smooth, watertight geometries with quantified accuracy. It establishes an interpretable, near-Gaussian latent space that aligns with blade-relevant parameters, such as taper and chord ratios, enabling controlled exploration and unconditional synthesis through interpolation and Gaussian sampling. In addition, a compact neural network maps engineering descriptors, such as maximum directional strains, to latent codes, facilitating the generation of performance-informed geometry. The framework achieves high reconstruction fidelity, with surface distance errors concentrated within $1\%$ of the maximum blade dimension, and demonstrates robust generalization to unseen designs. By integrating constraints, objectives, and performance metrics, this approach advances beyond traditional 2D-guided or unconstrained 3D pipelines, offering a practical and interpretable solution for data-driven turbine blade modeling and concept generation.

Title: Preconditioning Benefits of Spectral Orthogonalization in Muon

Authors: Jianhao Ma, Yu Huang, Yuejie Chi, Yuxin Chen
Subjects: cs.LG, cs.AI, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2601.13474
Pdf URL: https://arxiv.org/pdf/2601.13474
Copy Paste: [[2601.13474]] Preconditioning Benefits of Spectral Orthogonalization in Muon(https://arxiv.org/abs/2601.13474)
Keywords: in-context
Abstract: The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the underlying mechanisms of Muon -- particularly the role of gradient orthogonalization -- remain poorly understood, with very few works providing end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization, and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with iteration complexities independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon's effectiveness in these matrix optimization problems and potentially beyond.

Title: GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models

Authors: Yang Yu, Yunze Deng, Yige Zhang, Yanjie Xiao, Youkun Ou, Wenhao Hu, Mingchao Li, Bin Feng, Wenyu Liu, Dandan Zheng, Jingdong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13524
Pdf URL: https://arxiv.org/pdf/2601.13524
Copy Paste: [[2601.13524]] GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models(https://arxiv.org/abs/2601.13524)
Keywords: diffusion
Abstract: Existing Image-based virtual try-on (VTON) methods primarily focus on single-layer or multi-garment VTON, neglecting multi-layer VTON (ML-VTON), which involves dressing multiple layers of garments onto the human body with realistic deformation and layering to generate visually plausible outcomes. The main challenge lies in accurately modeling occlusion relationships between inner and outer garments to reduce interference from redundant inner garment features. To address this, we propose GO-MLVTON, the first multi-layer VTON method, introducing the Garment Occlusion Learning module to learn occlusion relationships and the StableDiffusion-based Garment Morphing & Fitting module to deform and fit garments onto the human body, producing high-quality multi-layer try-on results. Additionally, we present the MLG dataset for this task and propose a new metric named Layered Appearance Coherence Difference (LACD) for evaluation. Extensive experiments demonstrate the state-of-the-art performance of GO-MLVTON. Project page: this https URL.

Title: DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis

Authors: Feng Ding, Wenhui Yi, Xinan He, Mengyao Xiao, Jianfeng Xu, Jianqiang Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13551
Pdf URL: https://arxiv.org/pdf/2601.13551
Copy Paste: [[2601.13551]] DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis(https://arxiv.org/abs/2601.13551)
Keywords: diffusion, generative
Abstract: Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at this https URL.

Title: Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework

Authors: Yanheng Li, Zhichen Pu, Lijiang Yang, Zehao Zhou, Yi Qin Gao
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2601.13564
Pdf URL: https://arxiv.org/pdf/2601.13564
Copy Paste: [[2601.13564]] Multi-objective fluorescent molecule design with a data-physics dual-driven generative framework(https://arxiv.org/abs/2601.13564)
Keywords: diffusion, generative
Abstract: Designing fluorescent small molecules with tailored optical and physicochemical properties requires navigating vast, underexplored chemical space while satisfying multiple objectives and constraints. Conventional generate-score-screen approaches become impractical under such realistic design specifications, owing to their low search efficiency, unreliable generalizability of machine-learning prediction, and the prohibitive cost of quantum chemical calculation. Here we present LUMOS, a data-and-physics driven framework for inverse design of fluorescent molecules. LUMOS couples generator and predictor within a shared latent representation, enabling direct specification-to-molecule design and efficient exploration. Moreover, LUMOS combines neural networks with a fast time-dependent density functional theory (TD-DFT) calculation workflow to build a suite of complementary predictors spanning different trade-offs in speed, accuracy, and generalizability, enabling reliable property prediction across diverse scenarios. Finally, LUMOS employs a property-guided diffusion model integrated with multi-objective evolutionary algorithms, enabling de novo design and molecular optimization under multiple objectives and constraints. Across comprehensive benchmarks, LUMOS consistently outperforms baseline models in terms of accuracy, generalizability and physical plausibility for fluorescence property prediction, and demonstrates superior performance in multi-objective scaffold- and fragment-level molecular optimization. Further validation using TD-DFT and molecular dynamics (MD) simulations demonstrates that LUMOS can generate valid fluorophores that meet various target specifications. Overall, these results establish LUMOS as a data-physics dual-driven framework for general fluorophore inverse design.

Title: Diffusion In Diffusion: Breaking the Autoregressive Bottleneck in Block Diffusion Models

Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13599
Pdf URL: https://arxiv.org/pdf/2601.13599
Copy Paste: [[2601.13599]] Diffusion In Diffusion: Breaking the Autoregressive Bottleneck in Block Diffusion Models(https://arxiv.org/abs/2601.13599)
Keywords: diffusion, generative
Abstract: Block diffusion language models, operating as semi-autoregressive paradigms, combine the strengths of both autoregressive and diffusion paradigms. However, their strict unidirectional block dependencies introduce irreversibility and sacrifice the global planning capabilities for which diffusion models are renowned. In order to address these issues, we propose Diffusion in Diffusion, a draft-then-refine framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models. Our approach first employs block diffusion to generate rapid drafts using small blocks, then refines these drafts through global bidirectional diffusion with a larger bidirectional receptive field. We utilise snapshot confidence remasking to identify the most critical tokens that require modification, and apply mix-scale training to expand the block diffusion model's global capabilities. Empirical results demonstrate that our approach sets a new benchmark for discrete diffusion models on the OpenWebText dataset. Using just 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.

Title: Towards Token-Level Text Anomaly Detection

Authors: Yang Cao, Bicheng Yu, Sikun Yang, Ming Liu, Yujiu Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13644
Pdf URL: https://arxiv.org/pdf/2601.13644
Copy Paste: [[2601.13644]] Towards Token-Level Text Anomaly Detection(https://arxiv.org/abs/2601.13644)
Keywords: anomaly
Abstract: Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis, unable to identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both document and token-levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews and grammar errors with token-level labels. Experimental results demonstrate that our framework get better performance than other 6 baselines, opening new possibilities for precise anomaly localization in text. All the codes and data are publicly available on this https URL.

Title: VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

Authors: Tiancheng Fang, Bowen Pan, Lingxi Chen, Jiangjing Lyu, Chengfei Lyu, Chaoyue Niu, Fan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13664
Pdf URL: https://arxiv.org/pdf/2601.13664
Copy Paste: [[2601.13664]] VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement(https://arxiv.org/abs/2601.13664)
Keywords: foundation model
Abstract: We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement--the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in large-model, big-data wave.

Title: CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks

Authors: Jiayu Lin, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13669
Pdf URL: https://arxiv.org/pdf/2601.13669
Copy Paste: [[2601.13669]] CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks(https://arxiv.org/abs/2601.13669)
Keywords: foundation model
Abstract: Large language models (LLMs) alignment ensures model behaviors reflect human value. Existing alignment strategies primarily follow two paths: one assumes a universal value set for a unified goal (i.e., one-size-fits-all), while the other treats every individual as unique to customize models (i.e., individual-level). However, assuming a monolithic value space marginalizes minority norms, while tailoring individual models is prohibitively expensive. Recognizing that human society is organized into social clusters with high intra-group value alignment, we propose community-level alignment as a "middle ground". Practically, we introduce CommunityBench, the first large-scale benchmark for community-level alignment evaluation, featuring four tasks grounded in Common Identity and Common Bond theory. With CommunityBench, we conduct a comprehensive evaluation of various foundation models on CommunityBench, revealing that current LLMs exhibit limited capacity to model community-specific preferences. Furthermore, we investigate the potential of community-level alignment in facilitating individual modeling, providing a promising direction for scalable and pluralistic alignment.

Title: Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Authors: Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13683
Pdf URL: https://arxiv.org/pdf/2601.13683
Copy Paste: [[2601.13683]] Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation(https://arxiv.org/abs/2601.13683)
Keywords: diffusion, generative
Abstract: Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformers (LiTs) models often come at the expense of generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the oversmoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) dynamic projection module, which facilitates the decoupling of token representations by learning with dynamically assigned knowledge; (ii) dynamic measure kernel, which provides a better similarity measurement to capture fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) token differential operator, which enables more robust query-to-key retrieval by calculating the differences between the tokens and their corresponding information redundancy produced by dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates our advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.

Title: Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction

Authors: Sayeed Shafayet Chowdhury, Snehasis Mukhopadhyay, Shiaofen Fang, Vijay R. Ramakrishnan
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2601.13710
Pdf URL: https://arxiv.org/pdf/2601.13710
Copy Paste: [[2601.13710]] Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction(https://arxiv.org/abs/2601.13710)
Keywords: generative
Abstract: Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85 % accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration across zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and physchology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI- augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.

Title: HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection

Authors: Daniel Kyselica, Jonáš Herec, Oliver Kutis, Rado Pitoňák
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13751
Pdf URL: https://arxiv.org/pdf/2601.13751
Copy Paste: [[2601.13751]] HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection(https://arxiv.org/abs/2601.13751)
Keywords: foundation model
Abstract: Natural disaster monitoring through continuous satellite observation requires processing multi-temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose History Injection mechanism for Transformer models (HiT), that maintains historical context from previous observations while reducing data storage by over 99\% of original image size. Moreover, testing on the STTORM-CD flood dataset confirms that the HiT mechanism within the Prithvi-tiny foundation model maintains detection accuracy compared to the bitemporal baseline. The proposed HiT-Prithvi model achieved 43 FPS on Jetson Orin Nano, a representative onboard hardware used in nanosats. This work establishes a practical framework for satellite-based continuous monitoring of natural disasters, supporting real-time hazard assessment without dependency on ground-based processing infrastructure. Architecture as well as model checkpoints is available at this https URL

Title: Orthogonium : A Unified, Efficient Library of Orthogonal and 1-Lipschitz Building Blocks

Authors: Thibaut Boissin (IRIT-MISFIT), Franck Mamalet, Valentin Lafargue (ANITI, IMT), Mathieu Serrurier (IRIT-MISFIT)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.13776
Pdf URL: https://arxiv.org/pdf/2601.13776
Copy Paste: [[2601.13776]] Orthogonium : A Unified, Efficient Library of Orthogonal and 1-Lipschitz Building Blocks(https://arxiv.org/abs/2601.13776)
Keywords: generative
Abstract: Orthogonal and 1-Lipschitz neural network layers are essential building blocks in robust deep learning architectures, crucial for certified adversarial robustness, stable generative models, and reliable recurrent networks. Despite significant advancements, existing implementations remain fragmented, limited, and computationally demanding. To address these issues, we introduce Orthogonium , a unified, efficient, and comprehensive PyTorch library providing orthogonal and 1-Lipschitz layers. Orthogonium provides access to standard convolution features-including support for strides, dilation, grouping, and transposed-while maintaining strict mathematical guarantees. Its optimized implementations reduce overhead on large scale benchmarks such as ImageNet. Moreover, rigorous testing within the library has uncovered critical errors in existing implementations, emphasizing the importance of standardized and reliable tools. Orthogonium thus significantly lowers adoption barriers, enabling scalable experimentation and integration across diverse applications requiring orthogonality and robust Lipschitz constraints. Orthogonium is available at this https URL.

Title: Principled Latent Diffusion for Graphs via Laplacian Autoencoders

Authors: Antoine Siraudin, Christopher Morris
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13780
Pdf URL: https://arxiv.org/pdf/2601.13780
Copy Paste: [[2601.13780]] Principled Latent Diffusion for Graphs via Laplacian Autoencoders(https://arxiv.org/abs/2601.13780)
Keywords: diffusion
Abstract: Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes -- and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion there. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps each node into a fixed-dimensional embedding from which the full adjacency is provably recoverable, enabling near-lossless reconstruction for both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, eliminating the quadratic bottleneck and making it feasible to train larger and more expressive models. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models, while achieving up to $1000\times$ speed-up.

Title: Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders

Authors: Kai Wittenmayer, Sukrut Rao, Amin Parchami-Araghi, Bernt Schiele, Jonas Fischer
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.13798
Pdf URL: https://arxiv.org/pdf/2601.13798
Copy Paste: [[2601.13798]] Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders(https://arxiv.org/abs/2601.13798)
Keywords: foundation model
Abstract: Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision-making hard. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at various granularities. Examining local co-occurrence dependencies of concepts allows us to define concept relationships. Through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight provides performance on classification and segmentation that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code is available at this https URL.

Title: Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Authors: Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang, Kai Yu, Chunyu Qiang, Chen Zhang, Xie Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.13802
Pdf URL: https://arxiv.org/pdf/2601.13802
Copy Paste: [[2601.13802]] Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis(https://arxiv.org/abs/2601.13802)
Keywords: in-context
Abstract: A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically-informed curriculum learning. Our approach outperforms the leading commercial service in generation quality, while maintaining extensibility through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the model, along with creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges in and establishing evaluation standards for the process, we aim to provide a solid groundwork for subsequent research. Resources at this https URL .

Title: The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

Authors: Sam OConnor Russell, Delphine Charuau, Naomi Harte
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.13835
Pdf URL: https://arxiv.org/pdf/2601.13835
Copy Paste: [[2601.13835]] The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations(https://arxiv.org/abs/2601.13835)
Keywords: self-supervised
Abstract: Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent in CPC-based and wav2vec2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.

Title: FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Authors: Xinya Ji, Sebastian Weiss, Manuel Kansy, Jacek Naruniec, Xun Cao, Barbara Solenthaler, Derek Bradley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13837
Pdf URL: https://arxiv.org/pdf/2601.13837
Copy Paste: [[2601.13837]] FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation(https://arxiv.org/abs/2601.13837)
Keywords: diffusion
Abstract: Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.

Title: Inverting Self-Organizing Maps: A Unified Activation-Based Framework

Authors: Alessandro Londei, Matteo Benati, Denise Lanzieri, Vittorio Loreto
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.13851
Pdf URL: https://arxiv.org/pdf/2601.13851
Copy Paste: [[2601.13851]] Inverting Self-Organizing Maps: A Unified Activation-Based Framework(https://arxiv.org/abs/2601.13851)
Keywords: generative
Abstract: Self-Organizing Maps provide topology-preserving projections of high-dimensional data and have been widely used for visualization, clustering, and vector quantization. In this work, we show that the activation pattern of a SOM - the squared distances to its prototypes - can be inverted to recover the exact input under mild geometric conditions. This follows from a classical fact in Euclidean distance geometry: a point in $D$ dimensions is uniquely determined by its distances to $D{+}1$ affinely independent references. We derive the corresponding linear system and characterize the conditions under which the inversion is well-posed. Building upon this mechanism, we introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule, which enables controlled, semantically meaningful trajectories in latent space. MUSIC modifies squared distances to selected prototypes while preserving others, resulting in a deterministic geometric flow aligned with the SOM's piecewise-linear structure. Tikhonov regularization stabilizes the update rule and ensures smooth motion on high-dimensional datasets. Unlike variational or probabilistic generative models, MUSIC does not rely on sampling, latent priors, or encoder-decoder architectures. If no perturbation is applied, inversion recovers the exact input; when a target cluster or prototype is specified, MUSIC produces coherent semantic variations while remaining on the data manifold. This leads to a new perspective on data augmentation and controllable latent exploration based solely on prototype geometry. We validate the approach using synthetic Gaussian mixtures, the MNIST and the Faces in the Wild dataset. Across all settings, MUSIC produces smooth, interpretable trajectories that reveal the underlying geometry of the learned manifold, illustrating the advantages of SOM-based inversion over unsupervised clustering.

Title: OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting

Authors: Michail Spanakis, Iason Oikonomidis, Antonis Argyros
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13871
Pdf URL: https://arxiv.org/pdf/2601.13871
Copy Paste: [[2601.13871]] OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting(https://arxiv.org/abs/2601.13871)
Keywords: foundation model
Abstract: Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at this https URL.

Title: Revisiting Multi-Task Visual Representation Learning

Authors: Shangzhe Di, Zhonghua Zhai, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13886
Pdf URL: https://arxiv.org/pdf/2601.13886
Copy Paste: [[2601.13886]] Revisiting Multi-Task Visual Representation Learning(https://arxiv.org/abs/2601.13886)
Keywords: self-supervised
Abstract: Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.

Title: Multi-Objective Hierarchical Optimization with Large Language Models

Authors: Andrej Schwanke, Lyubomir Ivanov, David Salinas, Frank Hutter, Arber Zela
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.13892
Pdf URL: https://arxiv.org/pdf/2601.13892
Copy Paste: [[2601.13892]] Multi-Objective Hierarchical Optimization with Large Language Models(https://arxiv.org/abs/2601.13892)
Keywords: generative
Abstract: Despite their widespread adoption in various domains, especially due to their powerful reasoning capabilities, Large Language Models (LLMs) are not the off-the-shelf choice to drive multi-objective optimization yet. Conventional strategies rank high in benchmarks due to their intrinsic capabilities to handle numerical inputs and careful modelling choices that balance exploration and Pareto-front exploitation, as well as handle multiple (conflicting) objectives. In this paper, we close this gap by leveraging LLMs as surrogate models and candidate samplers inside a structured hierarchical search strategy. By adaptively partitioning the input space into disjoint hyperrectangular regions and ranking them with a composite score function, we restrict the generative process of the LLM to specific, high-potential sub-spaces, hence making the problem easier to solve as the LLM doesn't have to reason about the global structure of the problem, but only locally instead. We show that under standard regularity assumptions, our algorithm generates candidate solutions that converge to the true Pareto set in Hausdorff distance. Empirically, it consistently outperforms the global LLM-based multi-objective optimizer and is on par with standard evolutionary and Bayesian optimization algorithm on synthetic and real-world benchmarks.

Title: VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content

Authors: Shengyi Wu, Yan Hong, Shengyao Chen, Zheng Wang, Xianbing Sun, Jiahui Zhan, Jun Lan, Jianfu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.13951
Pdf URL: https://arxiv.org/pdf/2601.13951
Copy Paste: [[2601.13951]] VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content(https://arxiv.org/abs/2601.13951)
Keywords: generative
Abstract: With the rapid advancement of generative AI, virtual try-on (VTON) systems are becoming increasingly common in e-commerce and digital entertainment. However, the growing realism of AI-generated try-on content raises pressing concerns about authenticity and responsible use. To address this, we present VTONGuard, a large-scale benchmark dataset containing over 775,000 real and synthetic try-on images. The dataset covers diverse real-world conditions, including variations in pose, background, and garment styles, and provides both authentic and manipulated examples. Based on this benchmark, we conduct a systematic evaluation of multiple detection paradigms under unified training and testing protocols. Our results reveal each method's strengths and weaknesses and highlight the persistent challenge of cross-paradigm generalization. To further advance detection, we design a multi-task framework that integrates auxiliary segmentation to enhance boundary-aware feature learning, achieving the best overall performance on VTONGuard. We expect this benchmark to enable fair comparisons, facilitate the development of more robust detection models, and promote the safe and responsible deployment of VTON technologies in practice.

Title: RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning

Authors: Cheol-Hui Lee, Hwa-Yeon Lee, Dong-Joo Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2601.13964
Pdf URL: https://arxiv.org/pdf/2601.13964
Copy Paste: [[2601.13964]] RL-BioAug: Label-Efficient Reinforcement Learning for Self-Supervised EEG Representation Learning(https://arxiv.org/abs/2601.13964)
Keywords: self-supervised
Abstract: The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals where statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10\%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69\% and 8.80\% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, this agent mainly chose optimal strategies for each task -- for example, Time Masking with a 62\% probability for sleep stage classification and Crop \& Resize with a 77\% probability for seizure detection. Our framework suggests its potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at \href{this https URL}{this https URL}.

Title: Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution

Authors: Samuel W. Remedios, Zhangxing Bian, Shuwen Wei, Aaron Carass, Jerry L. Prince, Blake E. Dewey
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14030
Pdf URL: https://arxiv.org/pdf/2601.14030
Copy Paste: [[2601.14030]] Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution(https://arxiv.org/abs/2601.14030)
Keywords: diffusion, generative
Abstract: Diffusion models are the current state-of-the-art for solving inverse problems in imaging. Their impressive generative capability allows them to approximate sampling from a prior distribution, which alongside a known likelihood function permits posterior sampling without retraining the model. While recent methods have made strides in advancing the accuracy of posterior sampling, the majority focuses on single-image inverse problems. However, for modalities such as magnetic resonance imaging (MRI), it is common to acquire multiple complementary measurements, each low-resolution along a different axis. In this work, we generalize common diffusion-based inverse single-image problem solvers for multi-image super-resolution (MISR) MRI. We show that the DPS likelihood correction allows an exactly-separable gradient decomposition across independently acquired measurements, enabling MISR without constructing a joint operator, modifying the diffusion model, or increasing network function evaluations. We derive MISR versions of DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, and demonstrate substantial gains over SISR across $4\times/8\times/16\times$ anisotropic degradations. Our results achieve state-of-the-art super-resolution of anisotropic MRI volumes and, critically, enable reconstruction of near-isotropic anatomy from routine 2D multi-slice acquisitions, which are otherwise highly degraded in orthogonal views.

Title: RM-Distiller: Exploiting Generative LLM for Reward Model Distillation

Authors: Hongli Zhou, Hui Huang, Wei Liu, Chenglong Wang, Xingyuan Bu, Lvyuan Han, Fuhai Song, Muyun Yang, Wenhao Jiang, Hailong Cao, Tiejun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.14032
Pdf URL: https://arxiv.org/pdf/2601.14032
Copy Paste: [[2601.14032]] RM-Distiller: Exploiting Generative LLM for Reward Model Distillation(https://arxiv.org/abs/2601.14032)
Keywords: generative
Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.

Title: Top 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants

Authors: Yunhe Wang, Kai Han, Huiling Zhen, Yuchuan Tian, Hanting Chen, Yongbing Huang, Yufei Cui, Yingte Shu, Shan Gao, Ismail Elezi, Roy Vaughan Miles, Songcen Xu, Feng Wen, Chao Xu, Sinan Zeng, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14041
Pdf URL: https://arxiv.org/pdf/2601.14041
Copy Paste: [[2601.14041]] Top 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants(https://arxiv.org/abs/2601.14041)
Keywords: diffusion
Abstract: The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential ``brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their ``GPT-4 moment''. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.

Title: LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

Authors: Badri N. Patro, Vijay S. Agneeswaran
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, eess.IV
Abstract URL: https://arxiv.org/abs/2601.14053
Pdf URL: https://arxiv.org/pdf/2601.14053
Copy Paste: [[2601.14053]] LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems(https://arxiv.org/abs/2601.14053)
Keywords: generative
Abstract: The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at <$0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.

Title: POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion

Authors: Andrea Rigo, Luca Stornaiuolo, Weijie Wang, Mauro Martino, Bruno Lepri, Nicu Sebe
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14056
Pdf URL: https://arxiv.org/pdf/2601.14056
Copy Paste: [[2601.14056]] POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion(https://arxiv.org/abs/2601.14056)
Keywords: diffusion, generative
Abstract: We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.

Title: VERIDAH: Solving Enumeration Anomaly Aware Vertebra Labeling across Imaging Sequences

Authors: Hendrik Möller, Hanna Schoen, Robert Graf, Matan Atad, Nathan Molinier, Anjany Sekuboyina, Bettina K. Budai, Fabian Bamberg, Steffen Ringhof, Christopher Schlett, Tobias Pischon, Thoralf Niendorf, Josua A. Decker, Marc-André Weber, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14066
Pdf URL: https://arxiv.org/pdf/2601.14066
Copy Paste: [[2601.14066]] VERIDAH: Solving Enumeration Anomaly Aware Vertebra Labeling across Imaging Sequences(https://arxiv.org/abs/2601.14066)
Keywords: anomaly
Abstract: The human spine commonly consists of seven cervical, twelve thoracic, and five lumbar vertebrae. However, enumeration anomalies may result in individuals having eleven or thirteen thoracic vertebrae and four or six lumbar vertebrae. Although the identification of enumeration anomalies has potential clinical implications for chronic back pain and operation planning, the thoracolumbar junction is often poorly assessed and rarely described in clinical reports. Additionally, even though multiple deep-learning-based vertebra labeling algorithms exist, there is a lack of methods to automatically label enumeration anomalies. Our work closes that gap by introducing "Vertebra Identification with Anomaly Handling" (VERIDAH), a novel vertebra labeling algorithm based on multiple classification heads combined with a weighted vertebra sequence prediction algorithm. We show that our approach surpasses existing models on T2w TSE sagittal (98.30% vs. 94.24% of subjects with all vertebrae correctly labeled, p < 0.001) and CT imaging (99.18% vs. 77.26% of subjects with all vertebrae correctly labeled, p < 0.001) and works in arbitrary field-of-view images. VERIDAH correctly labeled the presence 2 Möller et al. of thoracic enumeration anomalies in 87.80% and 96.30% of T2w and CT images, respectively, and lumbar enumeration anomalies in 94.48% and 97.22% for T2w and CT, respectively. Our code and models are available at: this https URL.

Title: Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing

Authors: Xiaolu Liu, Yicong Li, Qiyuan He, Jiayin Zhu, Wei Ji, Angela Yao, Jianke Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14103
Pdf URL: https://arxiv.org/pdf/2601.14103
Copy Paste: [[2601.14103]] Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing(https://arxiv.org/abs/2601.14103)
Keywords: generative
Abstract: Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results from fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at this https URL.

Title: Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic

Authors: Saad Mankarious, Aya Zirikly
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14124
Pdf URL: https://arxiv.org/pdf/2601.14124
Copy Paste: [[2601.14124]] Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic(https://arxiv.org/abs/2601.14124)
Keywords: diffusion
Abstract: Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.

Title: One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

Authors: Yitong Dong, Qi Zhang, Minchao Jiang, Zhiqiang Wu, Qingnan Fan, Ying Feng, Huaqi Zhang, Hujun Bao, Guofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14161
Pdf URL: https://arxiv.org/pdf/2601.14161
Copy Paste: [[2601.14161]] One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion(https://arxiv.org/abs/2601.14161)
Keywords: diffusion, generative
Abstract: We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

Title: Progressive self-supervised blind-spot denoising method for LDCT denoising

Authors: Yichao Liu, Yueyang Teng, Junwen Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14180
Pdf URL: https://arxiv.org/pdf/2601.14180
Copy Paste: [[2601.14180]] Progressive self-supervised blind-spot denoising method for LDCT denoising(https://arxiv.org/abs/2601.14180)
Keywords: self-supervised
Abstract: Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.

Title: IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

Authors: Liang Shi, Wei Li, Kevin M Beussman, Lin Chen, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14188
Pdf URL: https://arxiv.org/pdf/2601.14188
Copy Paste: [[2601.14188]] IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models(https://arxiv.org/abs/2601.14188)
Keywords: in-context
Abstract: Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical application of VLMs, e.g. where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet and general objects as the instances at task.

Title: Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment

Authors: Punit Kumar, Vaibhav Saran, Divyesh Patel, Nitin Kulkarni, Alina Vereshchaka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2601.14228
Pdf URL: https://arxiv.org/pdf/2601.14228
Copy Paste: [[2601.14228]] Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment(https://arxiv.org/abs/2601.14228)
Keywords: diffusion
Abstract: Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.

Title: Q-learning with Adjoint Matching

Authors: Qiyang Li, Sergey Levine
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2601.14234
Pdf URL: https://arxiv.org/pdf/2601.14234
Copy Paste: [[2601.14234]] Q-learning with Adjoint Matching(https://arxiv.org/abs/2601.14234)
Keywords: diffusion, generative
Abstract: We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

Title: Soft Tail-dropping for Adaptive Visual Tokenization

Authors: Zeyuan Chen, Kai Zhang, Zhuowen Tu, Yuanjun Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14246
Pdf URL: https://arxiv.org/pdf/2601.14246
Copy Paste: [[2601.14246]] Soft Tail-dropping for Adaptive Visual Tokenization(https://arxiv.org/abs/2601.14246)
Keywords: generative
Abstract: We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.

Title: VideoMaMa: Mask-Guided Video Matting via Generative Prior

Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2601.14255
Pdf URL: https://arxiv.org/pdf/2601.14255
Copy Paste: [[2601.14255]] VideoMaMa: Mask-Guided Video Matting via Generative Prior(https://arxiv.org/abs/2601.14255)
Keywords: diffusion, generative
Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

Title: Implicit Neural Representation Facilitates Unified Universal Vision Encoding

Authors: Matthew Gwilliam, Xiao Wang, Xuefeng Hu, Zhenheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2601.14256
Pdf URL: https://arxiv.org/pdf/2601.14256
Copy Paste: [[2601.14256]] Implicit Neural Representation Facilitates Unified Universal Vision Encoding(https://arxiv.org/abs/2601.14256)
Keywords: generative
Abstract: Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.