2025-09-23

Title: Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception

Authors: Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Jilong Guo, Jun Yang, Xinyu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16277
Pdf URL: https://arxiv.org/pdf/2509.16277
Copy Paste: [[2509.16277]] Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception(https://arxiv.org/abs/2509.16277)
Keywords: anomaly
Abstract: Deep perception networks in autonomous driving traditionally rely on data-intensive training regimes and post-hoc anomaly detection, often disregarding fundamental information-theoretic constraints governing stable information processing. We reconceptualize deep neural encoders as hierarchical communication chains that incrementally compress raw sensory inputs into task-relevant latent features. Within this framework, we establish two theoretically justified design principles for robust perception: (D1) smooth variation of mutual information between consecutive layers, and (D2) monotonic decay of latent entropy with network depth. Our analysis shows that, under realistic architectural assumptions, particularly blocks comprising repeated layers of similar capacity, enforcing smooth information flow (D1) naturally encourages entropy decay (D2), thus ensuring stable compression. Guided by these insights, we propose Eloss, a novel entropy-based regularizer designed as a lightweight, plug-and-play training objective. Rather than marginal accuracy improvements, this approach represents a conceptual shift: it unifies information-theoretic stability with standard perception tasks, enabling explicit, principled detection of anomalous sensor inputs through entropy deviations. Experimental validation on large-scale 3D object detection benchmarks (KITTI and nuScenes) demonstrates that incorporating Eloss consistently achieves competitive or improved accuracy while dramatically enhancing sensitivity to anomalies, amplifying distribution-shift signals by up to two orders of magnitude. This stable information-compression perspective not only improves interpretability but also establishes a solid theoretical foundation for safer, more robust autonomous driving perception systems.

Title: Estimating Clinical Lab Test Result Trajectories from PPG using Physiological Foundation Model and Patient-Aware State Space Model -- a UNIPHY+ Approach

Authors: Minxiao Wang, Runze Yan, Carol Li, Saurabh Kataria, Xiao Hu, Matthew Clark, Timothy Ruchti, Timothy G. Buchman, Sivasubramanium V Bhavani, Randall J. Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16345
Pdf URL: https://arxiv.org/pdf/2509.16345
Copy Paste: [[2509.16345]] Estimating Clinical Lab Test Result Trajectories from PPG using Physiological Foundation Model and Patient-Aware State Space Model -- a UNIPHY+ Approach(https://arxiv.org/abs/2509.16345)
Keywords: foundation model
Abstract: Clinical laboratory tests provide essential biochemical measurements for diagnosis and treatment, but are limited by intermittent and invasive sampling. In contrast, photoplethysmogram (PPG) is a non-invasive, continuously recorded signal in intensive care units (ICUs) that reflects cardiovascular dynamics and can serve as a proxy for latent physiological changes. We propose UNIPHY+Lab, a framework that combines a large-scale PPG foundation model for local waveform encoding with a patient-aware Mamba model for long-range temporal modeling. Our architecture addresses three challenges: (1) capturing extended temporal trends in laboratory values, (2) accounting for patient-specific baseline variation via FiLM-modulated initial states, and (3) performing multi-task estimation for interrelated biomarkers. We evaluate our method on two ICU datasets for predicting five key laboratory tests. The results show substantial improvements over the LSTM and carry-forward baselines in MAE, RMSE, and $R^2$ for most of the estimation targets. This work demonstrates the feasibility of continuous, personalized lab value estimation from routine PPG monitoring, offering a pathway toward non-invasive biochemical surveillance in critical care.
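
Illustrative note on challenge (2): FiLM (feature-wise linear modulation) scales and shifts a feature vector with parameters computed from a conditioning input. The abstract does not say how UNIPHY+Lab produces these parameters, so the sketch below assumes a single linear map from a patient feature vector; every name in it (`linear`, `film_initial_state`, the weights) is hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film_initial_state(h0: Vector, patient: Vector,
                       w_gamma: Matrix, b_gamma: Vector,
                       w_beta: Matrix, b_beta: Vector) -> Vector:
    """Return gamma * h0 + beta, with gamma and beta derived from patient features.

    Assumption: gamma and beta come from simple linear maps; the paper may
    use a different conditioning network.
    """
    gamma = linear(patient, w_gamma, b_gamma)  # per-dimension scale
    beta = linear(patient, w_beta, b_beta)     # per-dimension shift
    return [g * h + b for g, h, b in zip(gamma, h0, beta)]


# Toy values: a 2-dimensional state and a single patient feature.
state = film_initial_state(
    h0=[1.0, -2.0], patient=[0.5],
    w_gamma=[[2.0], [0.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0], [4.0]], b_beta=[0.0, 0.0],
)
```

With these toy values the scale is [2.0, 1.0] and the shift is [0.0, 2.0], so the same shared initial state is moved to a patient-specific starting point before the sequence model runs.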

Title: From Canopy to Ground via ForestGen3D: Learning Cross-Domain Generation of 3D Forest Structure from Aerial-to-Terrestrial LiDAR

Authors: Juan Castorena, E. Louise Loudermilk, Scott Pokswinski, Rodman Linn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16346
Pdf URL: https://arxiv.org/pdf/2509.16346
Copy Paste: [[2509.16346]] From Canopy to Ground via ForestGen3D: Learning Cross-Domain Generation of 3D Forest Structure from Aerial-to-Terrestrial LiDAR(https://arxiv.org/abs/2509.16346)
Keywords: diffusion, generative
Abstract: The 3D structure of living and non-living components in ecosystems plays a critical role in determining ecological processes and feedbacks from both natural and human-driven disturbances. Anticipating the effects of wildfire, drought, disease, or atmospheric deposition depends on accurate characterization of 3D vegetation structure, yet widespread measurement remains prohibitively expensive and often infeasible. We introduce ForestGen3D, a novel generative modeling framework that synthesizes high-fidelity 3D forest structure using only aerial LiDAR (ALS) inputs. ForestGen3D is based on conditional denoising diffusion probabilistic models (DDPMs) trained on co-registered ALS/TLS (terrestrial LiDAR) data. The model learns to generate TLS-like 3D point clouds conditioned on sparse ALS observations, effectively reconstructing occluded sub-canopy detail at scale. To ensure ecological plausibility, we introduce a geometric containment prior based on the convex hull of ALS observations and provide theoretical and empirical guarantees that generated structures remain spatially consistent. We evaluate ForestGen3D at tree, plot, and landscape scales using real-world data from mixed conifer ecosystems, and show that it produces high-fidelity reconstructions that closely match TLS references in terms of geometric similarity and biophysical metrics, such as tree height, DBH, crown diameter and crown volume. Additionally, we demonstrate that the containment property can serve as a practical proxy for generation quality in settings where TLS ground truth is unavailable. Our results position ForestGen3D as a scalable tool for ecological modeling, wildfire simulation, and structural fuel characterization in ALS-only environments.

Title: Guided Sequence-Structure Generative Modeling for Iterative Antibody Optimization

Authors: Aniruddh Raghu, Sebastian Ober, Maxwell Kazman, Hunter Elliott
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16357
Pdf URL: https://arxiv.org/pdf/2509.16357
Copy Paste: [[2509.16357]] Guided Sequence-Structure Generative Modeling for Iterative Antibody Optimization(https://arxiv.org/abs/2509.16357)
Keywords: diffusion, generative
Abstract: Therapeutic antibody candidates often require extensive engineering to improve key functional and developability properties before clinical development. This can be achieved through iterative design, where starting molecules are optimized over several rounds of in vitro experiments. While protein structure can provide a strong inductive bias, it is rarely used in iterative design due to the lack of structural data for continually evolving lead molecules over the course of optimization. In this work, we propose a strategy for iterative antibody optimization that leverages both sequence and structure as well as accumulating lab measurements of binding and developability. Building on prior work, we first train a sequence-structure diffusion generative model that operates on antibody-antigen complexes. We then outline an approach to use this model, together with carefully predicted antibody-antigen complexes, to optimize lead candidates throughout the iterative design process. Further, we describe a guided sampling approach that biases generation toward desirable properties by integrating models trained on experimental data from iterative design. We evaluate our approach in multiple in silico and in vitro experiments, demonstrating that it produces high-affinity binders at multiple stages of an active antibody optimization campaign.

Title: Introducing Resizable Region Packing Problem in Image Generation, with a Heuristic Solution

Authors: Hrishikesh Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16363
Pdf URL: https://arxiv.org/pdf/2509.16363
Copy Paste: [[2509.16363]] Introducing Resizable Region Packing Problem in Image Generation, with a Heuristic Solution(https://arxiv.org/abs/2509.16363)
Keywords: generative, anomaly
Abstract: Image data generation in computer vision has traditionally been harder to solve than discriminative problems. Such data generation entails placing relevant objects, each of an appropriate size, at meaningful locations in a scene canvas. There have been two classes of popular approaches to such generation: graphics-based and generative-model-based. Optimization problems are known to lurk in the background for both classes of approaches. In this paper, we introduce a novel, practically useful manifestation of the classical Bin Packing problem in the context of synthetic image data generation. We conjecture that the newly introduced problem, the Resizable Anchored Region Packing (RARP) Problem, is NP-hard, and provide detailed arguments for our conjecture. As a first solution, we present a novel heuristic algorithm that is generic enough to scale and pack an arbitrary number of arbitrarily shaped regions at arbitrary locations into an image canvas. The algorithm follows a greedy approach to iteratively pack region pairs in a careful way, while obeying the optimization constraints. The algorithm is validated by an implementation that was used to generate a large-scale synthetic anomaly detection dataset, with highly varying bin packing parameters per image sample, i.e., per RARP instance. Visual inspection of such data and checking of the correctness of each solution confirm the effectiveness of our algorithm. With generative modeling on the rise in deep learning, and synthetic data generation poised to become mainstream, we expect that the newly introduced problem will be valued in the imaging scientific community.

Title: StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes

Authors: Zhengri Wu, Yiran Wang, Yu Wen, Zeyu Zhang, Biao Wu, Hao Tang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.16415
Pdf URL: https://arxiv.org/pdf/2509.16415
Copy Paste: [[2509.16415]] StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes(https://arxiv.org/abs/2509.16415)
Keywords: self-supervised
Abstract: Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach. Code: this https URL. Website: this https URL.

Title: TractoTransformer: Diffusion MRI Streamline Tractography using CNN and Transformer Networks

Authors: Itzik Waizman, Yakov Gusakov, Itay Benou, Tammy Riklin Raviv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16429
Pdf URL: https://arxiv.org/pdf/2509.16429
Copy Paste: [[2509.16429]] TractoTransformer: Diffusion MRI Streamline Tractography using CNN and Transformer Networks(https://arxiv.org/abs/2509.16429)
Keywords: diffusion
Abstract: White matter tractography is an advanced neuroimaging technique that reconstructs the 3D white matter pathways of the brain from diffusion MRI data. It can be framed as a pathfinding problem aiming to infer neural fiber trajectories from noisy and ambiguous measurements, facing challenges such as crossing, merging, and fanning white-matter configurations. In this paper, we propose a novel tractography method that leverages Transformers to model the sequential nature of white matter streamlines, enabling the prediction of fiber directions by integrating both the trajectory context and current diffusion MRI measurements. To incorporate spatial information, we utilize CNNs that extract microstructural features from local neighborhoods around each voxel. By combining these complementary sources of information, our approach improves the precision and completeness of neural pathway mapping compared to traditional tractography models. We evaluate our method with the Tractometer toolkit, achieving competitive performance against state-of-the-art approaches, and present qualitative results on the TractoInferno dataset, demonstrating strong generalization to real-world data.

Title: Improved mmFormer for Liver Fibrosis Staging via Missing-Modality Compensation

Authors: Zhejia Zhang, Junjie Wang, Le Zhang (University of Birmingham, UK)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16436
Pdf URL: https://arxiv.org/pdf/2509.16436
Copy Paste: [[2509.16436]] Improved mmFormer for Liver Fibrosis Staging via Missing-Modality Compensation(https://arxiv.org/abs/2509.16436)
Keywords: diffusion
Abstract: In real-world clinical settings, magnetic resonance imaging (MRI) frequently suffers from missing modalities due to equipment variability or patient cooperation issues, which can significantly affect model performance. To address this issue, we propose a multimodal MRI classification model based on the mmFormer architecture with an adaptive module for handling arbitrary combinations of missing modalities. Specifically, this model retains the hybrid modality-specific encoders and the modality-correlated encoder from mmFormer to extract consistent lesion features across available modalities. In addition, we integrate a missing-modality compensation module which leverages zero-padding, modality availability masks, and a Delta Function with learnable statistical parameters to dynamically synthesize proxy features for recovering missing information. To further improve prediction performance, we adopt a cross-validation ensemble strategy by training multiple models on different folds and applying soft voting during inference. This method is evaluated on the test set of the Comprehensive Analysis & Computing of REal-world medical images (CARE) 2025 challenge, targeting the Liver Fibrosis Staging (LiFS) task based on non-contrast dynamic MRI scans including T1-weighted imaging (T1WI), T2-weighted imaging (T2WI), and diffusion-weighted imaging (DWI). For Cirrhosis Detection and Substantial Fibrosis Detection on in-distribution vendors, our model obtains accuracies of 66.67% and 74.17%, with corresponding area under the curve (AUC) scores of 71.73% and 68.48%, respectively.
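
Illustrative note on the ensemble step: soft voting averages the class probabilities predicted by each fold model and picks the class with the highest mean, rather than counting per-model votes. The sketch below is a generic version of that idea, not the authors' code; the function name and the probability values are made up for illustration.

```python
from typing import List


def soft_vote(fold_probs: List[List[float]]) -> int:
    """Average per-class probabilities across models and return the argmax class."""
    n_models = len(fold_probs)
    n_classes = len(fold_probs[0])
    mean_probs = [sum(p[c] for p in fold_probs) / n_models
                  for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean_probs[c])


# Hypothetical outputs of three fold models for one scan (classes 0 and 1).
# Two models lean weakly towards class 1, one is confident in class 0.
fold_outputs = [[0.9, 0.1], [0.45, 0.55], [0.45, 0.55]]
prediction = soft_vote(fold_outputs)
```

In this example a majority (hard) vote would pick class 1, while soft voting picks class 0 because the mean probability for class 0 is higher. That sensitivity to model confidence is the usual reason for preferring soft voting in small ensembles.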

Title: Local Mechanisms of Compositional Generalization in Conditional Diffusion

Authors: Arwen Bradley
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16447
Pdf URL: https://arxiv.org/pdf/2509.16447
Copy Paste: [[2509.16447]] Local Mechanisms of Compositional Generalization in Conditional Diffusion(https://arxiv.org/abs/2509.16447)
Keywords: diffusion
Abstract: Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure ("conditional projective composition") (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners ("local conditional scores"). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure in SDXL.

Title: Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations

Authors: Yunzhe Wang, Gale M. Lucas, Burcin Becerik-Gerber, Volkan Ustun
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.16457
Pdf URL: https://arxiv.org/pdf/2509.16457
Copy Paste: [[2509.16457]] Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations(https://arxiv.org/abs/2509.16457)
Keywords: generative
Abstract: Language-driven generative agents have enabled large-scale social simulations with transformative uses, from interpersonal training to aiding global policy-making. However, recent studies indicate that generative agent behaviors often deviate from expert expectations and real-world data--a phenomenon we term the Behavior-Realism Gap. To address this, we introduce a theoretical framework called Persona-Environment Behavioral Alignment (PEBA), formulated as a distribution matching problem grounded in Lewin's behavior equation stating that behavior is a function of the person and their environment. Leveraging PEBA, we propose PersonaEvolve (PEvo), an LLM-based optimization algorithm that iteratively refines agent personas, implicitly aligning their collective behaviors with realistic expert benchmarks within a specified environmental context. We validate PEvo in an active shooter incident simulation we developed, achieving an 84% average reduction in distributional divergence compared to no steering and a 34% improvement over explicit instruction baselines. Results also show PEvo-refined personas generalize to novel, related simulation scenarios. Our method greatly enhances behavioral realism and reliability in high-stakes social simulations. More broadly, the PEBA-PEvo framework provides a principled approach to developing trustworthy LLM-driven social simulations.

Title: Cross-Corpus and Cross-domain Handwriting Assessment of NeuroDegenerative Diseases via Time-Series-to-Image Conversion

Authors: Gabrielle Chavez, Laureano Moro-Velazquez, Ankur Butala, Najim Dehak, Thomas Thebaud
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16474
Pdf URL: https://arxiv.org/pdf/2509.16474
Copy Paste: [[2509.16474]] Cross-Corpus and Cross-domain Handwriting Assessment of NeuroDegenerative Diseases via Time-Series-to-Image Conversion(https://arxiv.org/abs/2509.16474)
Keywords: generative
Abstract: Handwriting is significantly affected by neurological disorders (ND) such as Parkinson's disease (PD) and Alzheimer's disease (AD). Prior works have analyzed handwriting tasks using feature-based approaches or computer-vision techniques, but these methods have struggled to generalize across multiple datasets, particularly between temporal features represented as time-series and images. We propose a framework that leverages both time-series and images of handwriting through a joint classifier, based on a ResNet50 pretrained on ImageNet-1k. Binary classification experiments demonstrate state-of-the-art performances on existing time-series and image datasets, with significant improvement on specific drawing and writing tasks from the NeuroLogical Signals (NLS) dataset. In particular, the proposed model demonstrates improved performance on Draw Clock and Spiral tasks. Additionally, cross-dataset and multi-dataset experiments were consistently able to achieve high F1 scores, up to 98 for PD detection, highlighting the potential of the proposed model to generalize over different forms of handwriting signals, and enhance the detection of motor deficits in ND.

Title: Octree Latent Diffusion for Semantic 3D Scene Generation and Completion

Authors: Xujia Zhang, Brendan Crowe, Christoffer Heckman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16483
Pdf URL: https://arxiv.org/pdf/2509.16483
Copy Paste: [[2509.16483]] Octree Latent Diffusion for Semantic 3D Scene Generation and Completion(https://arxiv.org/abs/2509.16483)
Keywords: diffusion
Abstract: The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them one-off. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g. indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting or outpainting, respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.

Title: FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG

Authors: Lovely Yeswanth Panchumarthi, Saurabh Kataria, Yi Wu, Xiao Hu, Alex Fedorov, Hyunjung Gloria Kwak
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2509.16491
Pdf URL: https://arxiv.org/pdf/2509.16491
Copy Paste: [[2509.16491]] FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG(https://arxiv.org/abs/2509.16491)
Keywords: foundation model
Abstract: Foundation models pretrained on physiological data such as photoplethysmography (PPG) signals are increasingly used to improve heart rate (HR) prediction across diverse settings. Fine-tuning these models for local deployment is often seen as a practical and scalable strategy. However, its impact on demographic fairness, particularly under domain shifts, remains underexplored. We fine-tune PPG-GPT, a transformer-based foundation model pretrained on intensive care unit (ICU) data, across three heterogeneous datasets (ICU, wearable, smartphone) and systematically evaluate the effects on HR prediction accuracy and gender fairness. While fine-tuning substantially reduces mean absolute error (up to 80%), it can simultaneously widen fairness gaps, especially in larger models and under significant shifts in distributional characteristics. To address this, we introduce FairTune, a bias-aware fine-tuning framework in which we benchmark three mitigation strategies: class weighting based on inverse group frequency (IF), Group Distributionally Robust Optimization (GroupDRO), and adversarial debiasing (ADV). We find that IF and GroupDRO significantly reduce fairness gaps without compromising accuracy, with effectiveness varying by deployment domain. Representation analyses further reveal that mitigation techniques reshape internal embeddings to reduce demographic clustering. Our findings highlight that fairness does not emerge as a natural byproduct of fine-tuning and that explicit mitigation is essential for equitable deployment of physiological foundation models.
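
Illustrative note on the IF strategy: weighting by inverse group frequency gives each sample a loss weight inversely proportional to the size of its demographic group, so under-represented groups contribute as much to the objective as large ones. The abstract does not give the exact normalization; the sketch below assumes N / (K * count), the same convention as scikit-learn's "balanced" class weights, and the function name and group labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def inverse_frequency_weights(groups: List[Hashable]) -> Dict[Hashable, float]:
    """Map each group label to N / (K * count_g).

    N is the number of samples and K the number of distinct groups, so
    perfectly balanced groups all receive a weight of 1.0.
    """
    counts = Counter(groups)
    n_samples = len(groups)
    n_groups = len(counts)
    return {g: n_samples / (n_groups * c) for g, c in counts.items()}


# Toy batch with one sample from group "F" and three from group "M".
groups = ["F", "M", "M", "M"]
weights = inverse_frequency_weights(groups)
sample_weights = [weights[g] for g in groups]  # multiply per-sample losses by these
```

Here the single "F" sample gets weight 2.0 and each "M" sample gets 2/3, so both groups carry the same total weight and the weights still sum to the batch size.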

Title: A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Authors: Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16499
Pdf URL: https://arxiv.org/pdf/2509.16499
Copy Paste: [[2509.16499]] A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective(https://arxiv.org/abs/2509.16499)
Keywords: diffusion
Abstract: The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse -- a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.

Title: RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation

Authors: Tianyi Yan, Wencheng Han, Xia Zhou, Xueyang Zhang, Kun Zhan, Cheng-zhong Xu, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16500
Pdf URL: https://arxiv.org/pdf/2509.16500
Copy Paste: [[2509.16500]] RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation(https://arxiv.org/abs/2509.16500)
Keywords: diffusion
Abstract: Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF). RLGF uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components include an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion, and a Hierarchical Geometric Reward (HGR) system providing multi-level rewards for point-line-plane alignment and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21%, Depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.

Title: OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution

Authors: Hanting Li, Huaao Tang, Jianhong Han, Tianxiong Zhou, Jiulong Cui, Haizhen Xie, Yan Chen, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16507
Pdf URL: https://arxiv.org/pdf/2509.16507
Copy Paste: [[2509.16507]] OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution(https://arxiv.org/abs/2509.16507)
Keywords: diffusion
Abstract: Recently, latent diffusion models have demonstrated promising performance in the real-world video super-resolution (VSR) task, reconstructing high-quality videos from distorted low-resolution input through multiple diffusion steps. Compared to image super-resolution (ISR), VSR methods need to process each frame in a video, which poses challenges to their inference efficiency. However, video quality and inference efficiency have always been a trade-off for diffusion-based VSR methods. In this work, we propose a One-Step Diffusion model for real-world Video Super-Resolution, namely OS-DiffVSR. Specifically, we devise a novel adjacent frame adversarial training paradigm, which can significantly improve the quality of synthetic videos. Besides, we devise a multi-frame fusion mechanism to maintain inter-frame temporal consistency and reduce the flicker in video. Extensive experiments on several popular VSR benchmarks demonstrate that OS-DiffVSR can even achieve better quality than existing diffusion-based VSR methods that require dozens of sampling steps.

Title: SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging

Authors: Haijin Zeng, Xuan Lu, Yurong Zhang, Yongyong Chen, Jingyong Su, Jie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16509
Pdf URL: https://arxiv.org/pdf/2509.16509
Copy Paste: [[2509.16509]] SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging(https://arxiv.org/abs/2509.16509)
Keywords: self-supervised
Abstract: Humans learn in two complementary ways: a slow, cumulative process that builds broad, general knowledge, and a fast, on-the-fly process that captures specific experiences. Existing deep-unfolding methods for spectral compressive imaging (SCI) mirror only the slow component, relying on heavy pre-training with many unfolding stages, yet they lack the rapid adaptation needed to handle new optical configurations. As a result, they falter on out-of-distribution cameras, especially in bespoke spectral setups unseen during training. This depth also incurs heavy computation and slow inference. To bridge this gap, we introduce SlowFast-SCI, a dual-speed framework seamlessly integrated into any deep unfolding network beyond SCI systems. During slow learning, we pre-train or reuse a priors-based backbone and distill it via imaging guidance into a compact fast-unfolding model. In the fast learning stage, lightweight adaptation modules are embedded within each block and trained self-supervised at test time via a dual-domain loss, without retraining the backbone. To the best of our knowledge, SlowFast-SCI is the first test-time adaptation-driven deep unfolding framework for efficient, self-adaptive spectral reconstruction. Its dual-stage design unites offline robustness with on-the-fly per-sample calibration, yielding over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and a 4x faster adaptation speed. In addition, its modularity integrates with any deep-unfolding network, paving the way for self-adaptive, field-deployable imaging and expanded computational imaging modalities. Code and models are available at this https URL.

Title: FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers

Authors: Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar
Subjects: cs.CV, cs.AR
Abstract URL: https://arxiv.org/abs/2509.16518
Pdf URL: https://arxiv.org/pdf/2509.16518
Copy Paste: [[2509.16518]] FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers(https://arxiv.org/abs/2509.16518)
Keywords: diffusion
Abstract: Generating realistic videos with diffusion transformers demands significant computation, with attention layers the central bottleneck; even producing a short clip requires running a transformer over a very long sequence of embeddings, e.g., more than 30K embeddings for a 5-second video, incurring significant latency. Prior work aims to mitigate this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, these works typically rely on block-sparse attention, which skips score computation only when all entries in a block of attention scores (corresponding to M queries and M keys, with M = 64 typically) are zero. This coarse-granular skipping of attention scores does not fully exploit sparsity in the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that leverages sparsity at a fine granularity. Unlike block-sparse attention, which skips entire MxM blocks, our approach skips computations at the granularity of Mx1 slices of the attention map. Each slice is produced by query-key dot products between a block of query vectors and a single key. To implement our proposed sparse attention mechanism, we develop a new efficient bulk-load operation called asynchronous-gather load. This load operation gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU's shared memory. Only a sparse set of keys relevant to those queries are loaded into shared memory when computing attention for a block of queries, in contrast to loading full blocks of key tokens in block-sparse attention. Our fine-grained sparse attention, applied to video diffusion models, achieves an average 1.55X (up to 1.65X) speedup for 5 second, 480p videos, and an average 1.41X (up to 1.49X) for 5 second, 720p videos on a single H100 GPU.
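
The difference between block-sparse skipping (whole MxM blocks) and the Mx1 slice skipping described above can be illustrated with a small counting sketch. This is not the authors' kernel: the function name `count_computed`, the boolean `mask` input (True where a score is treated as non-negligible), and the cost model (number of query-key scores evaluated) are all assumptions made for illustration.

```python
def count_computed(mask, m):
    """Count attention scores evaluated under two skipping schemes.

    mask[q][k] is True where the score for query q and key k is kept
    (hypothetical input; how such a mask is obtained is not modeled here).
    Returns (block_sparse_cost, fine_grained_cost).
    """
    n_q, n_k = len(mask), len(mask[0])
    block_cost = 0
    slice_cost = 0
    for q0 in range(0, n_q, m):
        rows = mask[q0:q0 + m]  # one block of up to m queries
        # Block-sparse: an m x m block is skipped only if it is entirely inactive.
        for k0 in range(0, n_k, m):
            if any(any(row[k0:k0 + m]) for row in rows):
                block_cost += len(rows) * min(m, n_k - k0)
        # Fine-grained: decide per m x 1 slice (query block versus a single key).
        for k in range(n_k):
            if any(row[k] for row in rows):
                slice_cost += len(rows)
    return block_cost, slice_cost


# Two isolated active entries in a 4 x 4 attention map, with m = 2.
example_mask = [
    [True, False, False, False],
    [False, False, False, False],
    [False, False, False, False],
    [False, False, False, True],
]
print(count_computed(example_mask, 2))  # prints (8, 4)
```

In this toy case block-sparse attention still evaluates two full 2x2 blocks, while slice-level skipping evaluates only the two 2x1 slices that contain an active entry.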

Title: Efficient Rectified Flow for Image Fusion

Authors: Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, Jinyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16549
Pdf URL: https://arxiv.org/pdf/2509.16549
Copy Paste: [[2509.16549]] Efficient Rectified Flow for Image Fusion(https://arxiv.org/abs/2509.16549)
Keywords: diffusion
Abstract: Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities into a single fused image. In recent years, diffusion models have made significant developments in the field of image fusion. However, diffusion models often require complex computations and long inference times, which reduces the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at this https URL.

Title: ViTCAE: ViT-based Class-conditioned Autoencoder

Authors: Vahid Jebraeeli, Hamid Krim, Derya Cansever
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.16554
Pdf URL: https://arxiv.org/pdf/2509.16554
Copy Paste: [[2509.16554]] ViTCAE: ViT-based Class-conditioned Autoencoder(https://arxiv.org/abs/2509.16554)
Keywords: generative
Abstract: Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head's influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically-grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.

Title: V-CECE: Visual Counterfactual Explanations via Conceptual Edits

Authors: Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16567
Pdf URL: https://arxiv.org/pdf/2509.16567
Copy Paste: [[2509.16567]] V-CECE: Visual Counterfactual Explanations via Conceptual Edits(https://arxiv.org/abs/2509.16567)
Keywords: diffusion
Abstract: Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.

Title: A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis

Authors: Antonio Scardace, Lemuel Puglisi, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16582
Pdf URL: https://arxiv.org/pdf/2509.16582
Copy Paste: [[2509.16582]] A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis(https://arxiv.org/abs/2509.16582)
Keywords: diffusion, self-supervised, generative
Abstract: Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: this https URL.
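
The training objective described above, where the cosine similarity between two embeddings is pushed toward the SSIM score of the corresponding image pair, can be sketched as a simple regression loss. This is a minimal illustration, not the authors' implementation: the use of a mean squared error, the function names, and the assumption that SSIM targets are precomputed elsewhere are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ssim_matching_loss(embedding_pairs, ssim_targets):
    """Mean squared gap between embedding cosine similarity and target SSIM.

    embedding_pairs: list of (u, v) embedding vectors for image pairs.
    ssim_targets: SSIM scores for the same pairs, computed in image space
                  (assumed to be precomputed; not implemented here).
    """
    errors = [
        (cosine_similarity(u, v) - target) ** 2
        for (u, v), target in zip(embedding_pairs, ssim_targets)
    ]
    return sum(errors) / len(errors)


pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
targets = [1.0, 0.5]
print(ssim_matching_loss(pairs, targets))
```

In an actual training loop this quantity would be minimized with respect to the embedding network's parameters, so that similarity in the learned space tracks structural similarity in image space.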

Title: Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs

Authors: Yukuan Wei, Xudong Li, Lin F. Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.16586
Pdf URL: https://arxiv.org/pdf/2509.16586
Copy Paste: [[2509.16586]] Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs(https://arxiv.org/abs/2509.16586)
Keywords: generative
Abstract: Recent advances have significantly improved our understanding of the sample complexity of learning in average-reward Markov decision processes (AMDPs) under the generative model. However, much less is known about the constrained average-reward MDP (CAMDP), where policies must satisfy long-run average constraints. In this work, we address this gap by studying the sample complexity of learning an $\epsilon$-optimal policy in CAMDPs under a generative model. We propose a model-based algorithm that operates under two settings: (i) relaxed feasibility, which allows small constraint violations, and (ii) strict feasibility, where the output policy satisfies the constraint. We show that our algorithm achieves sample complexities of $\tilde{O}\left(\frac{S A (B+H)}{ \epsilon^2}\right)$ and $\tilde{O} \left(\frac{S A (B+H)}{\epsilon^2 \zeta^2} \right)$ under the relaxed and strict feasibility settings, respectively. Here, $\zeta$ is the Slater constant indicating the size of the feasible region, $H$ is the span bound of the bias function, and $B$ is the transient time bound. Moreover, a matching lower bound of $\tilde{\Omega}\left(\frac{S A (B+H)}{ \epsilon^2\zeta^2}\right)$ for the strict feasibility case is established, thus providing the first minimax-optimal bounds for CAMDPs. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.

Title: SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

Authors: Haiming Zhang, Yiyao Zhu, Wending Zhou, Xu Yan, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2509.16588
Pdf URL: https://arxiv.org/pdf/2509.16588
Copy Paste: [[2509.16588]] SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving(https://arxiv.org/abs/2509.16588)
Keywords: self-supervised
Abstract: Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).

Title: FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection

Authors: Minji Heo, Simon S. Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16602
Pdf URL: https://arxiv.org/pdf/2509.16602
Copy Paste: [[2509.16602]] FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection(https://arxiv.org/abs/2509.16602)
Keywords: diffusion
Abstract: Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as Face-Swapping, GAN-based generation, and Diffusion methods, can pose an emerging and unforeseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single manipulations, little is known about how detection models behave under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce FakeChain, a large-scale benchmark comprising 1-, 2-, and 3-Step forgeries synthesized using five state-of-the-art representative generators. Using this approach, we analyze detection performance and spectral properties across hybrid manipulation at different steps, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance highly depends on the final manipulation type, with F1-score dropping by up to 58.83% when it differs from the training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results highlight the importance of benchmarks such as FakeChain, reflecting growing synthesis complexity and diversity in real-world scenarios. Our sample code is available at this https URL.

Title: Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model

Authors: David Kreismann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16617
Pdf URL: https://arxiv.org/pdf/2509.16617
Copy Paste: [[2509.16617]] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model(https://arxiv.org/abs/2509.16617)
Keywords: foundation model
Abstract: As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data. However, predictive analytics methods based on conventional machine learning models and limited data infrastructure often provide inaccurate predictions, especially in underserved areas. In this context, geospatial foundation models trained on unstructured global data demonstrate strong generalization and require minimal fine-tuning, offering an alternative for predictions where traditional approaches are limited. This study fine-tunes a geospatial foundation model to predict urban land surface temperatures under future climate scenarios and explores its response to land cover changes using simulated vegetation strategies. The fine-tuned model achieved pixel-wise downscaling errors below 1.74 °C and aligned with ground truth patterns, demonstrating an extrapolation capacity up to 3.62 °C.

Title: Self-Supervised Learning of Graph Representations for Network Intrusion Detection

Authors: Lorenzo Guerra, Thomas Chapuis, Guillaume Duc, Pavlo Mozharovskyi, Van-Tam Nguyen
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2509.16625
Pdf URL: https://arxiv.org/pdf/2509.16625
Copy Paste: [[2509.16625]] Self-Supervised Learning of Graph Representations for Network Intrusion Detection(https://arxiv.org/abs/2509.16625)
Keywords: self-supervised, anomaly
Abstract: Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.
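
The inference rule described above, flagging flows whose reconstruction error is unusually high, can be sketched as follows. This is an illustrative sketch only: the abstract does not say how "unusually high" is decided, so the quantile-based threshold fitted on benign errors, the mean squared error, and all function names are assumptions.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an embedding and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def fit_threshold(benign_errors, quantile=0.99):
    """Pick a threshold from errors observed on benign traffic.

    The quantile rule is a hypothetical choice, not taken from the abstract.
    """
    ordered = sorted(benign_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def flag_intrusions(errors, threshold):
    """Return True for each flow whose error exceeds the threshold."""
    return [error > threshold for error in errors]


benign = [0.1, 0.2, 0.3, 0.4]
threshold = fit_threshold(benign, quantile=0.75)
print(flag_intrusions([0.05, 0.5], threshold))  # prints [False, True]
```

Because the model is trained only on normal communication patterns, flows it reconstructs poorly are the ones treated as potential intrusions; the threshold controls the trade-off between missed attacks and false alarms.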

Title: Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation

Authors: Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Zhifeng Li, Wei Liu, Linfeng Zhang, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16630
Pdf URL: https://arxiv.org/pdf/2509.16630
Copy Paste: [[2509.16630]] Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation(https://arxiv.org/abs/2509.16630)
Keywords: diffusion
Abstract: We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: expression-aware landmarks used as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation, and introduce a Taylor-interpolated cache, achieving a 2.6X lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset and benchmark will be available at this https URL.

Title: Unlocking Hidden Potential in Point Cloud Networks with Attention-Guided Grouping-Feature Coordination

Authors: Shangzhuo Xie, Qianqian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16639
Pdf URL: https://arxiv.org/pdf/2509.16639
Copy Paste: [[2509.16639]] Unlocking Hidden Potential in Point Cloud Networks with Attention-Guided Grouping-Feature Coordination(https://arxiv.org/abs/2509.16639)
Keywords: self-supervised
Abstract: Point cloud analysis has evolved with diverse network architectures, while existing works predominantly focus on introducing novel structural designs. However, conventional point-based architectures, which process raw points through sequential sampling, grouping, and feature extraction layers, demonstrate underutilized potential. We notice that substantial performance gains can be unlocked through strategic module integration rather than structural modifications. In this paper, we propose the Grouping-Feature Coordination Module (GF-Core), a lightweight separable component that simultaneously regulates both the grouping layer and the feature extraction layer to enable more nuanced feature aggregation. Besides, we introduce a self-supervised pretraining strategy specifically tailored for point-based inputs to enhance model robustness in complex point cloud analysis scenarios. On the ModelNet40 dataset, our method elevates baseline networks to 94.0% accuracy, matching advanced frameworks' performance while preserving architectural simplicity. On three variants of the ScanObjectNN dataset, we obtain improvements of 2.96%, 6.34%, and 6.32% respectively.

Title: InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention

Authors: Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16691
Pdf URL: https://arxiv.org/pdf/2509.16691
Copy Paste: [[2509.16691]] InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention(https://arxiv.org/abs/2509.16691)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaptation to existing DiT-based T2I models through lightweight LoRA modules. Additionally, we propose Denselayout, a comprehensive benchmark for layout-to-image generation containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules.

Title: Animalbooth: multimodal feature enhancement for animal subject personalization

Authors: Chen Liu, Haitao Wu, Kafeng Wang, Xiaowang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16702
Pdf URL: https://arxiv.org/pdf/2509.16702
Copy Paste: [[2509.16702]] Animalbooth: multimodal feature enhancement for animal subject personalization(https://arxiv.org/abs/2509.16702)
Keywords: diffusion
Abstract: Personalized animal image generation is challenging due to rich appearance cues and large morphological variability. Existing approaches often exhibit feature misalignment across domains, which leads to identity drift. We present AnimalBooth, a framework that strengthens identity preservation with an Animal Net and an adaptive attention module, mitigating cross-domain alignment errors. We further introduce a frequency-controlled feature integration module that applies Discrete Cosine Transform filtering in the latent space to guide the diffusion process, enabling a coarse-to-fine progression from global structure to detailed texture. To advance research in this area, we curate AnimalBench, a high-resolution dataset for animal personalization. Extensive experiments show that AnimalBooth consistently outperforms strong baselines on multiple benchmarks and improves both identity fidelity and perceptual quality.
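
The idea of Discrete Cosine Transform filtering that moves from coarse structure to fine detail can be illustrated with a one-dimensional low-pass filter. This is a toy sketch, not the module from the paper: it works on a plain list rather than a latent tensor, and the hard cutoff `keep` and all function names are assumptions.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def low_pass(signal, keep):
    """Keep only the `keep` lowest-frequency DCT coefficients."""
    coeffs = dct(signal)
    filtered = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(filtered)


feature = [1.0, 3.0, 2.0, 4.0]
coarse = low_pass(feature, keep=1)  # global structure only: the mean value
fine = low_pass(feature, keep=4)    # all frequencies: the original signal
print(coarse, fine)
```

Raising `keep` over successive steps admits progressively higher frequencies, which is one simple way to realize a coarse-to-fine schedule.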

Title: A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse

Authors: Xiaohan Ding, Kaike Ping, Buse Çarık, Eugenia Rho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16722
Pdf URL: https://arxiv.org/pdf/2509.16722
Copy Paste: [[2509.16722]] A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse(https://arxiv.org/abs/2509.16722)
Keywords: generative
Abstract: Understanding causal language in informal discourse is a core yet underexplored challenge in NLP. Existing datasets largely focus on explicit causality in structured text, providing limited support for detecting implicit causal expressions, particularly those found in informal, user-generated social media posts. We introduce CausalTalk, a multi-level dataset of five years of Reddit posts (2020-2024) discussing public health related to the COVID-19 pandemic, among which 10120 posts are annotated across four causal tasks: (1) binary causal classification, (2) explicit vs. implicit causality, (3) cause-effect span extraction, and (4) causal gist generation. Annotations comprise both gold-standard labels created by domain experts and silver-standard labels generated by GPT-4o and verified by human annotators. CausalTalk bridges fine-grained causal detection and gist-based reasoning over informal text. It enables benchmarking across both discriminative and generative models, and provides a rich resource for studying causal reasoning in social media contexts.

Title: Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment

Authors: Xin Lei Lin, Soroush Mehraban, Abhishek Moturu, Babak Taati
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16727
Pdf URL: https://arxiv.org/pdf/2509.16727
Copy Paste: [[2509.16727]] Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment(https://arxiv.org/abs/2509.16727)
Keywords: diffusion, generative
Abstract: Automated pain assessment from facial expressions is crucial for non-communicative patients, such as those with dementia. Progress has been limited by two challenges: (i) existing datasets exhibit severe demographic and label imbalance due to ethical constraints, and (ii) current generative models cannot precisely control facial action units (AUs), facial structure, or clinically validated pain levels. We present 3DPain, a large-scale synthetic dataset specifically designed for automated pain assessment, featuring unprecedented annotation richness and demographic diversity. Our three-stage framework generates diverse 3D meshes, textures them with diffusion models, and applies AU-driven face rigging to synthesize multi-view faces with paired neutral and pain images, AU configurations, PSPI scores, and the first dataset-level annotations of pain-region heatmaps. The dataset comprises 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity. We further introduce ViTPain, a Vision Transformer based cross-modal distillation framework in which a heatmap-trained teacher guides a student trained on RGB images, enhancing accuracy, interpretability, and clinical reliability. Together, 3DPain and ViTPain establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment.

Title: Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees

Authors: Yuchen Liang, Yingbin Liang, Lifeng Lai, Ness Shroff
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2509.16756
Pdf URL: https://arxiv.org/pdf/2509.16756
Copy Paste: [[2509.16756]] Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees(https://arxiv.org/abs/2509.16756)
Keywords: diffusion
Abstract: Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $\tau$-leaping samplers have become particularly popular due to their empirical success. However, existing theoretical analyses of $\tau$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $\tau$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $\tau$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.

Title: DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images

Authors: Ozgur Kara, Harris Nisar, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16767
Pdf URL: https://arxiv.org/pdf/2509.16767
Copy Paste: [[2509.16767]] DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images(https://arxiv.org/abs/2509.16767)
Keywords: diffusion
Abstract: Numerous models have been developed for scanpath and saliency prediction. They are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, while the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: this https URL

Title: MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation

Authors: Omid Bonakdar, Nasser Mozayani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16768
Pdf URL: https://arxiv.org/pdf/2509.16768
Copy Paste: [[2509.16768]] MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation(https://arxiv.org/abs/2509.16768)
Keywords: generative
Abstract: Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines give the user no control over which objects are separated or over how the model imagines the occluded parts during the isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. In the next step, a generative model generates isolated images of each object, using the initial image and the previous step's prompts as supervision (the prompts control the pose and guide how the model imagines previously occluded areas). Each of those images then enters the multi-view generation stage, where a number of consistent images from different views are generated. Finally, a reconstruction model converts each of these multi-view images into a 3D model.

Title: Looking in the mirror: A faithful counterfactual explanation method for interpreting deep image classification models

Authors: Townim Faisal Chowdhury, Vu Minh Hieu Phan, Kewen Liao, Nanyu Dong, Minh-Son To, Anton Hengel, Johan Verjans, Zhibin Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16822
Pdf URL: https://arxiv.org/pdf/2509.16822
Copy Paste: [[2509.16822]] Looking in the mirror: A faithful counterfactual explanation method for interpreting deep image classification models(https://arxiv.org/abs/2509.16822)
Keywords: generative
Abstract: Counterfactual explanations (CFE) for deep image classifiers aim to reveal how minimal input changes lead to different model decisions, providing critical insights for model interpretation and improvement. However, existing CFE methods often rely on additional image encoders and generative models to create plausible images, neglecting the classifier's own feature space and decision boundaries. As such, they do not explain the intrinsic feature space and decision boundaries learned by the classifier. To address this limitation, we propose Mirror-CFE, a novel method that generates faithful counterfactual explanations by operating directly in the classifier's feature space, treating decision boundaries as mirrors that "reflect" feature representations. Mirror-CFE learns a mapping function from feature space to image space while preserving distance relationships, enabling smooth transitions between source images and their counterfactuals. Through extensive experiments on four image datasets, we demonstrate that Mirror-CFE achieves superior performance in validity while maintaining input resemblance compared to state-of-the-art explanation methods. Finally, Mirror-CFE provides interpretable visualization of the classifier's decision process by generating step-wise transitions that reveal how features evolve as classification confidence changes.

Title: $\mathtt{M^3VIR}$: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation

Authors: Yuanzhi Li, Lebin Zhou, Nam Ling, Zhenghao Chen, Wei Wang, Wei Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16873
Pdf URL: https://arxiv.org/pdf/2509.16873
Copy Paste: [[2509.16873]] $\mathtt{M^3VIR}$: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation(https://arxiv.org/abs/2509.16873)
Keywords: generative
Abstract: The gaming and entertainment industry is rapidly evolving, driven by immersive experiences and the integration of generative AI (GAI) technologies. Training such models effectively requires large-scale datasets that capture the diversity and context of gaming environments. However, existing datasets are often limited to specific domains or rely on artificial degradations, which do not accurately capture the unique characteristics of gaming content. Moreover, benchmarks for controllable video generation remain absent. To address these limitations, we introduce $\mathtt{M^3VIR}$, a large-scale, multi-modal, multi-view dataset specifically designed to overcome the shortcomings of current resources. Unlike existing datasets, $\mathtt{M^3VIR}$ provides diverse, high-fidelity gaming content rendered with Unreal Engine 5, offering authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. It includes $\mathtt{M^3VIR\_MR}$ for super-resolution (SR), novel view synthesis (NVS), and combined NVS+SR tasks, and $\mathtt{M^3VIR\_{MS}}$, the first multi-style, object-level ground-truth set enabling research on controlled video generation. Additionally, we benchmark several state-of-the-art SR and NVS methods to establish performance baselines. While no existing approaches directly handle controlled video generation, $\mathtt{M^3VIR}$ provides a benchmark for advancing this area. By releasing the dataset, we aim to facilitate research in AI-powered restoration, compression, and controllable content generation for next-generation cloud gaming and entertainment.

Title: PRISM: Precision-Recall Informed Data-Free Knowledge Distillation via Generative Diffusion

Authors: Xuewan He, Jielei Wang, Zihan Cheng, Yuchen Su, Shiyue Huang, Guoming Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16897
Pdf URL: https://arxiv.org/pdf/2509.16897
Copy Paste: [[2509.16897]] PRISM: Precision-Recall Informed Data-Free Knowledge Distillation via Generative Diffusion(https://arxiv.org/abs/2509.16897)
Keywords: diffusion, generative
Abstract: Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. While existing methods perform well on small-scale images, they suffer from mode collapse when synthesizing large-scale images, resulting in limited knowledge transfer. Recently, leveraging advanced generative models to synthesize photorealistic images has emerged as a promising alternative. Nevertheless, directly using off-the-shelf diffusion to generate datasets faces the precision-recall challenges: 1) ensuring synthetic data aligns with the real distribution, and 2) ensuring coverage of the real ID manifold. In response, we propose PRISM, a precision-recall informed synthesis method. Specifically, we introduce Energy-guided Distribution Alignment to avoid the generation of out-of-distribution samples, and design the Diversified Prompt Engineering to enhance coverage of the real ID manifold. Extensive experiments on various large-scale image datasets demonstrate the superiority of PRISM. Moreover, we demonstrate that models trained with PRISM exhibit strong domain generalization.

Title: Parameter-efficient fine-tuning (PEFT) of Vision Foundation Models for Atypical Mitotic Figure Classification

Authors: Lavish Ramchandani, Gunjan Deotale, Dev Kumar Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16935
Pdf URL: https://arxiv.org/pdf/2509.16935
Copy Paste: [[2509.16935]] Parameter-efficient fine-tuning (PEFT) of Vision Foundation Models for Atypical Mitotic Figure Classification(https://arxiv.org/abs/2509.16935)
Keywords: foundation model
Abstract: Atypical mitotic figures (AMFs) are rare abnormal cell divisions associated with tumor aggressiveness and poor prognosis. Their detection remains a significant challenge due to subtle morphological cues, class imbalance, and inter-observer variability among pathologists. The MIDOG 2025 challenge introduced a dedicated track for atypical mitosis classification, enabling systematic evaluation of deep learning methods. In this study, we investigated the use of large vision foundation models, including Virchow, Virchow2, and UNI, with Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We conducted extensive experiments with different LoRA ranks, as well as random and group-based data splits, to analyze robustness under varied conditions. Our best approach, Virchow with LoRA rank 8 and ensemble of three-fold cross-validation, achieved a balanced accuracy of 88.37% on the preliminary test set, ranking joint 9th in the challenge leaderboard. These results highlight the promise of foundation models with efficient adaptation strategies for the classification of atypical mitosis, while underscoring the need for improvements in specificity and domain generalization.
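
The abstract above reports balanced accuracy, a metric suited to the class imbalance it mentions. As a minimal sketch of how that metric is commonly defined (the mean of per-class recall), the following uses made-up toy labels; the function name and the data are illustrative assumptions and are not taken from the paper or the MIDOG evaluation code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 1 = atypical (rare), 0 = normal (common).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("plain accuracy:   ", plain)
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

In this toy case one of the two rare-class samples is missed, which plain accuracy barely registers while balanced accuracy drops noticeably.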

Title: VidCLearn: A Continual Learning Approach for Text-to-Video Generation

Authors: Luca Zanchetta, Lorenzo Papa, Luca Maiano, Irene Amerini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16956
Pdf URL: https://arxiv.org/pdf/2509.16956
Copy Paste: [[2509.16956]] VidCLearn: A Continual Learning Approach for Text-to-Video Generation(https://arxiv.org/abs/2509.16956)
Keywords: diffusion, generative
Abstract: Text-to-video generation is an emerging field in generative AI, enabling the creation of realistic, semantically accurate videos from text prompts. While current models achieve impressive visual quality and alignment with input text, they typically rely on static knowledge, making it difficult to incorporate new data without retraining from scratch. To address this limitation, we propose VidCLearn, a continual learning framework for diffusion-based text-to-video generation. VidCLearn features a student-teacher architecture where the student model is incrementally updated with new text-video pairs, and the teacher model helps preserve previously learned knowledge through generative replay. Additionally, we introduce a novel temporal consistency loss to enhance motion smoothness and a video retrieval module to provide structural guidance at inference. Our architecture is also designed to be more computationally efficient than existing models while retaining satisfactory generation performance. Experimental results show VidCLearn's superiority over baseline methods in terms of visual quality, semantic alignment, and temporal coherence.

Title: Penalizing Boundary Activation for Object Completeness in Diffusion Models

Authors: Haoyang Xu, Tianhao Zhao, Sibei Yang, Yutian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16968
Pdf URL: https://arxiv.org/pdf/2509.16968
Copy Paste: [[2509.16968]] Penalizing Boundary Activation for Object Completeness in Diffusion Models(https://arxiv.org/abs/2509.16968)
Keywords: diffusion
Abstract: Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation in these models is the incomplete display of objects, where fragments or missing parts undermine the model's performance in downstream applications. In this study, we conduct an in-depth analysis of the incompleteness issue and reveal that the primary factor behind incomplete object generation is the usage of RandomCrop during model training. Although this widely used data augmentation method enhances model generalization ability, it disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
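
The abstract describes penalizing activation values at image boundaries but does not give the penalty's form or say how it enters the denoising loop. Purely as an illustration of the general idea, the sketch below scores a 2D activation map by its mean absolute value inside a border band. The function name, the band width, and the choice of mean absolute value are all assumptions, not the authors' method.

```python
def boundary_penalty(act, width=1):
    """Mean absolute activation inside a border band of the given width.

    `act` is a 2D list (rows of floats). This is a hypothetical score;
    the paper's actual penalty is not specified in the abstract.
    """
    h, w = len(act), len(act[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            on_border = i < width or i >= h - width or j < width or j >= w - width
            if on_border:
                total += abs(act[i][j])
                count += 1
    return total / count


# Toy maps: activation concentrated in the center vs. touching the border.
centered = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 5.0, 0.0],
    [0.0, 5.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
touching = [[1.0] * 4 for _ in range(4)]

print(boundary_penalty(centered))
print(boundary_penalty(touching))
```

A score like this is zero when the activation stays away from the border and grows as activation accumulates there, which is the behavior a boundary penalty would need.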

Title: VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation

Authors: Feng Han, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16986
Pdf URL: https://arxiv.org/pdf/2509.16986
Copy Paste: [[2509.16986]] VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation(https://arxiv.org/abs/2509.16986)
Keywords: diffusion
Abstract: Recently, autoregressive image generation models have wowed audiences with their remarkable capability in creating surprisingly realistic images. Models such as GPT-4o and LlamaGen can not only produce images that faithfully mimic renowned artistic styles like Ghibli, Van Gogh, or Picasso, but also potentially generate Not-Safe-For-Work (NSFW) content, raising significant concerns regarding copyright infringement and ethical use. Despite these concerns, methods to safeguard autoregressive text-to-image models remain underexplored. Previous concept erasure methods, primarily designed for diffusion models that operate in denoising latent space, are not directly applicable to autoregressive models that generate images token by token. To address this critical gap, we propose Visual Contrast Exploitation (VCE), a novel framework comprising: (1) an innovative contrastive image pair construction paradigm that precisely decouples unsafe concepts from their associated content semantics, and (2) a sophisticated DPO-based training approach that enhances the model's ability to identify and leverage visual contrastive features from image pairs, enabling precise concept erasure. Our comprehensive experiments across three challenging tasks (artist style erasure, explicit content erasure, and object removal) demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts. The code and models are available at this https URL.

Title: Advancing Speech Understanding in Speech-Aware Language Models with GRPO

Authors: Avishai Elmakies, Hagai Aronowitz, Nimrod Shabtay, Eli Schwartz, Ron Hoory, Avihu Dekel
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.16990
Pdf URL: https://arxiv.org/pdf/2509.16990
Copy Paste: [[2509.16990]] Advancing Speech Understanding in Speech-Aware Language Models with GRPO(https://arxiv.org/abs/2509.16990)
Keywords: generative
Abstract: In this paper, we introduce a Group Relative Policy Optimization (GRPO)-based method for training Speech-Aware Large Language Models (SALLMs) on open-format speech understanding tasks, such as Spoken Question Answering and Automatic Speech Translation. SALLMs have proven highly effective for speech understanding tasks. GRPO has recently gained traction for its efficiency in training LLMs, and prior work has explored its application to SALLMs, primarily in multiple-choice tasks. Building on this, we focus on open-format tasks that better reflect the generative abilities of the models. Our approach leverages GRPO with BLEU as the reward signal to optimize SALLMs, and we demonstrate empirically that it surpasses standard SFT across several key metrics. Finally, we explore the potential of incorporating off-policy samples within GRPO for these tasks, highlighting avenues for further improvement and research.
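
The abstract names GRPO with BLEU as the reward signal. As a rough sketch of the group-relative step only, the code below assumes the commonly described GRPO normalization: each sampled response's reward minus the group mean, divided by the group standard deviation. The reward values are made-up placeholders standing in for BLEU scores, and the function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions rather than details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumed form: (r_i - mean) / (std + eps). `eps` only guards against
    a zero standard deviation when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical BLEU-like rewards for four sampled responses to one prompt.
rewards = [0.10, 0.40, 0.25, 0.25]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate learned value model is needed to provide a baseline.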

Title: When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration

Authors: Wenxuan Fang, Jili Fan, Chao Wang, Xiantao Hu, Jiangwei Weng, Ying Tai, Jian Yang, Jun Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17024
Pdf URL: https://arxiv.org/pdf/2509.17024
Copy Paste: [[2509.17024]] When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration(https://arxiv.org/abs/2509.17024)
Keywords: diffusion
Abstract: Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose LCDiff, a novel framework comprising two key components: Lumina-Chroma Decomposition Network (LCDN) and Lumina-Guided Diffusion Model (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a Dynamic Time Step Loss to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: this https URL.
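
The abstract relies on splitting an image into luminance (Y) and chrominance (Cb, Cr) in the YCbCr color space. The sketch below shows one standard RGB to YCbCr conversion, assuming the full-range BT.601 coefficients used by JPEG/JFIF; the abstract does not say which YCbCr variant the authors use. The "haze" example is a toy case (the same constant added to every channel), not the paper's degradation model.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 (JPEG-style) YCbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Toy degradation: add the same gray offset to every channel.
clean = (200, 100, 50)
hazy = tuple(c + 40 for c in clean)

print("clean:", rgb_to_ycbcr(*clean))
print("hazy: ", rgb_to_ycbcr(*hazy))
```

Because the Cb and Cr coefficients each sum to zero, a uniform additive offset changes only Y in this toy case, which gives some intuition for treating luminance and chrominance separately.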

Title: Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition

Authors: Junhao Jia, Yunyou Liu, Yifei Sun, Huangwei Chen, Feiwei Qin, Changmiao Wang, Yong Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17050
Pdf URL: https://arxiv.org/pdf/2509.17050
Copy Paste: [[2509.17050]] Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition(https://arxiv.org/abs/2509.17050)
Keywords: diffusion
Abstract: Nonlinear manifolds are widespread in deep visual features, where Euclidean distances often fail to capture true similarity. This limitation becomes particularly severe in prototype-based interpretable fine-grained recognition, where subtle semantic distinctions are essential. To address this challenge, we propose a novel paradigm for prototype-based recognition that anchors similarity within the intrinsic geometry of deep features. Specifically, we distill the latent manifold structure of each class into a diffusion space and introduce a differentiable Nyström interpolation, making the geometry accessible to both unseen samples and learnable prototypes. To ensure efficiency, we employ compact per-class landmark sets with periodic updates. This design keeps the embedding aligned with the evolving backbone, enabling fast and scalable inference. Extensive experiments on the CUB-200-2011 and Stanford Cars datasets show that our GeoProto framework produces prototypes focusing on semantically aligned parts, significantly outperforming Euclidean prototype networks.

Title: TSGym: Design Choices for Deep Multivariate Time-Series Forecasting

Authors: Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Minqi Jiang, Songqiao Han, Hailiang Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17063
Pdf URL: https://arxiv.org/pdf/2509.17063
Copy Paste: [[2509.17063]] TSGym: Design Choices for Deep Multivariate Time-Series Forecasting(https://arxiv.org/abs/2509.17063)
Keywords: foundation model
Abstract: Recently, deep learning has driven significant advancements in multivariate time series forecasting (MTSF) tasks. However, much of the current research in MTSF tends to evaluate models from a holistic perspective, which obscures the individual contributions and leaves critical issues unaddressed. Adhering to the current modeling paradigms, this work bridges these gaps by systematically decomposing deep MTSF methods into their core, fine-grained components like series-patching tokenization, channel-independent strategy, attention modules, or even Large Language Models and Time-series Foundation Models. Through extensive experiments and component-level analysis, our work offers more profound insights than previous benchmarks that typically discuss models as a whole. Furthermore, we propose a novel automated solution called TSGym for MTSF tasks. Unlike traditional hyperparameter tuning, neural architecture searching or fixed model selection, TSGym performs fine-grained component selection and automated model construction, which enables the creation of more effective solutions tailored to diverse time series data, therefore enhancing model transferability across different data sources and robustness against distribution shifts. Extensive experiments indicate that TSGym significantly outperforms existing state-of-the-art MTSF and AutoML methods. All code is publicly available on this https URL.

Title: Informative Text-Image Alignment for Visual Affordance Learning with Foundation Models

Authors: Qian Zhang, Lin Zhang, Xing Fang, Mingxin Zhang, Zhiyuan Wei, Ran Song, Wei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17074
Pdf URL: https://arxiv.org/pdf/2509.17074
Copy Paste: [[2509.17074]] Informative Text-Image Alignment for Visual Affordance Learning with Foundation Models(https://arxiv.org/abs/2509.17074)
Keywords: foundation model
Abstract: Visual affordance learning is crucial for robots to understand and interact effectively with the physical world. Recent advances in this field attempt to leverage pre-trained knowledge of vision-language foundation models to learn affordance properties with limited training data, providing a novel paradigm for visual affordance learning. However, these methods overlook the significance of maintaining feature alignment between visual images and language descriptions for identifying affordance areas with textual guidance, and thus may lead to suboptimal results. In this paper, we present an informative framework for text-guided affordance learning, which involves information-based constraints to achieve text-image alignment at feature level. Specifically, we design an affordance mutual information constraint that helps learn appropriate textual prompts and task-oriented visual features simultaneously by maximizing the mutual information between the features of the affordance areas in the input images and the corresponding textual prompts. In addition, we propose an object-level information constraint that maximizes the mutual information between the visual features of a given object and the text features of the category it belongs to. This enables the model to capture high-quality representations for the object, providing more reliable semantic priors for identifying affordance regions. Experimental results on the AGD20K dataset show that the proposed method outperforms existing approaches and achieves the new state-of-the-art in one-shot affordance learning.

Title: AlignedGen: Aligning Style Across Generated Images

Authors: Jiexuan Zhang, Yiheng Du, Qian Wang, Weiqi Li, Yu Gu, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17088
Pdf URL: https://arxiv.org/pdf/2509.17088
Copy Paste: [[2509.17088]] AlignedGen: Aligning Style Across Generated Images(https://arxiv.org/abs/2509.17088)
Keywords: diffusion, generative
Abstract: Despite their generative power, diffusion models struggle to maintain style consistency across images conditioned on the same style prompt, hindering their practical deployment in creative workflows. While several training-free methods attempt to solve this, they are constrained to the U-Net architecture, which not only leads to low-quality results and artifacts like object repetition but also renders them incompatible with superior Diffusion Transformer (DiT). To address these issues, we introduce AlignedGen, a novel training-free framework that enhances style consistency across images generated by DiT models. Our work first reveals a critical insight: naive attention sharing fails in DiT due to conflicting positional signals from improper position embeddings. We introduce Shifted Position Embedding (ShiftPE), an effective solution that resolves this conflict by allocating a non-overlapping set of positional indices to each image. Building on this foundation, we develop Advanced Attention Sharing (AAS), a suite of three techniques meticulously designed to fully unleash the potential of attention sharing within the DiT. Furthermore, to broaden the applicability of our method, we present an efficient query, key, and value feature extraction algorithm, enabling our method to seamlessly incorporate external images as style references. Extensive experimental results validate that our method effectively enhances style consistency across generated images while maintaining precise text-to-image alignment.

Title: Uncertainty-Supervised Interpretable and Robust Evidential Segmentation

Authors: Yuzhu Li, An Sui, Fuping Wu, Xiahai Zhuang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17098
Pdf URL: https://arxiv.org/pdf/2509.17098
Copy Paste: [[2509.17098]] Uncertainty-Supervised Interpretable and Robust Evidential Segmentation(https://arxiv.org/abs/2509.17098)
Keywords: self-supervised
Abstract: Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via this https URL.

Title: ScenGAN: Attention-Intensive Generative Model for Uncertainty-Aware Renewable Scenario Forecasting

Authors: Yifei Wu, Bo Wang, Jingshi Cui, Pei-chun Lin, Junzo Watada
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17119
Pdf URL: https://arxiv.org/pdf/2509.17119
Copy Paste: [[2509.17119]] ScenGAN: Attention-Intensive Generative Model for Uncertainty-Aware Renewable Scenario Forecasting(https://arxiv.org/abs/2509.17119)
Keywords: generative
Abstract: To address the intermittency of renewable energy source (RES) generation, scenario forecasting offers a series of stochastic realizations for predictive objects with superior flexibility and direct views. Based on a long time-series perspective, this paper explores uncertainties in the realms of renewable power and deep learning. Then, an uncertainty-aware model is meticulously designed for renewable scenario forecasting, which leverages an attention mechanism and generative adversarial networks (GANs) to precisely capture complex spatial-temporal dynamics. To improve the interpretability of uncertain behavior in RES generation, Bayesian deep learning and adaptive instance normalization (AdaIN) are incorporated to simulate typical patterns and variations. Additionally, the integration of meteorological information, forecasts, and historical trajectories in the processing layer improves the synergistic forecasting capability for multiscale periodic regularities. Numerical experiments and case analyses demonstrate that the proposed approach provides an appropriate interpretation for renewable uncertainty representation, including both aleatoric and epistemic uncertainties, and shows superior performance over state-of-the-art methods.

Title: Stencil: Subject-Driven Generation with Context Guidance

Authors: Gordon Chen, Ziqi Huang, Cheston Tan, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17120
Pdf URL: https://arxiv.org/pdf/2509.17120
Copy Paste: [[2509.17120]] Stencil: Subject-Driven Generation with Context Guidance(https://arxiv.org/abs/2509.17120)
Keywords: diffusion
Abstract: Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. One major limitation of current fine-tuning approaches is the inherent trade-off between quality and efficiency. Fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models improves efficiency but compromises image fidelity. Moreover, fine-tuning pre-trained models on a small set of images of the subject can damage the existing priors, resulting in suboptimal results. To this end, we present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.

Title: SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction

Authors: Djamel Eddine Boukhari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17172
Pdf URL: https://arxiv.org/pdf/2509.17172
Copy Paste: [[2509.17172]] SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction(https://arxiv.org/abs/2509.17172)
Keywords: diffusion, generative
Abstract: The automated prediction of facial beauty is a benchmark task in affective computing that requires a sophisticated understanding of both local aesthetic details (e.g., skin texture) and global facial harmony (e.g., symmetry, proportions). Existing models, based on either Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), exhibit inherent architectural biases that limit their performance; CNNs excel at local feature extraction but struggle with long-range dependencies, while ViTs model global relationships at a significant computational cost. This paper introduces the Mamba-Diffusion Network (MD-Net), a novel dual-stream architecture that resolves this trade-off by delegating specialized roles to state-of-the-art models. The first stream leverages a frozen U-Net encoder from a pre-trained latent diffusion model, providing a powerful generative prior for fine-grained aesthetic qualities. The second stream employs a Vision Mamba (Vim), a modern state-space model, to efficiently capture global facial structure with linear-time complexity. By synergistically integrating these complementary representations through a cross-attention mechanism, MD-Net creates a holistic and nuanced feature space for prediction. Evaluated on the SCUT-FBP5500 benchmark, MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of 0.9235 and demonstrating the significant potential of hybrid architectures that fuse generative and sequential modeling paradigms for complex visual assessment tasks.
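
The headline number in this abstract is a Pearson correlation between predicted and ground-truth scores. For readers unfamiliar with the metric, here is a minimal sketch of the standard sample Pearson correlation coefficient. The score lists are made-up placeholders and the function name is an illustrative assumption; nothing here comes from the paper's evaluation code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical ground-truth ratings and model predictions.
truth = [2.1, 3.4, 2.8, 4.0, 3.1]
preds = [2.3, 3.2, 2.9, 3.8, 3.3]
print(pearson(truth, preds))
```

The coefficient ranges from -1 to 1 and measures linear agreement only, so it is insensitive to a constant offset or positive rescaling of the predictions.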

Title: Ambiguous Medical Image Segmentation Using Diffusion Schrödinger Bridge

Authors: Lalith Bharadwaj Baru, Kamalaker Dadi, Tapabrata Chakraborti, Raju S. Bapi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17187
Pdf URL: https://arxiv.org/pdf/2509.17187
Copy Paste: [[2509.17187]] Ambiguous Medical Image Segmentation Using Diffusion Schrödinger Bridge(https://arxiv.org/abs/2509.17187)
Keywords: diffusion
Abstract: Accurate segmentation of medical images is challenging due to unclear lesion boundaries and mask variability. We introduce Segmentation Schrödinger Bridge (SSB), the first application of Schrödinger Bridge for ambiguous medical image segmentation, modelling joint image-mask dynamics to enhance performance. SSB preserves structural integrity, delineates unclear boundaries without additional guidance, and maintains diversity using a novel loss function. We further propose the Diversity Divergence Index ($D_{DDI}$) to quantify inter-rater variability, capturing both diversity and consensus. SSB achieves state-of-the-art performance on LIDC-IDRI, COCA, and RACER (in-house) datasets.

Title: Echo-Path: Pathology-Conditioned Echo Video Generation

Authors: Kabir Hamzah Muhammad, Marawan Elbatel, Yi Qin, Xiaomeng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17190
Pdf URL: https://arxiv.org/pdf/2509.17190
Copy Paste: [[2509.17190]] Echo-Path: Pathology-Conditioned Echo Video Generation(https://arxiv.org/abs/2509.17190)
Keywords: generative
Abstract: Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data, and augmenting real training sets with the synthetic data improves downstream diagnosis of ASD and PAH by 7% and 8%, respectively. Code, weights and dataset are available at this https URL

Title: SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing

Authors: Junlong Ke, Qiying Hu, Shenghai Yuan, Yuecong Xu, Jianfei Yang
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2509.17197
Pdf URL: https://arxiv.org/pdf/2509.17197
Copy Paste: [[2509.17197]] SignalLLM: A General-Purpose LLM Agent Framework for Automated Signal Processing(https://arxiv.org/abs/2509.17197)
Keywords: in-context
Abstract: Modern signal processing (SP) pipelines, whether model-based or data-driven, are often constrained by complex and fragmented workflows, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.

Title: Conditional Policy Generator for Dynamic Constraint Satisfaction and Optimization

Authors: Wook Lee, Frans A. Oliehoek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17205
Pdf URL: https://arxiv.org/pdf/2509.17205
Copy Paste: [[2509.17205]] Conditional Policy Generator for Dynamic Constraint Satisfaction and Optimization(https://arxiv.org/abs/2509.17205)
Keywords: generative
Abstract: Leveraging machine learning methods to solve constraint satisfaction problems has shown promise, but such methods are mostly limited to a static situation where the problem description is completely known and fixed from the beginning. In this work we present a new approach to constraint satisfaction and optimization in dynamically changing environments, particularly when variables in the problem are statistically independent. We frame it as a reinforcement learning problem and introduce a conditional policy generator by borrowing the idea of class conditional generative adversarial networks (GANs). Assuming that the problem includes both static and dynamic constraints, the former are used in a reward formulation to guide the policy training such that it learns to map from a noise prior to a probability distribution of solutions satisfying static constraints, which is similar to a generator in GANs. On the other hand, dynamic constraints in the problem are encoded to different class labels and fed with the input noise. The policy is then simultaneously updated for maximum likelihood of correctly classifying given the dynamic conditions in a supervised manner. We empirically demonstrate a proof-of-principle experiment with a multi-modal constraint satisfaction problem and compare the unconditional and conditional cases.

Title: Guided and Unguided Conditional Diffusion Mechanisms for Structured and Semantically-Aware 3D Point Cloud Generation

Authors: Gunner Stone, Sushmita Sarker, Alireza Tavakkoli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17206
Pdf URL: https://arxiv.org/pdf/2509.17206
Copy Paste: [[2509.17206]] Guided and Unguided Conditional Diffusion Mechanisms for Structured and Semantically-Aware 3D Point Cloud Generation(https://arxiv.org/abs/2509.17206)
Keywords: diffusion, generative
Abstract: Generating realistic 3D point clouds is a fundamental problem in computer vision with applications in remote sensing, robotics, and digital object modeling. Existing generative approaches primarily capture geometry, and when semantics are considered, they are typically imposed post hoc through external segmentation or clustering rather than integrated into the generative process itself. We propose a diffusion-based framework that embeds per-point semantic conditioning directly within generation. Each point is associated with a conditional variable corresponding to its semantic label, which guides the diffusion dynamics and enables the joint synthesis of geometry and semantics. This design produces point clouds that are both structurally coherent and segmentation-aware, with object parts explicitly represented during synthesis. Through a comparative analysis of guided and unguided diffusion processes, we demonstrate the significant impact of conditional variables on diffusion dynamics and generation quality. Extensive experiments validate the efficacy of our approach, producing detailed and accurate 3D point clouds tailored to specific parts and features.

Title: DT-NeRF: A Diffusion and Transformer-Based Optimization Approach for Neural Radiance Fields in 3D Reconstruction

Authors: Bo Liu, Runlong Li, Li Zhou, Yan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17232
Pdf URL: https://arxiv.org/pdf/2509.17232
Copy Paste: [[2509.17232]] DT-NeRF: A Diffusion and Transformer-Based Optimization Approach for Neural Radiance Fields in 3D Reconstruction(https://arxiv.org/abs/2509.17232)
Keywords: diffusion, generative
Abstract: This paper proposes a Diffusion Model-Optimized Neural Radiance Field (DT-NeRF) method, aimed at enhancing detail recovery and multi-view consistency in 3D scene reconstruction. By combining diffusion models with Transformers, DT-NeRF effectively restores details under sparse viewpoints and maintains high accuracy in complex geometric scenes. Experimental results demonstrate that DT-NeRF significantly outperforms traditional NeRF and other state-of-the-art methods on the Matterport3D and ShapeNet datasets, particularly in metrics such as PSNR, SSIM, Chamfer Distance, and Fidelity. Ablation experiments further confirm the critical role of the diffusion and Transformer modules in the model's performance, with the removal of either module leading to a decline in performance. The design of DT-NeRF showcases the synergistic effect between modules, providing an efficient and accurate solution for 3D scene reconstruction. Future research may focus on further optimizing the model, exploring more advanced generative models and network architectures to enhance its performance in large-scale dynamic scenes.

Title: Prospective Multi-Graph Cohesion for Multivariate Time Series Anomaly Detection

Authors: Jiazhen Chen, Mingbin Feng, Tony S. Wirjanto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17235
Pdf URL: https://arxiv.org/pdf/2509.17235
Copy Paste: [[2509.17235]] Prospective Multi-Graph Cohesion for Multivariate Time Series Anomaly Detection(https://arxiv.org/abs/2509.17235)
Keywords: anomaly
Abstract: Anomaly detection in high-dimensional time series data is pivotal for numerous industrial applications. Recent advances in multivariate time series anomaly detection (TSAD) have increasingly leveraged graph structures to model inter-variable relationships, typically employing Graph Neural Networks (GNNs). Despite their promising results, existing methods often rely on a single graph representation, which is insufficient for capturing the complex, diverse relationships inherent in multivariate time series. To address this, we propose the Prospective Multi-Graph Cohesion (PMGC) framework for multivariate TSAD. PMGC exploits spatial correlations by integrating a long-term static graph with a series of short-term instance-wise dynamic graphs, regulated through a graph cohesion loss function. Our theoretical analysis shows that this loss function promotes diversity among dynamic graphs while aligning them with the stable long-term relationships encapsulated by the static graph. Additionally, we introduce a "prospective graphing" strategy to mitigate the limitations of traditional forecasting-based TSAD methods, which often struggle with unpredictable future variations. This strategy allows the model to accurately reflect concurrent inter-series relationships under normal conditions, thereby enhancing anomaly detection efficacy. Empirical evaluations on real-world datasets demonstrate the superior performance of our method compared to existing TSAD techniques.

Title: SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Authors: Ranran Huang, Krystian Mikolajczyk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17246
Pdf URL: https://arxiv.org/pdf/2509.17246
Copy Paste: [[2509.17246]] SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views(https://arxiv.org/abs/2509.17246)
Keywords: self-supervised
Abstract: We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets. Code and pretrained models will be available on our project page: this https URL.

Title: Graph Signal Generative Diffusion Models

Authors: Yigit Berkay Uslu, Samar Hadou, Sergio Rozada, Shirin Saeedi Bidokhti, Alejandro Ribeiro
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2509.17250
Pdf URL: https://arxiv.org/pdf/2509.17250
Copy Paste: [[2509.17250]] Graph Signal Generative Diffusion Models(https://arxiv.org/abs/2509.17250)
Keywords: diffusion, generative
Abstract: We introduce U-shaped encoder-decoder graph neural networks (U-GNNs) for stochastic graph signal generation using denoising diffusion processes. The architecture learns node features at different resolutions with skip connections between the encoder and decoder paths, analogous to the convolutional U-Net for image generation. The U-GNN is notable for a pooling operation that leverages zero-padding and avoids arbitrary graph coarsening, with graph convolutions layered on top to capture local dependencies. This technique permits learning feature embeddings for sampled nodes at deeper levels of the architecture that remain convolutional with respect to the original graph. Applied to stock price prediction -- where deterministic forecasts struggle to capture uncertainties and tail events that are paramount -- we demonstrate the effectiveness of the diffusion model in probabilistic forecasting of stock prices.

Title: GraphWeave: Interpretable and Robust Graph Generation via Random Walk Trajectories

Authors: Rahul Nandakumar, Deepayan Chakrabarti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17291
Pdf URL: https://arxiv.org/pdf/2509.17291
Copy Paste: [[2509.17291]] GraphWeave: Interpretable and Robust Graph Generation via Random Walk Trajectories(https://arxiv.org/abs/2509.17291)
Keywords: diffusion
Abstract: Given a set of graphs from some unknown family, we want to generate new graphs from that family. Recent methods use diffusion on either graph embeddings or the discrete space of nodes and edges. However, simple changes to embeddings (say, adding noise) can mean uninterpretable changes in the graph. In discrete-space diffusion, each step may add or remove many nodes/edges. It is hard to predict what graph patterns we will observe after many diffusion steps. Our proposed method, called GraphWeave, takes a different approach. We separate pattern generation and graph construction. To find patterns in the training graphs, we see how they transform vectors during random walks. We then generate new graphs in two steps. First, we generate realistic random walk "trajectories" which match the learned patterns. Then, we find the optimal graph that fits these trajectories. The optimization infers all edges jointly, which improves robustness to errors. On four simulated and five real-world benchmark datasets, GraphWeave outperforms existing methods. The most significant differences are on large-scale graph structures such as PageRank, cuts, communities, degree distributions, and flows. GraphWeave is also 10x faster than its closest competitor. Finally, GraphWeave is simple, needing only a transformer and standard optimizers.

Title: DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking

Authors: Buyin Deng, Lingxin Huang, Kai Luo, Fei Teng, Kailun Yang
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2509.17323
Pdf URL: https://arxiv.org/pdf/2509.17323
Copy Paste: [[2509.17323]] DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking(https://arxiv.org/abs/2509.17323)
Keywords: foundation model
Abstract: Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at this https URL.

Title: Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

Authors: Haoyang Chen, Kumiko Tanaka-Ishii
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17367
Pdf URL: https://arxiv.org/pdf/2509.17367
Copy Paste: [[2509.17367]] Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs(https://arxiv.org/abs/2509.17367)
Keywords: generative
Abstract: We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $\beta$ (vocabulary growth), Taylor's exponent $\alpha$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $\beta$) and higher term consistency (higher $\alpha$) than general texts. Within the legal domain, statutory codes have the lowest $\beta$ and highest $\alpha$, reflecting strict drafting conventions, while cases and deeds show higher $\beta$ and lower $\alpha$. In contrast, GPT-generated text shows statistics more closely aligned with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
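To make the vocabulary-growth metric concrete: Heaps' law models the number of distinct tokens after n tokens as V(n) ≈ K·n^β, so β can be read off as the slope of log V(n) against log n. The sketch below is only an illustration under that standard definition; the function names are hypothetical, and the abstract does not say how the authors tokenize text or fit the exponent.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: the distinct-token count after the first n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def heaps_exponent(tokens):
    """Ordinary least-squares slope of log V(n) versus log n.

    Assumes at least two tokens; this fitting procedure is an
    illustrative choice, not necessarily the one used in the paper.
    """
    pts = [(math.log(n), math.log(v)) for n, v in vocabulary_growth(tokens)]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

Under this definition, a text that never repeats a token gives a slope of 1, and a text that repeats a single token gives a slope of 0; a lower β therefore corresponds to the slower vocabulary growth the abstract reports for legal texts.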

Title: Diff-GNSS: Diffusion-based Pseudorange Error Estimation

Authors: Jiaqi Zhu, Shouyi Lu, Ziyao Li, Guirong Zhuo, Lu Xiong
Subjects: cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2509.17397
Pdf URL: https://arxiv.org/pdf/2509.17397
Copy Paste: [[2509.17397]] Diff-GNSS: Diffusion-based Pseudorange Error Estimation(https://arxiv.org/abs/2509.17397)
Keywords: diffusion, generative
Abstract: Global Navigation Satellite Systems (GNSS) are vital for reliable urban positioning. However, multipath and non-line-of-sight reception often introduce large measurement errors that degrade accuracy. Learning-based methods for predicting and compensating pseudorange errors have gained traction, but their performance is limited by complex error distributions. To address this challenge, we propose Diff-GNSS, a coarse-to-fine GNSS measurement (pseudorange) error estimation framework that leverages a conditional diffusion model to capture such complex distributions. Firstly, a Mamba-based module performs coarse estimation to provide an initial prediction with appropriate scale and trend. Then, a conditional denoising diffusion layer refines the estimate, enabling fine-grained modeling of pseudorange errors. To suppress uncontrolled generative diversity and achieve controllable synthesis, three key features related to GNSS measurement quality are used as conditions to precisely guide the reverse denoising process. We further incorporate per-satellite uncertainty modeling within the diffusion stage to assess the reliability of the predicted errors. We have collected and publicly released a real-world dataset covering various scenes. Experiments on public and self-collected datasets show that Diff-GNSS consistently outperforms state-of-the-art baselines across multiple metrics. To the best of our knowledge, this is the first application of diffusion models to pseudorange error estimation. The proposed diffusion-based refinement module is plug-and-play and can be readily integrated into existing networks to markedly improve estimation accuracy.

Title: Robust Anomaly Detection Under Normality Distribution Shift in Dynamic Graphs

Authors: Xiaoyang Xu, Xiaofeng Lin, Koh Takeuchi, Kyohei Atarashi, Hisashi Kashima
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17400
Pdf URL: https://arxiv.org/pdf/2509.17400
Copy Paste: [[2509.17400]] Robust Anomaly Detection Under Normality Distribution Shift in Dynamic Graphs(https://arxiv.org/abs/2509.17400)
Keywords: anomaly
Abstract: Anomaly detection in dynamic graphs is a critical task with broad real-world applications, including social networks, e-commerce, and cybersecurity. Most existing methods assume that normal patterns remain stable over time; however, this assumption often fails in practice due to the phenomenon we refer to as normality distribution shift (NDS), where normal behaviors evolve over time. Ignoring NDS can lead models to misclassify shifted normal instances as anomalies, degrading detection performance. To tackle this issue, we propose WhENDS, a novel unsupervised anomaly detection method that aligns normal edge embeddings across time by estimating distributional statistics and applying whitening transformations. Extensive experiments on four widely-used dynamic graph datasets show that WhENDS consistently outperforms nine strong baselines, achieving state-of-the-art results and underscoring the importance of addressing NDS in dynamic graph anomaly detection.
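As a rough illustration of what a whitening transformation does to a batch of embeddings: estimate the mean and covariance, then map the batch so that it has zero mean and (approximately) identity covariance. The sketch below uses a Cholesky factor and a small ridge term `eps`; both are assumptions made for the example, as are all function names. The abstract does not specify which statistics WhENDS estimates or which whitening variant it applies.

```python
import math


def mean_and_cov(rows):
    """Sample mean and (biased, 1/n) covariance of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a, eps=1e-8):
    """Lower-triangular L with L L^T = a + eps * I (eps is a small ridge for stability)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] + eps - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def whiten(rows):
    """Center the rows, then solve L z = (x - mu) for each row by forward substitution."""
    mu, cov = mean_and_cov(rows)
    low = cholesky(cov)
    out = []
    for r in rows:
        centered = [x - m for x, m in zip(r, mu)]
        z = []
        for i in range(len(centered)):
            s = sum(low[i][k] * z[k] for k in range(i))
            z.append((centered[i] - s) / low[i][i])
        out.append(z)
    return out
```

Whitening each time window separately in this way would put embeddings from different windows on a common scale, which is one plausible reading of "aligning normal edge embeddings across time"; the sketch assumes the covariance is well conditioned enough for the Cholesky factorization to succeed.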

Title: Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

Authors: Manish Acharya, David Hyde
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17405
Pdf URL: https://arxiv.org/pdf/2509.17405
Copy Paste: [[2509.17405]] Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization(https://arxiv.org/abs/2509.17405)
Keywords: generative
Abstract: The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead.
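For readers unfamiliar with the baseline that these selectors plug into, the following is a minimal sketch of the plain Monte Carlo sliced Wasserstein estimate: draw directions on the unit sphere, project both point sets onto each direction, and average the one-dimensional Wasserstein costs, which reduce to comparing sorted projections. It uses uniformly random directions, not the QSW or BO-learned direction sets described in the abstract, and it assumes equal sample sizes; all function names are hypothetical.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a unit vector uniformly on the sphere by normalizing Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def wasserstein_1d_pow(a, b, p=2):
    """p-th power of the 1D Wasserstein-p distance between equal-size samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return sum(abs(x - y) ** p for x, y in zip(a_sorted, b_sorted)) / len(a_sorted)


def sliced_wasserstein(xs, ys, n_dirs=64, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two equal-size point sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_dirs):
        theta = random_direction(dim, rng)
        proj_x = [sum(t * c for t, c in zip(theta, pt)) for pt in xs]
        proj_y = [sum(t * c for t, c in zip(theta, pt)) for pt in ys]
        total += wasserstein_1d_pow(proj_x, proj_y, p)
    return (total / n_dirs) ** (1.0 / p)
```

A direction selector of the kind the abstract describes would replace `random_direction` with a learned or low-discrepancy set of directions while leaving the projection and sorting steps unchanged, which is consistent with the claim that no changes to downstream losses are needed.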

Title: Single-Image Depth from Defocus with Coded Aperture and Diffusion Posterior Sampling

Authors: Hodaka Kawachi, Jose Reinaldo Cunha Santos A. V. Silva Neto, Yasushi Yagi, Hajime Nagahara, Tomoya Nakamura
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17427
Pdf URL: https://arxiv.org/pdf/2509.17427
Copy Paste: [[2509.17427]] Single-Image Depth from Defocus with Coded Aperture and Diffusion Posterior Sampling(https://arxiv.org/abs/2509.17427)
Keywords: diffusion
Abstract: We propose a single-snapshot depth-from-defocus (DFD) reconstruction method for coded-aperture imaging that replaces hand-crafted priors with a learned diffusion prior used purely as regularization. Our optimization framework enforces measurement consistency via a differentiable forward model while guiding solutions with the diffusion prior in the denoised image domain, yielding higher accuracy and stability than classical optimization. Unlike U-Net-style regressors, our approach requires no paired defocus-RGBD training data and does not tie training to a specific camera configuration. Experiments on comprehensive simulations and a prototype camera demonstrate consistently strong RGBD reconstructions across noise levels, outperforming both U-Net baselines and a classical coded-aperture DFD method.

Title: Emergent 3D Correspondence from Neural Shape Representation

Authors: Keyu Du, Jingyu Hu, Haipeng Li, Hao Xu, Haibing Huang, Chi-Wing Fu, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17431
Pdf URL: https://arxiv.org/pdf/2509.17431
Copy Paste: [[2509.17431]] Emergent 3D Correspondence from Neural Shape Representation(https://arxiv.org/abs/2509.17431)
Keywords: generative
Abstract: This paper presents a new approach to estimate accurate and robust 3D semantic correspondence with the hierarchical neural semantic representation. Our work has three key contributions. First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features to preserve fine details, by carefully harnessing 3D priors from pre-trained 3D generative models. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature, then iteratively refines it with local geometric features, yielding accurate and semantically-consistent mappings. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories. Our method also supports various applications, such as shape co-segmentation, keypoint matching, and texture transfer, and generalizes well to structurally diverse shapes, with promising results even in cross-category scenarios. Both qualitative and quantitative evaluations show that our method outperforms previous state-of-the-art techniques.

Title: MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses

Authors: Tong Chen, Zimu Wang, Yiyi Miao, Haoran Luo, Yuanfei Sun, Wei Wang, Zhengyong Jiang, Procheta Sen, Jionglong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17436
Pdf URL: https://arxiv.org/pdf/2509.17436
Copy Paste: [[2509.17436]] MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses(https://arxiv.org/abs/2509.17436)
Keywords: in-context
Abstract: Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at this https URL.

Title: Training-Free Label Space Alignment for Universal Domain Adaptation

Authors: Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito, Donghyun Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17452
Pdf URL: https://arxiv.org/pdf/2509.17452
Copy Paste: [[2509.17452]] Training-Free Label Space Alignment for Universal Domain Adaptation(https://arxiv.org/abs/2509.17452)
Keywords: foundation model, generative
Abstract: Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong zero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels -- such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) -- complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA. Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of +7.9% in H-score and +6.1% in H$^3$-score. Furthermore, incorporating self-training further enhances performance and achieves an additional (+1.6%) increment in both H- and H$^3$-scores.

Title: CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Authors: Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17458
Pdf URL: https://arxiv.org/pdf/2509.17458
Copy Paste: [[2509.17458]] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration(https://arxiv.org/abs/2509.17458)
Keywords: diffusion
Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at this https URL.
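Illustration: the general idea of combining exploration (sample several initial noises, keep the best) with optimization (gradient ascent on a reward) can be sketched on a toy problem. This is not the CARINOX implementation: the quadratic `reward`, its gradient, and all parameter values are hypothetical stand-ins, and the reward is applied directly to the noise vector rather than to a generated image.

```python
import random

TARGET = [1.0, -2.0]  # hypothetical optimum of the toy reward


def reward(z):
    """Toy stand-in for a text-image alignment reward (higher is better)."""
    return -sum((a - b) ** 2 for a, b in zip(z, TARGET))


def reward_grad(z):
    """Analytic gradient of the toy reward with respect to z."""
    return [-2.0 * (a - b) for a, b in zip(z, TARGET)]


def explore(num_samples, dim, rng):
    """Exploration: draw several Gaussian noises and keep the best-scoring one."""
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_samples)]
    return max(candidates, key=reward)


def optimize(z, steps=50, lr=0.1):
    """Optimization: plain gradient ascent on the reward, starting from z."""
    for _ in range(steps):
        grad = reward_grad(z)
        z = [a + lr * g for a, g in zip(z, grad)]
    return z


rng = random.Random(0)
z0 = explore(num_samples=8, dim=2, rng=rng)  # good starting point
z_star = optimize(z0)                        # refined noise
```

Exploration supplies a reasonable initialization, which addresses the stalling problem the abstract attributes to optimization alone, while optimization reduces the number of samples that exploration alone would need.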

Title: Periodic Graph-Enhanced Multivariate Time Series Anomaly Detector

Authors: Jia Li, Shiyu Long, Ye Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17472
Pdf URL: https://arxiv.org/pdf/2509.17472
Copy Paste: [[2509.17472]] Periodic Graph-Enhanced Multivariate Time Series Anomaly Detector(https://arxiv.org/abs/2509.17472)
Keywords: anomaly
Abstract: Multivariate time series (MTS) anomaly detection is commonly encountered in various domains such as finance, healthcare, and industrial monitoring. However, existing MTS anomaly detection methods are mostly defined on a static graph structure, which fails to accurately represent the complex spatio-temporal correlations in MTS. To address this issue, this study proposes a Periodic Graph-Enhanced Multivariate Time Series Anomaly Detector (PGMA) built on two ideas: a) designing a periodic time-slot allocation strategy based on the Fast Fourier Transform (FFT), which enables the graph structure to reflect dynamic changes in MTS; b) utilizing a graph neural network and temporal extension convolution to accurately extract the complex spatio-temporal correlations from the reconstructed periodic graphs. Experiments on four datasets from real applications demonstrate that the proposed PGMA outperforms state-of-the-art models in MTS anomaly detection.
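Illustration: one common way to derive periodic time slots from a series is to take the strongest non-DC frequency of its spectrum and cut the series into windows of that period. The sketch below assumes this reading; the abstract does not specify PGMA's actual allocation rule. A naive DFT is used so the example needs only the standard library (an FFT yields the same spectrum faster), and both function names are hypothetical.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-DC frequency."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the DC component
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


def split_into_slots(series, period):
    """Cut the series into consecutive, non-overlapping slots of one period."""
    step = max(1, round(period))
    return [series[i:i + step]
            for i in range(0, len(series) - step + 1, step)]


# Synthetic signal with a known period of 25 samples.
signal = [math.sin(2 * math.pi * t / 25) for t in range(200)]
period = dominant_period(signal)
slots = split_into_slots(signal, period)
```

Each slot could then be used to build one graph, so that the graph structure changes from period to period instead of staying fixed for the whole series.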

Title: Stable Video-Driven Portraits

Authors: Mallikarjun B. R., Fei Yin, Vikram Voleti, Nikita Drobyshev, Maksim Lapin, Aaryaman Vasishta, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17476
Pdf URL: https://arxiv.org/pdf/2509.17476
Copy Paste: [[2509.17476]] Stable Video-Driven Portraits(https://arxiv.org/abs/2509.17476)
Keywords: diffusion
Abstract: Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. While early methods relied on 3D morphable models or feature warping techniques, they often suffered from limited expressivity, temporal inconsistency, and poor generalization to unseen identities or large pose variations. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. In this work, we propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues. To enable robust training without appearance leakage, we adopt cross-identity supervision. To leverage the strong prior from the pretrained diffusion model, our novel architecture introduces minimal new parameters that converge faster and help in better generalization. We introduce spatial-temporal attention mechanisms that allow inter-frame and intra-frame interactions, effectively capturing subtle motions and reducing temporal artifacts. Our model uses history frames to ensure continuity across segments. At inference, we propose a novel signal fusion strategy that balances motion fidelity with identity preservation. Our approach achieves superior temporal consistency and accurate expression control, enabling high-quality, controllable portrait animation suitable for real-world applications.

Title: Multimodal Medical Image Classification via Synergistic Learning Pre-training

Authors: Qinghua Lin, Guang-Hai Liu, Zuoyong Li, Yang Li, Yuting Jiang, Xiang Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17492
Pdf URL: https://arxiv.org/pdf/2509.17492
Copy Paste: [[2509.17492]] Multimodal Medical Image Classification via Synergistic Learning Pre-training(https://arxiv.org/abs/2509.17492)
Keywords: self-supervised
Abstract: Multimodal pathological images are commonly used in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve modality fusion in multimodal images under label scarcity, we propose a novel "pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement self-supervised pretraining, enhancing the baseline model's feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we use separate encoders to extract features from the original modalities and provide a multimodal fusion encoder for the fused modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: this https URL.

Title: Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models

Authors: María Andrea Cruz Blandón, Zakaria Aldeneh, Jie Chi, Maureen de Seyssel
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.17523
Pdf URL: https://arxiv.org/pdf/2509.17523
Copy Paste: [[2509.17523]] Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models(https://arxiv.org/abs/2509.17523)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.

Title: Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A.Subramanian, Alvin Chan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17552
Pdf URL: https://arxiv.org/pdf/2509.17552
Copy Paste: [[2509.17552]] Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning(https://arxiv.org/abs/2509.17552)
Keywords: in-context
Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

Title: Visual Instruction Pretraining for Domain-Specific Foundation Models

Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17562
Pdf URL: https://arxiv.org/pdf/2509.17562
Copy Paste: [[2509.17562]] Visual Instruction Pretraining for Domain-Specific Foundation Models(https://arxiv.org/abs/2509.17562)
Keywords: foundation model
Abstract: Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at this http URL.

Title: MRN: Harnessing 2D Vision Foundation Models for Diagnosing Parkinson's Disease with Limited 3D MR Data

Authors: Ding Shaodong, Liu Ziyang, Zhou Yijun, Liu Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17566
Pdf URL: https://arxiv.org/pdf/2509.17566
Copy Paste: [[2509.17566]] MRN: Harnessing 2D Vision Foundation Models for Diagnosing Parkinson's Disease with Limited 3D MR Data(https://arxiv.org/abs/2509.17566)
Keywords: foundation model
Abstract: The automatic diagnosis of Parkinson's disease is in high clinical demand due to its prevalence and the importance of targeted treatment. Current clinical practice often relies on diagnostic biomarkers in QSM and NM-MRI images. However, the lack of large, high-quality datasets makes training diagnostic models from scratch prone to overfitting. Adapting pre-trained 3D medical models is also challenging, as the diversity of medical imaging leads to mismatches in voxel spacing and modality between pre-training and fine-tuning data. In this paper, we address these challenges by leveraging 2D vision foundation models (VFMs). Specifically, we crop multiple key ROIs from NM and QSM images, process each ROI through separate branches to compress the ROI into a token, and then combine these tokens into a unified patient representation for classification. Within each branch, we use 2D VFMs to encode axial slices of the 3D ROI volume and fuse them into the ROI token, guided by an auxiliary segmentation head that steers the feature extraction toward specific brain nuclei. Additionally, we introduce multi-ROI supervised contrastive learning, which improves diagnostic performance by pulling together representations of patients from the same class while pushing away those from different classes. Our approach achieved first place in the MICCAI 2025 PDCADxFoundation challenge, with an accuracy of 86.0% trained on a dataset of only 300 labeled QSM and NM-MRI scans, outperforming the second-place method by 5.5%. These results highlight the potential of 2D VFMs for clinical analysis of 3D MR images.

Title: From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge

Authors: Lars Heckler-Kram, Ashwin Vaidya, Jan-Hendrik Neudeck, Ulla Scheler, Dick Ameln, Samet Akcay, Paula Ramos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17615
Pdf URL: https://arxiv.org/pdf/2509.17615
Copy Paste: [[2509.17615]] From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge(https://arxiv.org/abs/2509.17615)
Keywords: anomaly
Abstract: Visual anomaly detection is a strongly application-driven field of research. Consequently, the connection between academia and industry is of paramount importance. In this regard, we present the VAND 3.0 Challenge to showcase current progress in anomaly detection across different practical settings whilst addressing critical issues in the field. The challenge hosted two tracks, fostering the development of anomaly detection methods robust against real-world distribution shifts (Category 1) and exploring the capabilities of Vision Language Models within the few-shot regime (Category 2), respectively. The participants' solutions reached significant improvements over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. While for both tracks the progress in large pre-trained vision (language) backbones played a pivotal role for the performance increase, scaling up anomaly detection methods more efficiently needs to be addressed by future research to meet real-time and computational constraints on-site.

Title: OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Authors: Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17627
Pdf URL: https://arxiv.org/pdf/2509.17627
Copy Paste: [[2509.17627]] OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models(https://arxiv.org/abs/2509.17627)
Keywords: diffusion
Abstract: Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline, InsertPipe, which constructs diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.

Title: SISMA: Semantic Face Image Synthesis with Mamba

Authors: Filippo Botti, Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17651
Pdf URL: https://arxiv.org/pdf/2509.17651
Copy Paste: [[2509.17651]] SISMA: Semantic Face Image Synthesis with Mamba(https://arxiv.org/abs/2509.17651)
Keywords: diffusion
Abstract: Diffusion Models have become very popular for Semantic Image Synthesis (SIS) of human faces. Nevertheless, their training and inference are computationally expensive due to the quadratic complexity of attention layers. In this paper, we propose a novel architecture called SISMA, based on the recently proposed Mamba. SISMA generates high-quality samples by controlling their shape using a semantic mask at a reduced computational demand. We validated our approach through comprehensive experiments with CelebAMask-HQ, revealing that our architecture not only achieves a better FID score but also operates at three times the speed of state-of-the-art architectures. This indicates that the proposed design is a viable, lightweight substitute for transformer-based models.

Title: Clothing agnostic Pre-inpainting Virtual Try-ON

Authors: Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Taemin Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17654
Pdf URL: https://arxiv.org/pdf/2509.17654
Copy Paste: [[2509.17654]] Clothing agnostic Pre-inpainting Virtual Try-ON(https://arxiv.org/abs/2509.17654)
Keywords: diffusion
Abstract: With the development of deep learning technology, virtual try-on technology has gained important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa has improved the texture distortion problem of diffusion-based models, but limitations remain: bottom detection is inaccurate and the silhouette of the existing clothing persists in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing agnostic Pre-inpainting Virtual Try-ON). CaP-VTON improves the naturalness and consistency of whole-body clothing synthesis by integrating multi-category masking based on Dress Code and skin inpainting based on Stable Diffusion. In particular, a generate skin module was introduced to solve the skin restoration problem that occurs when long-sleeved images are converted into short-sleeved or sleeveless ones, and high-quality restoration was implemented considering the human body posture and color. As a result, CaP-VTON achieved 92.5% short-sleeved synthesis accuracy, 15.4% higher than Leffa, and consistently reproduced the style and shape of reference clothing in visual evaluation. These structures maintain model-agnostic properties and are applicable to various diffusion-based virtual inspection systems, and can contribute to applications that require high-precision virtual wearing, such as e-commerce, custom styling, and avatar creation.

Title: Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study

Authors: Yikun Ma, Bo Li, Ying Chen, Zijie Yue, Shuchang Xu, Jingyao Li, Lei Ma, Liang Zhong, Duowu Zou, Leiming Xu, Yunshi Zhong, Xiaobo Li, Weiqun Ding, Minmin Zhang, Dongli He, Zhenghong Li, Ye Chen, Ye Zhao, Jialong Zhuo, Xiaofen Wu, Lisha Yi, Miaojing Shi, Huihui Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17660
Pdf URL: https://arxiv.org/pdf/2509.17660
Copy Paste: [[2509.17660]] Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study(https://arxiv.org/abs/2509.17660)
Keywords: foundation model
Abstract: The early detection of esophagogastric junction adenocarcinoma (EGJA) is crucial for improving patient prognosis, yet its current diagnosis is highly operator-dependent. This paper aims to make the first attempt to develop an artificial intelligence (AI) foundation model-based method for both screening and staging diagnosis of EGJA using endoscopic images. In this cohort and deep learning study, we conducted a multicentre study across seven Chinese hospitals between December 28, 2016 and December 30, 2024. It comprises 12,302 images from 1,546 patients; 8,249 of them were employed for model training, while the remainder were divided into the held-out (112 patients, 914 images), external (230 patients, 1,539 images), and prospective (198 patients, 1,600 images) test sets for evaluation. The proposed model employs DINOv2 (a vision foundation model) and ResNet50 (a convolutional neural network) to extract features of global appearance and local details of endoscopic images for EGJA staging diagnosis. Our model demonstrates satisfactory performance for EGJA staging diagnosis across three test sets, achieving an accuracy of 0.9256, 0.8895, and 0.8956, respectively. In contrast, among representative AI models, the best one (ResNet50) achieves an accuracy of 0.9125, 0.8382, and 0.8519 on the three test sets, respectively; the expert endoscopists achieve an accuracy of 0.8147 on the held-out test set. Moreover, with the assistance of our model, the overall accuracy for the trainee, competent, and expert endoscopists improves from 0.7035, 0.7350, and 0.8147 to 0.8497, 0.8521, and 0.8696, respectively. To our knowledge, our model is the first application of foundation models for EGJA staging diagnosis and demonstrates great potential in both diagnostic accuracy and efficiency.

Title: Tailored Transformation Invariance for Industrial Anomaly Detection

Authors: Mariette Schönfeld, Wannes Meert, Hendrik Blockeel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17670
Pdf URL: https://arxiv.org/pdf/2509.17670
Copy Paste: [[2509.17670]] Tailored Transformation Invariance for Industrial Anomaly Detection(https://arxiv.org/abs/2509.17670)
Keywords: anomaly
Abstract: Industrial Anomaly Detection (IAD) is a subproblem within Computer Vision Anomaly Detection that has been receiving increasing amounts of attention due to its applicability to real-life scenarios. Recent research has focused on how to extract the most informative features, in contrast to older kNN-based methods that use only pretrained features. These recent methods are, however, much more expensive to train and could complicate real-life application. Careful study of related work with regard to transformation invariance leads to the idea that popular benchmarks require robustness to only minor translations. With this idea we then formulate LWinNN, a local window-based approach that creates a middle ground between kNN-based methods that have either complete or no translation invariance. Our experiments demonstrate that this small change increases accuracy considerably, while simultaneously decreasing both train and test time. This teaches us two things: first, the gap between kNN-based approaches and more complex state-of-the-art methodology can still be narrowed by effective usage of the limited data available. Second, our assumption of requiring only limited translation invariance highlights potential areas of interest for future work and the need for more spatially diverse benchmarks, for which our method can hopefully serve as a new baseline. Our code can be found at this https URL.
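Illustration: the "middle ground" between complete and no translation invariance can be pictured as a nearest-neighbour search restricted to a small spatial window. The sketch below is one hedged reading of that idea, not the authors' LWinNN code: the function names, the nested-list feature maps, and the squared-distance score are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_window_scores(test_map, train_maps, window=1):
    """Per-location anomaly score: distance to the nearest training feature
    located within +/- `window` positions of the same location."""
    height, width = len(test_map), len(test_map[0])
    scores = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            best = float("inf")
            for train in train_maps:
                for di in range(-window, window + 1):
                    for dj in range(-window, window + 1):
                        y, x = i + di, j + dj
                        if 0 <= y < height and 0 <= x < width:
                            best = min(best, sq_dist(test_map[i][j], train[y][x]))
            scores[i][j] = best
    return scores


# Toy 3x3 map with one-dimensional features (hypothetical data).
train = [[[10.0 * r + c] for c in range(3)] for r in range(3)]
# Test map: the training map shifted right by one column (a minor translation).
shifted = [[train[r][max(c - 1, 0)] for c in range(3)] for r in range(3)]

strict = local_window_scores(shifted, [train], window=0)  # no invariance
local = local_window_scores(shifted, [train], window=1)   # limited invariance
```

With `window=0` the one-pixel shift already produces non-zero scores, while `window=1` absorbs it; a window covering the whole map would correspond to complete translation invariance.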

Title: DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning

Authors: ThankGod Egbe, Peng Wang, Zhihao Guo, Zidong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.17684
Pdf URL: https://arxiv.org/pdf/2509.17684
Copy Paste: [[2509.17684]] DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning(https://arxiv.org/abs/2509.17684)
Keywords: diffusion, self-supervised
Abstract: This paper evaluates DINOv3, a recent large-scale self-supervised vision backbone, for visuomotor diffusion policy learning in robotic manipulation. We investigate whether a purely self-supervised encoder can match or surpass conventional supervised ImageNet-pretrained backbones (e.g., ResNet-18) under three regimes: training from scratch, frozen, and finetuned. Across four benchmark tasks (Push-T, Lift, Can, Square) using a unified FiLM-conditioned diffusion policy, we find that (i) finetuned DINOv3 matches or exceeds ResNet-18 on several tasks, (ii) frozen DINOv3 remains competitive, indicating strong transferable priors, and (iii) self-supervised features improve sample efficiency and robustness. These results support self-supervised large visual models as effective, generalizable perceptual front-ends for action diffusion policies, motivating further exploration of scalable label-free pretraining in robotic manipulation. Compared to using ResNet-18 as a backbone, our approach with DINOv3 achieves up to a 10% absolute increase in test-time success rates on challenging tasks such as Can, and on-par performance on tasks like Lift, Push-T, and Square.

Title: A Generative Conditional Distribution Equality Testing Framework and Its Minimax Analysis

Authors: Siming Zheng, Meifang Lan, Tong Wang, Yuanyuan Lin
Subjects: cs.LG, math.ST, stat.ME
Abstract URL: https://arxiv.org/abs/2509.17729
Pdf URL: https://arxiv.org/pdf/2509.17729
Copy Paste: [[2509.17729]] A Generative Conditional Distribution Equality Testing Framework and Its Minimax Analysis(https://arxiv.org/abs/2509.17729)
Keywords: generative
Abstract: In this paper, we propose a general framework for testing the equality of the conditional distributions in a two-sample problem. This problem is most relevant to transfer learning under covariate shift. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional distribution testing problem into an unconditional one. We introduce two special tests: the generative permutation-based conditional distribution equality test and the generative classification accuracy-based conditional distribution equality test. Theoretically, we establish a minimax lower bound for statistical inference in testing the equality of two conditional distributions under certain smoothness conditions. We demonstrate that the generative permutation-based conditional distribution equality test and its modified version can attain this lower bound precisely or up to some iterated logarithmic factor. Moreover, we prove the testing consistency of the generative classification accuracy-based conditional distribution equality test. We also establish the convergence rate for the learned conditional generator by deriving new results related to the recently-developed offset Rademacher complexity and approximation properties using neural networks. Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach.

Title: Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics

Authors: Kavin R V, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17737
Pdf URL: https://arxiv.org/pdf/2509.17737
Copy Paste: [[2509.17737]] Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics(https://arxiv.org/abs/2509.17737)
Keywords: generative
Abstract: Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA), as well as a biomedical domain-specific benchmark (BC5CDR) using BioBERT. Our findings demonstrate that representing tokens compositionally via ASG achieves extreme compression in embedding parameters (0.4-0.5%) while maintaining >95% task performance relative to the base model, even in generative tasks, and that it extends to both cross-lingual transfer and domain-specific settings. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a simple yet concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling compact yet semantically rich models.
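Illustration: in product quantization, an embedding is split into sub-vectors, and each sub-vector is replaced by the index of a centroid in a small shared codebook. The sketch below shows only the lookup side and a rough parameter tally; it is not the ASG implementation, and the codebooks, token codes, and sizes are hypothetical. The tally counts only floating-point codebook entries, which is an assumption about how compression is measured.

```python
def pq_lookup(codes, codebooks):
    """Rebuild one embedding by concatenating one centroid per subspace."""
    vec = []
    for subspace, index in enumerate(codes):
        vec.extend(codebooks[subspace][index])
    return vec


def param_ratio(vocab_size, dim, num_centroids):
    """Float parameters in the codebooks relative to a full embedding table.

    Codebooks hold num_centroids * dim floats in total (summed over all
    subspaces); the full table holds vocab_size * dim floats.
    """
    return (num_centroids * dim) / (vocab_size * dim)


# Hypothetical toy setup: 2 subspaces, 2 centroids each, sub-dimension 2.
codebooks = [
    [[0.0, 0.1], [1.0, 1.1]],  # centroids for the first half of the vector
    [[2.0, 2.1], [3.0, 3.1]],  # centroids for the second half
]
token_codes = {"cat": (0, 1), "dog": (1, 1)}  # tokens share centroid 1 in subspace 2

cat_vec = pq_lookup(token_codes["cat"], codebooks)
dog_vec = pq_lookup(token_codes["dog"], codebooks)
ratio = param_ratio(vocab_size=100_000, dim=768, num_centroids=500)
```

Here the two tokens reuse the same second-half centroid, which is the sense in which tokens become combinations of shared building blocks; with these made-up sizes the codebooks hold 0.5% of the floats in the full table.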

Title: WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification

Authors: Yiwen Jiang, Deval Mehta, Siyuan Yan, Yaling Shen, Zimu Wang, Zongyuan Ge
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.17740
Pdf URL: https://arxiv.org/pdf/2509.17740
Copy Paste: [[2509.17740]] WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification(https://arxiv.org/abs/2509.17740)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.

Title: GEM-T: Generative Tabular Data via Fitting Moments

Authors: Miao Li, Phuc Nguyen, Christopher Tam, Alexandra Morgan, Kenneth Ge, Rahul Bansal, Linzi Yu, Rima Arnaout, Ramy Arnaout
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.17752
Pdf URL: https://arxiv.org/pdf/2509.17752
Copy Paste: [[2509.17752]] GEM-T: Generative Tabular Data via Fitting Moments(https://arxiv.org/abs/2509.17752)
Keywords: generative
Abstract: Tabular data dominates data science but poses challenges for generative models, especially when the data is limited or sensitive. We present a novel approach to generating synthetic tabular data based on the principle of maximum entropy -- MaxEnt -- called GEM-T, for "generative entropy maximization for tables." GEM-T directly captures nth-order interactions -- pairwise, third-order, etc. -- among columns of training data. In extensive testing, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art in 23 of 34 publicly available datasets representing diverse subject domains (68%). Notably, GEM-T involves orders-of-magnitude fewer trainable parameters, demonstrating that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, provided that the input data is appropriately transformed first. Furthermore, MaxEnt better handles heterogeneous data types (continuous vs. discrete vs. categorical), lack of local structure, and other features of tabular data. GEM-T represents a promising direction for light-weight high-performance generative models for structured data.

Title: Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance

Authors: Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2509.17757
Pdf URL: https://arxiv.org/pdf/2509.17757
Copy Paste: [[2509.17757]] Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance(https://arxiv.org/abs/2509.17757)
Keywords: diffusion
Abstract: Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.

Title: Qwen3-Omni Technical Report

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
Subjects: cs.CL, cs.AI, cs.CV, eess.AS
Abstract URL: https://arxiv.org/abs/2509.17765
Pdf URL: https://arxiv.org/pdf/2509.17765
Copy Paste: [[2509.17765]] Qwen3-Omni Technical Report(https://arxiv.org/abs/2509.17765)
Keywords: diffusion
Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

Title: I2VWM: Robust Watermarking for Image to Video Generation

Authors: Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17773
Pdf URL: https://arxiv.org/pdf/2509.17773
Copy Paste: [[2509.17773]] I2VWM: Robust Watermarking for Image to Video Generation(https://arxiv.org/abs/2509.17773)
Keywords: diffusion, generative
Abstract: The rapid progress of image-guided video generation (I2V) has raised concerns about its potential misuse in misinformation and fraud, underscoring the urgent need for effective digital watermarking. While existing watermarking methods demonstrate robustness within a single modality, they fail to trace source images in I2V settings. To address this gap, we introduce the concept of Robust Diffusion Distance, which measures the temporal persistence of watermark signals in generated videos. Building on this, we propose I2VWM, a cross-modal watermarking framework designed to enhance watermark robustness across time. I2VWM leverages a video-simulation noise layer during training and employs an optical-flow-based alignment module during inference. Experiments on both open-source and commercial I2V models demonstrate that I2VWM significantly improves robustness while maintaining imperceptibility, establishing a new paradigm for cross-modal watermarking in the era of generative video. Code is released at this https URL.

Title: Elucidating the Design Space of FP4 training

Authors: Robert Hu, Carlo Luschi, Paul Balanca
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17791
Pdf URL: https://arxiv.org/pdf/2509.17791
Copy Paste: [[2509.17791]] Elucidating the Design Space of FP4 training(https://arxiv.org/abs/2509.17791)
Keywords: diffusion, foundation model
Abstract: The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (FP4) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize FP4 training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of FP4 training. We introduce a comprehensive, quantisation gradient-based framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favourable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling and stochastic rounding. We further find that using UE5M3 as a scaling factor potentially offers a good compromise between range and precision with manageable computational overhead.
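Illustrative sketch (not from the paper): stochastic rounding, one of the techniques named in the abstract, rounds a value up or down at random with probabilities chosen so the result is unbiased in expectation. The code below applies it on a uniform grid of spacing `step`, which is a simplifying assumption; a real FP4 format has a non-uniform set of representable values, and the function name `stochastic_round` is hypothetical.

```python
import numpy as np


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, unbiased in expectation.

    A value lying a fraction p of the way between two grid points is
    rounded up with probability p and down with probability 1 - p.
    """
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower          # in [0, 1)
    round_up = rng.random(scaled.shape) < prob_up
    return (lower + round_up) * step


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = stochastic_round(np.full(10_000, 0.3), step=1.0, rng=rng)
    # Every sample is 0.0 or 1.0, but the average stays close to 0.3,
    # whereas round-to-nearest would always return 0.0.
    print("mean of rounded samples:", samples.mean())
```

The design point is that the rounding error averages out over many accumulations, which is why this kind of rounding is commonly paired with very low-precision formats.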

Title: Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training

Authors: Brown Ebouky, Ajad Chhatkuli, Cristiano Malossi, Christoph Studer, Roy Assaf, Andrea Bartezzaghi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17816
Pdf URL: https://arxiv.org/pdf/2509.17816
Copy Paste: [[2509.17816]] Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training(https://arxiv.org/abs/2509.17816)
Keywords: self-supervised, foundation model
Abstract: Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While recent advances have explored parameter-efficient strategies for adapting pre-trained models, extending SSL pre-training itself to new domains - particularly under limited data regimes and for dense prediction tasks - remains underexplored. In this work, we address the problem of adapting vision foundation models to new domains in an unsupervised and data-efficient manner, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules - specifically UniAdapter - while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.

Title: ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17818
Pdf URL: https://arxiv.org/pdf/2509.17818
Copy Paste: [[2509.17818]] ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment(https://arxiv.org/abs/2509.17818)
Keywords: diffusion
Abstract: Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance challenging. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
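Illustrative sketch (not from the paper): the abstract describes enriching the self-attention context by concatenating Key-Value pairs from a reconstruction path and an editing path instead of replacing features. A single-head NumPy version of that general idea is shown below. The function names, the single-head layout, and the absence of any layer selection are assumptions; the authors' DiT implementation is not specified in the abstract.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for one head: q (n, d), k (m, d), v (m, dv)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def enriched_attention(q_edit, k_edit, v_edit, k_rec, v_rec):
    """Editing-path queries attend over keys/values from both paths.

    Nothing is overwritten: reconstruction-path K/V are appended along
    the token axis, so the softmax decides how much each path contributes.
    """
    k = np.concatenate([k_edit, k_rec], axis=0)
    v = np.concatenate([v_edit, v_rec], axis=0)
    return attention(q_edit, k, v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((4, 8))
    k_e, v_e = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
    k_r, v_r = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
    print(enriched_attention(q, k_e, v_e, k_r, v_r).shape)
```

Compared with hard feature replacement, this keeps the editing path's own context available and lets the attention weights blend the two sources per query token.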

Title: Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology

Authors: Saghir Alfasly, Wataru Uegami, MD Enamul Hoq, Ghazal Alabtah, H.R. Tizhoosh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17847
Pdf URL: https://arxiv.org/pdf/2509.17847
Copy Paste: [[2509.17847]] Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology(https://arxiv.org/abs/2509.17847)
Keywords: diffusion, self-supervised, foundation model
Abstract: Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual-conditioning approach combining semantic segmentation maps with tissue-specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20-80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self-supervised extension that clusters whole-slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo-semantic maps for training. Our method synthesizes high-fidelity images with precise region-wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data show competitive performance to those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt-guided synthesis reduces Fréchet Distance (FD) by up to 6x on Camelyon16 (from 430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole-slide images without manual annotations, our framework offers a practical solution for an urgent need for generating diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.

Title: Unsupervised Learning and Representation of Mandarin Tonal Categories by a Generative CNN

Authors: Kai Schenck, Gašper Beguš
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17859
Pdf URL: https://arxiv.org/pdf/2509.17859
Copy Paste: [[2509.17859]] Unsupervised Learning and Representation of Mandarin Tonal Categories by a Generative CNN(https://arxiv.org/abs/2509.17859)
Keywords: generative
Abstract: This paper outlines the methodology for modeling tonal learning in fully unsupervised models of human language acquisition. Tonal patterns are among the computationally most complex learning objectives in language. We argue that a realistic generative model of human language (ciwGAN) can learn to associate its categorical variables with Mandarin Chinese tonal categories without any labeled data. All three trained models showed statistically significant differences in F0 across categorical variables. The model trained solely on male tokens consistently encoded tone. Our results suggest that not only does the model learn Mandarin tonal contrasts, but it learns a system that corresponds to a stage of acquisition in human language learners. We also outline methodology for tracing tonal representations in internal convolutional layers, which shows that linguistic tools can contribute to interpretability of deep learning and can ultimately be used in neural experiments.

Title: Deep Hierarchical Learning with Nested Subspace Networks

Authors: Paulius Rauba, Mihaela van der Schaar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17874
Pdf URL: https://arxiv.org/pdf/2509.17874
Copy Paste: [[2509.17874]] Deep Hierarchical Learning with Nested Subspace Networks(https://arxiv.org/abs/2509.17874)
Keywords: foundation model
Abstract: Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.
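Illustrative sketch (not from the paper): the abstract states that linear layers are re-parameterized so that the function computed at a given rank is nested inside the function at any higher rank. One simple way to obtain such nesting, shown below, is a low-rank factorization in which rank r uses only the first r components. The class name `NestedLowRankLinear`, the random initialisation, and the truncation rule are assumptions for illustration; the authors' parameterization and their uncertainty-aware training objective are not reproduced here.

```python
import numpy as np


class NestedLowRankLinear:
    """Linear map y = x @ (B[:, :r] @ A[:r]).T with a rank chosen at call time."""

    def __init__(self, in_dim, out_dim, max_rank, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((max_rank, in_dim)) / np.sqrt(in_dim)
        self.B = rng.standard_normal((out_dim, max_rank)) / np.sqrt(max_rank)
        self.max_rank = max_rank

    def forward(self, x, rank):
        if not 1 <= rank <= self.max_rank:
            raise ValueError("rank must be in [1, max_rank]")
        # Project onto the first `rank` components, then expand.
        # Cost per example is about rank * (in_dim + out_dim) multiply-adds.
        hidden = x @ self.A[:rank].T
        return hidden @ self.B[:, :rank].T


if __name__ == "__main__":
    layer = NestedLowRankLinear(in_dim=64, out_dim=64, max_rank=16)
    x = np.random.default_rng(1).standard_normal((2, 64))
    for r in (4, 8, 16):
        print(r, layer.forward(x, r).shape)
```

Because a higher rank only adds components on top of the lower-rank computation, the same weights serve every compute budget, and the cost grows linearly with the selected rank.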

Title: Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark

Authors: Siu Hang Ho, Prasad Ganesan, Nguyen Duong, Daniel Schlabig
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17894
Pdf URL: https://arxiv.org/pdf/2509.17894
Copy Paste: [[2509.17894]] Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark(https://arxiv.org/abs/2509.17894)
Keywords: diffusion, generative
Abstract: Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. While increased complexity often improves accuracy, it raises compute costs, latency, and memory requirements. This work investigates techniques such as pruning, quantization, knowledge distillation, and simplified attention to reduce computational overhead without impacting performance. The study also explores the Mixture of Experts (MoE) approach to further enhance efficiency. These experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.

Title: SingLEM: Single-Channel Large EEG Model

Authors: Jamiyan Sukhbaatar, Satoshi Imamura, Ibuki Inoue, Shoya Murakami, Kazi Mahmudul Hassan, Seungwoo Han, Ingon Chanpornpakdi, Toshihisa Tanaka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17920
Pdf URL: https://arxiv.org/pdf/2509.17920
Copy Paste: [[2509.17920]] SingLEM: Single-Channel Large EEG Model(https://arxiv.org/abs/2509.17920)
Keywords: self-supervised, foundation model
Abstract: Current deep learning models for electroencephalography (EEG) are often task-specific and depend on large labeled datasets, limiting their adaptability. Although emerging foundation models aim for broader applicability, their rigid dependence on fixed, high-density multi-channel montages restricts their use across heterogeneous datasets and in missing-channel or practical low-channel settings. To address these limitations, we introduce SingLEM, a self-supervised foundation model that learns robust, general-purpose representations from single-channel EEG, making it inherently hardware agnostic. The model employs a hybrid encoder architecture that combines convolutional layers to extract local features with a hierarchical transformer to model both short- and long-range temporal dependencies. SingLEM is pretrained on 71 public datasets comprising over 9,200 subjects and 357,000 single-channel hours of EEG. When evaluated as a fixed feature extractor across six motor imagery and cognitive tasks, aggregated single-channel representations consistently outperformed leading multi-channel foundation models and handcrafted baselines. These results demonstrate that a single-channel approach can achieve state-of-the-art generalization while enabling fine-grained neurophysiological analysis and enhancing interpretability. The source code and pretrained models are available at this https URL.

Title: Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection

Authors: Xiuqi Ge, Zhibo Yao, Yaosong Du
Subjects: cs.LG, q-bio.TO
Abstract URL: https://arxiv.org/abs/2509.17924
Pdf URL: https://arxiv.org/pdf/2509.17924
Copy Paste: [[2509.17924]] Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection(https://arxiv.org/abs/2509.17924)
Keywords: anomaly
Abstract: Clinical machine learning faces a critical dilemma in high-stakes medical applications: algorithms achieving optimal diagnostic performance typically sacrifice the interpretability essential for physician decision-making, while interpretable methods compromise sensitivity in complex scenarios. This paradox becomes particularly acute in non-invasive prenatal testing (NIPT), where missed chromosomal abnormalities carry profound clinical consequences yet regulatory frameworks mandate explainable AI systems. We introduce Medical Priority Fusion (MPF), a constrained multi-objective optimization framework that resolves this fundamental trade-off by systematically integrating Naive Bayes probabilistic reasoning with Decision Tree rule-based logic through mathematically-principled weighted fusion under explicit medical constraints. Rigorous validation on 1,687 real-world NIPT samples characterized by extreme class imbalance (43.4:1 normal-to-abnormal ratio) employed stratified 5-fold cross-validation with comprehensive ablation studies and statistical hypothesis testing using McNemar's paired comparisons. MPF achieved simultaneous optimization of dual objectives: 89.3% sensitivity (95% CI: 83.9-94.7%) with 80% interpretability score, significantly outperforming individual algorithms (McNemar's test, p < 0.001). The optimal fusion configuration achieved Grade A clinical deployment criteria with large effect size (d = 1.24), establishing the first clinically-deployable solution that maintains both diagnostic accuracy and decision transparency essential for prenatal care. This work demonstrates that medical-constrained algorithm fusion can resolve the interpretability-performance trade-off, providing a mathematical framework for developing high-stakes medical decision support systems that meet both clinical efficacy and explainability requirements.

Title: StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions

Authors: Nicholas Kraabel, Jiangtao Liu, Yuchen Bian, Daniel Kifer, Chaopeng Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17942
Pdf URL: https://arxiv.org/pdf/2509.17942
Copy Paste: [[2509.17942]] StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions(https://arxiv.org/abs/2509.17942)
Keywords: foundation model, generative
Abstract: Stewarding natural resources, mitigating floods, droughts, wildfires, and landslides, and meeting growing demands require models that can predict climate-driven land-surface responses and human feedback with high accuracy. Traditional impact models, whether process-based, statistical, or machine learning, struggle with spatial generalization due to limited observations and concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute and are ill-suited for dynamic land-surface prediction. We introduce StefaLand, a generative spatiotemporal earth foundation model centered on landscape interactions. StefaLand improves predictions on three tasks and four datasets: streamflow, soil moisture, and soil composition, compared to prior state-of-the-art. Results highlight its ability to generalize across diverse, data-scarce regions and support broad land-surface applications. The model builds on a masked autoencoder backbone that learns deep joint representations of landscape attributes, with a location-aware architecture fusing static and time-series inputs, attribute-based representations that drastically reduce compute, and residual fine-tuning adapters that enhance transfer. While inspired by prior methods, their alignment with geoscience and integration in one model enables robust performance on dynamic land-surface tasks. StefaLand can be pretrained and finetuned on academic compute yet outperforms state-of-the-art baselines and even fine-tuned vision foundation models. To our knowledge, this is the first geoscience land-surface foundation model that demonstrably improves dynamic land-surface interaction predictions and supports diverse downstream applications.

Title: Can multimodal representation learning by alignment preserve modality-specific information?

Authors: Romain Thoreau, Jessie Levillain, Dawa Derksen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17943
Pdf URL: https://arxiv.org/pdf/2509.17943
Copy Paste: [[2509.17943]] Can multimodal representation learning by alignment preserve modality-specific information?(https://arxiv.org/abs/2509.17943)
Keywords: self-supervised
Abstract: Combining multimodal data is a key issue in a wide range of machine learning tasks, including many remote sensing problems. In Earth observation, early multimodal data fusion methods were based on specific neural network architectures and supervised learning. Since then, the scarcity of labeled data has motivated self-supervised learning techniques. State-of-the-art multimodal representation learning techniques leverage the spatial alignment between satellite data from different modalities acquired over the same geographic area in order to foster a semantic alignment in the latent space. In this paper, we investigate how these methods can preserve task-relevant information that is not shared across modalities. First, we show, under simplifying assumptions, when alignment strategies fundamentally lead to an information loss. Then, we support our theoretical insight through numerical experiments in more realistic settings. With this theoretical and empirical evidence, we hope to support new developments in contrastive learning for the combination of multimodal satellite data. Our code and data are publicly available at this https URL.

Title: Budgeted Adversarial Attack against Graph-Based Anomaly Detection in Sensor Networks

Authors: Sanju Xaviar, Omid Ardakanian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.17987
Pdf URL: https://arxiv.org/pdf/2509.17987
Copy Paste: [[2509.17987]] Budgeted Adversarial Attack against Graph-Based Anomaly Detection in Sensor Networks(https://arxiv.org/abs/2509.17987)
Keywords: anomaly
Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for anomaly detection in sensor networks, particularly when analyzing multivariate time series. In this work, we introduce BETA, a novel grey-box evasion attack targeting such GNN-based detectors, where the attacker is constrained to perturb sensor readings from a limited set of nodes, excluding the target sensor, with the goal of either suppressing a true anomaly or triggering a false alarm at the target node. BETA identifies the sensors most influential to the target node's classification and injects carefully crafted adversarial perturbations into their features, all while maintaining stealth and respecting the attacker's budget. Experiments on three real-world sensor network datasets show that BETA reduces the detection accuracy of state-of-the-art GNN-based detectors by 30.62 to 39.16% on average, and significantly outperforms baseline attack strategies, while operating within realistic constraints.

Title: StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models

Authors: Haoxin Yang, Bangzhen Liu, Xuemiao Xu, Cheng Xu, Yuyang Yu, Zikai Huang, Yi Wang, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.17993
Pdf URL: https://arxiv.org/pdf/2509.17993
Copy Paste: [[2509.17993]] StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models(https://arxiv.org/abs/2509.17993)
Keywords: diffusion, self-supervised
Abstract: The advancement of diffusion models has enhanced the realism of AI-generated content but also raised concerns about misuse, necessitating robust copyright protection and tampering localization. Although recent methods have made progress toward unified solutions, their reliance on post hoc processing introduces considerable application inconvenience and compromises forensic reliability. We propose StableGuard, a novel framework that seamlessly integrates a binary watermark into the diffusion generation process, ensuring copyright protection and tampering localization in Latent Diffusion Models through an end-to-end design. We develop a Multiplexing Watermark VAE (MPW-VAE) by equipping a pretrained Variational Autoencoder (VAE) with a lightweight latent residual-based adapter, enabling the generation of paired watermarked and watermark-free images. These pairs, fused via random masks, create a diverse dataset for training a tampering-agnostic forensic network. To further enhance forensic synergy, we introduce a Mixture-of-Experts Guided Forensic Network (MoE-GFN) that dynamically integrates holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered region detection. The MPW-VAE and MoE-GFN are jointly optimized in a self-supervised, end-to-end manner, fostering a reciprocal training between watermark embedding and forensic accuracy. Extensive experiments demonstrate that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.

Title: Variation in Verification: Understanding Verification Dynamics in Large Language Models

Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17995
Pdf URL: https://arxiv.org/pdf/2509.17995
Copy Paste: [[2509.17995]] Variation in Verification: Understanding Verification Dynamics in Large Language Models(https://arxiv.org/abs/2509.17995)
Keywords: generative
Abstract: Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions - problem difficulty, generator capability, and verifier generation capability - with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than those of strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
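
Illustration (not from the paper): the generate-then-verify paradigm described in this abstract can be sketched roughly as below. This is a minimal sketch under stated assumptions: `verified_vote` is a hypothetical helper, candidates are reduced to their final answers, and the `verifier` callable stands in for an LLM that writes chain-of-thought and ends with a binary verdict. The fallback to plain majority voting is also an assumption, not something the abstract specifies.

```python
from collections import Counter

def verified_vote(candidates, verifier):
    """Pick a final answer from sampled candidates (hypothetical helper).

    candidates: list of final answers, one per sampled solution.
    verifier:   callable returning True if it accepts an answer; a stand-in
                for a generative verifier's binary verdict.
    """
    # Keep only the candidates the verifier accepts.
    accepted = [answer for answer in candidates if verifier(answer)]
    # Assumption: if nothing is accepted, fall back to plain majority voting.
    pool = accepted or candidates
    # Return the most frequent answer in the remaining pool.
    return Counter(pool).most_common(1)[0][0]

samples = ["41", "41", "42", "41", "42"]
picked_with_verifier = verified_vote(samples, lambda a: a == "42")
picked_without_accepts = verified_vote(samples, lambda a: False)
```

With a verifier that accepts only "42", the minority answer wins; when the verifier accepts nothing, the result reduces to the majority answer "41".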

Title: Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis

Authors: Joshua Ward, Xiaofeng Lin, Chi-Hua Wang, Guang Cheng
Subjects: cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2509.18014
Pdf URL: https://arxiv.org/pdf/2509.18014
Copy Paste: [[2509.18014]] Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis(https://arxiv.org/abs/2509.18014)
Keywords: generative
Abstract: Tabular Generative Models are often argued to preserve privacy by creating synthetic datasets that resemble training data. However, auditing their empirical privacy remains challenging, as commonly used similarity metrics fail to effectively characterize privacy risk. Membership Inference Attacks (MIAs) have recently emerged as a method for evaluating privacy leakage in synthetic data, but their practical effectiveness is limited. Numerous attacks exist across different threat models, each with distinct implementations targeting various sources of privacy leakage, making them difficult to apply consistently. Moreover, no single attack consistently outperforms the others, leading to a routine underestimation of privacy risk. To address these issues, we propose a unified, model-agnostic threat framework that deploys a collection of attacks to estimate the maximum empirical privacy leakage in synthetic datasets. We introduce Synth-MIA, an open-source Python library that streamlines this auditing process through a novel testbed that integrates seamlessly into existing synthetic data evaluation pipelines. Our software implements 13 attack methods through a Scikit-Learn-like API, designed to enable fast systematic estimation of privacy leakage for practitioners as well as facilitate the development of new attacks and experiments for researchers. We demonstrate our framework's utility in the largest tabular synthesis privacy benchmark to date, revealing that higher synthetic data quality corresponds to greater privacy leakage, that similarity-based privacy metrics show weak correlation with MIA results, and that the differentially private generator PATEGAN can fail to preserve privacy under such attacks. This underscores the necessity of MIA-based auditing when designing and deploying Tabular Generative Models.
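
Illustration (not from the paper, and not the Synth-MIA API): the idea of running several attacks and reporting the worst case can be sketched as below. This is a minimal sketch under stated assumptions: the distance-to-closest-record score is a generic membership signal chosen for illustration, and `dcr_scores`, `attack_auc`, and `max_leakage` are hypothetical names. The abstract does not say which 13 attacks the library implements or how leakage is scored.

```python
def dcr_scores(candidates, synthetic):
    """Score each candidate by negative squared distance to its closest
    synthetic record; a higher score suggests a training member."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [-min(sqdist(c, s) for s in synthetic) for c in candidates]

def attack_auc(member_scores, non_member_scores):
    """AUC as the fraction of (member, non-member) pairs ranked correctly,
    counting ties as half."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(non_member_scores))

def max_leakage(auc_by_attack):
    """Assumption: summarize leakage as the best AUC over all attacks."""
    return max(auc_by_attack.values())

synthetic = [[0.0, 0.0], [1.0, 1.0]]
members = [[0.0, 0.1], [1.0, 0.9]]        # records close to synthetic rows
non_members = [[5.0, 5.0], [3.0, -3.0]]   # records far from synthetic rows
auc = attack_auc(dcr_scores(members, synthetic),
                 dcr_scores(non_members, synthetic))
worst_case = max_leakage({"dcr": auc, "random_guess": 0.5})
```

Taking the maximum over attacks reflects the abstract's point that no single attack dominates, so any one attack alone may underestimate risk.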

Title: Hybrid Reputation Aggregation: A Robust Defense Mechanism for Adversarial Federated Learning in 5G and Edge Network Environments

Authors: Saeid Sheikhi, Panos Kostakos, Lauri Loven
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18044
Pdf URL: https://arxiv.org/pdf/2509.18044
Copy Paste: [[2509.18044]] Hybrid Reputation Aggregation: A Robust Defense Mechanism for Adversarial Federated Learning in 5G and Edge Network Environments(https://arxiv.org/abs/2509.18044)
Keywords: anomaly
Abstract: Federated Learning (FL) in 5G and edge network environments faces severe security threats from adversarial clients. Malicious participants can perform label flipping, inject backdoor triggers, or launch Sybil attacks to corrupt the global model. This paper introduces Hybrid Reputation Aggregation (HRA), a novel robust aggregation mechanism designed to defend against diverse adversarial behaviors in FL without prior knowledge of the attack type. HRA combines geometric anomaly detection with momentum-based reputation tracking of clients. In each round, it detects outlier model updates via distance-based geometric analysis while continuously updating a trust score for each client based on historical behavior. This hybrid approach enables adaptive filtering of suspicious updates and long-term penalization of unreliable clients, countering attacks ranging from backdoor insertions to random noise Byzantine failures. We evaluate HRA on a large-scale proprietary 5G network dataset (3M+ records) and the widely used NF-CSE-CIC-IDS2018 benchmark under diverse adversarial attack scenarios. Experimental results reveal that HRA achieves robust global model accuracy of up to 98.66% on the 5G dataset and 96.60% on NF-CSE-CIC-IDS2018, outperforming state-of-the-art aggregators such as Krum, Trimmed Mean, and Bulyan by significant margins. Our ablation studies further demonstrate that the full hybrid system achieves 98.66% accuracy, while the anomaly-only and reputation-only variants drop to 84.77% and 78.52%, respectively, validating the synergistic value of our dual-mechanism approach. This demonstrates HRA's enhanced resilience and robustness in 5G/edge federated learning deployments, even under significant adversarial conditions.
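
Illustration (not from the paper): one aggregation round combining distance-based outlier detection with a momentum-updated trust score might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the coordinate-wise median as the reference point, the median-plus-`k`-MAD threshold, the exponential-moving-average reputation update, the `min_rep` cutoff, and the name `hra_round`. The abstract does not give HRA's actual formulas.

```python
from statistics import median

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hra_round(updates, reputation, momentum=0.9, k=3.0, min_rep=0.5):
    """One hypothetical round. updates: {client_id: [float, ...]};
    reputation: {client_id: float}, modified in place."""
    dim = len(next(iter(updates.values())))
    # Geometric step: distance of each update from the coordinate-wise median.
    center = [median(u[i] for u in updates.values()) for i in range(dim)]
    dist = {c: l2(u, center) for c, u in updates.items()}
    med = median(dist.values())
    mad = median(abs(d - med) for d in dist.values())
    flagged = {c for c, d in dist.items() if d > med + k * max(mad, 1e-12)}
    # Reputation step: moving average of per-round behaviour (1 = not flagged).
    for c in updates:
        score = 0.0 if c in flagged else 1.0
        reputation[c] = momentum * reputation.get(c, 1.0) + (1 - momentum) * score
    # Aggregate unflagged clients whose reputation is at least min_rep.
    trusted = [c for c in updates if c not in flagged and reputation[c] >= min_rep]
    if not trusted:
        return center, flagged  # assumption: fall back to the robust center
    total = sum(reputation[c] for c in trusted)
    agg = [sum(reputation[c] * updates[c][i] for c in trusted) / total
           for i in range(dim)]
    return agg, flagged

updates = {"a": [1.0, 1.0], "b": [1.1, 0.9], "c": [0.9, 1.1], "d": [50.0, -50.0]}
reputation = {}
aggregate, flagged = hra_round(updates, reputation)
```

In this toy round, client "d" is far from the others, so it is excluded from the aggregate and its trust score drops, while the remaining clients are averaged with reputation weights.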

Title: Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Authors: Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.18085
Pdf URL: https://arxiv.org/pdf/2509.18085
Copy Paste: [[2509.18085]] Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding(https://arxiv.org/abs/2509.18085)
Keywords: diffusion
Abstract: Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.

Title: ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Authors: Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.18092
Pdf URL: https://arxiv.org/pdf/2509.18092
Copy Paste: [[2509.18092]] ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation(https://arxiv.org/abs/2509.18092)
Keywords: diffusion
Abstract: Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation.

Title: Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Authors: Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.18096
Pdf URL: https://arxiv.org/pdf/2509.18096
Copy Paste: [[2509.18096]] Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers(https://arxiv.org/abs/2509.18096)
Keywords: diffusion
Abstract: Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.