2025-09-30

Title: Pathological Truth Bias in Vision-Language Models

Authors: Yash Thube
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22674
Pdf URL: https://arxiv.org/pdf/2509.22674
Copy Paste: [[2509.22674]] Pathological Truth Bias in Vision-Language Models(https://arxiv.org/abs/2509.22674)
Keywords: generative
Abstract: Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid to late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.

Title: GZSL-MoE: Apprentissage G{é}n{é}ralis{é} Z{é}ro-Shot bas{é} sur le M{é}lange d'Experts pour la Segmentation S{é}mantique de Nuages de Points 3DAppliqu{é} {à} un Jeu de Donn{é}es d'Environnement de Collaboration Humain-Robot

Authors: Ahed Alboody (LINEACT)
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22708
Pdf URL: https://arxiv.org/pdf/2509.22708
Copy Paste: [[2509.22708]] GZSL-MoE: Apprentissage G{é}n{é}ralis{é} Z{é}ro-Shot bas{é} sur le M{é}lange d'Experts pour la Segmentation S{é}mantique de Nuages de Points 3DAppliqu{é} {à} un Jeu de Donn{é}es d'Environnement de Collaboration Humain-Robot(https://arxiv.org/abs/2509.22708)
Keywords: generative
Abstract: Generative Zero-Shot Learning approach (GZSL) has demonstrated significant potential in 3D point cloud semantic segmentation tasks. GZSL leverages generative models like GANs or VAEs to synthesize realistic features (real features) of unseen classes. This allows the model to label unseen classes during testing, despite being trained only on seen classes. In this context, we introduce the Generalized Zero-Shot Learning based-upon Mixture-of-Experts (GZSL-MoE) model. This model incorporates Mixture-of-Experts layers (MoE) to generate fake features that closely resemble real features extracted using a pre-trained KPConv (Kernel Point Convolution) model on seen classes. The main contribution of this paper is the integration of Mixture-of-Experts into the Generator and Discriminator components of the Generative Zero-Shot Learning model for 3D point cloud semantic segmentation, applied to the COVERED dataset (CollabOratiVE Robot Environment Dataset) for Human-Robot Collaboration (HRC) environments. By combining the Generative Zero-Shot Learning model with Mixture-of- Experts, GZSL-MoE for 3D point cloud semantic segmentation provides a promising solution for understanding complex 3D environments, especially when comprehensive training data for all object classes is unavailable. The performance evaluation of the GZSL-MoE model highlights its ability to enhance performance on both seen and unseen classes. Keywords Generalized Zero-Shot Learning (GZSL), 3D Point Cloud, 3D Semantic Segmentation, Human-Robot Collaboration, COVERED (CollabOratiVE Robot Environment Dataset), KPConv, Mixture-of Experts

Title: LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Authors: Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22720
Pdf URL: https://arxiv.org/pdf/2509.22720
Copy Paste: [[2509.22720]] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning(https://arxiv.org/abs/2509.22720)
Keywords: diffusion
Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, in this paper, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images with target objects in them, our method first employs visual-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Then we leverage compositional diffusion-a method traditionally used in robotics-to synthesize bounding boxes that respect object relations encoded in the scene graph for spatial layouts. In the end, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

Title: Responsible Diffusion: A Comprehensive Survey on Safety, Ethics, and Trust in Diffusion Models

Authors: Kang Wei, Xin Yuan, Fushuo Huo, Chuan Ma, Long Yuan, Songze Li, Ming Ding, Dacheng Tao
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2509.22723
Pdf URL: https://arxiv.org/pdf/2509.22723
Copy Paste: [[2509.22723]] Responsible Diffusion: A Comprehensive Survey on Safety, Ethics, and Trust in Diffusion Models(https://arxiv.org/abs/2509.22723)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have been investigated in various domains due to their ability to generate high-quality data, thereby attracting significant attention. However, similar to traditional deep learning systems, there also exist potential threats to DMs. To provide advanced and comprehensive insights into safety, ethics, and trust in DMs, this survey comprehensively elucidates its framework, threats, and countermeasures. Each threat and its countermeasures are systematically examined and categorized to facilitate thorough analysis. Furthermore, we introduce specific examples of how DMs are used, what dangers they might bring, and ways to protect against these dangers. Finally, we discuss key lessons learned, highlight open challenges related to DM security, and outline prospective research directions in this critical field. This work aims to accelerate progress not only in the technical capabilities of generative artificial intelligence but also in the maturity and wisdom of its application.

Title: Enabling Approximate Joint Sampling in Diffusion LMs

Authors: Parikshit Bansal, Sujay Sanghavi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22738
Pdf URL: https://arxiv.org/pdf/2509.22738
Copy Paste: [[2509.22738]] Enabling Approximate Joint Sampling in Diffusion LMs(https://arxiv.org/abs/2509.22738)
Keywords: diffusion
Abstract: In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating an overall string sampled from the correct underlying joint distribution would (again) require exactly one token unmasking in every full-model forward pass. The more tokens unmasked in parallel, the further away the string is from the true joint; this can be seen in the resulting drop in accuracy (but, increase in speed). In this paper we devise a way to {\em approximately} sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer ``sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, to yield multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base) and instruction-tuned (Dream-7B-Instruct) models on language modeling and math \& coding tasks. When four tokens are unmasked for each full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution.

Title: Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

Authors: Sasha Cui, Zhongren Chen
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.22739
Pdf URL: https://arxiv.org/pdf/2509.22739
Copy Paste: [[2509.22739]] Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models(https://arxiv.org/abs/2509.22739)
Keywords: in-context
Abstract: Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.

Title: Red Teaming Quantum-Resistant Cryptographic Standards: A Penetration Testing Framework Integrating AI and Quantum Security

Authors: Petar Radanliev
Subjects: cs.CR, cs.AI, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2509.22757
Pdf URL: https://arxiv.org/pdf/2509.22757
Copy Paste: [[2509.22757]] Red Teaming Quantum-Resistant Cryptographic Standards: A Penetration Testing Framework Integrating AI and Quantum Security(https://arxiv.org/abs/2509.22757)
Keywords: anomaly
Abstract: This study presents a structured approach to evaluating vulnerabilities within quantum cryptographic protocols, focusing on the BB84 quantum key distribution method and National Institute of Standards and Technology (NIST) approved quantum-resistant algorithms. By integrating AI-driven red teaming, automated penetration testing, and real-time anomaly detection, the research develops a framework for assessing and mitigating security risks in quantum networks. The findings demonstrate that AI can be effectively used to simulate adversarial attacks, probe weaknesses in cryptographic implementations, and refine security mechanisms through iterative feedback. The use of automated exploit simulations and protocol fuzzing provides a scalable means of identifying latent vulnerabilities, while adversarial machine learning techniques highlight novel attack surfaces within AI-enhanced cryptographic processes. This study offers a comprehensive methodology for strengthening quantum security and provides a foundation for integrating AI-driven cybersecurity practices into the evolving quantum landscape.

Title: In-Context Learning can Perform Continual Learning Like Humans

Authors: Liuwang Kang, Fan Wang, Shaoshan Liu, Hung-Chyun Chou, Chuan Lin, Ning Ding
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22764
Pdf URL: https://arxiv.org/pdf/2509.22764
Copy Paste: [[2509.22764]] In-Context Learning can Perform Continual Learning Like Humans(https://arxiv.org/abs/2509.22764)
Keywords: in-context
Abstract: Large language models (LLMs) can adapt to new tasks via in-context learning (ICL) without parameter updates, making them powerful learning engines for fast adaptation. While extensive research has examined ICL as a few-shot learner, whether it can achieve long-term retention and cross-task knowledge accumulation when multitasks arrive sequentially remains underexplored. Motivated by human memory studies, we investigate the retention characteristics of ICL in multitask settings and extend it to in-context continual learning (ICCL), where continual learning ability emerges through task scheduling and prompt rearrangement. Experiments on Markov-Chain benchmarks demonstrate that, for specific large-language models, ICCL benefits from distributed practice (DP) in a manner analogous to humans, consistently revealing a spacing "sweet spot" for retention. Beyond retention performance, we propose a human-retention similarity metric to quantify how closely a continual-learning (CL) method aligns with human retention dynamics. Using this metric, we show that linear-attention models such as MAMBA and RWKV exhibit particularly human-like retention patterns, despite their retention performance lagging behind that of Transformer-based LLMs. Overall, our results establish ICCL as both cognitively plausible and practically effective, providing an inference-only CL paradigm that mitigates catastrophic forgetting and addresses the stability-plasticity dilemma in conventional CL methods.

Title: DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

Authors: Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22793
Pdf URL: https://arxiv.org/pdf/2509.22793
Copy Paste: [[2509.22793]] DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models(https://arxiv.org/abs/2509.22793)
Keywords: diffusion, in-context
Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. The single trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available on \href{this https URL}{DEFTBase}.

Title: VideoScore2: Think before You Score in Generative Video Evaluation

Authors: Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.22799
Pdf URL: https://arxiv.org/pdf/2509.22799
Copy Paste: [[2509.22799]] VideoScore2: Think before You Score in Generative Video Evaluation(https://arxiv.org/abs/2509.22799)
Keywords: generative
Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: this https URL

Title: ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

Authors: Mohamed Maged, Alhassan Ehab, Ali Mekky, Besher Hassan, Shady Shehata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22808
Pdf URL: https://arxiv.org/pdf/2509.22808
Copy Paste: [[2509.22808]] ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection(https://arxiv.org/abs/2509.22808)
Keywords: generative
Abstract: With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic that have received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, we aimed to guide the construction of our final dataset either by merging audios from multiple models or by selecting the best-performing model, we conducted an evaluation pipeline that included training classifiers using two approaches: modern embedding-based methods combined with classifier heads; classical machine learning algorithms applied to MFCC features; and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

Title: Seeing Isn't Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN

Authors: Roie Kazoom, Alon Goldberg, Hodaya Cohen, Ofer Hadar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22836
Pdf URL: https://arxiv.org/pdf/2509.22836
Copy Paste: [[2509.22836]] Seeing Isn't Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN(https://arxiv.org/abs/2509.22836)
Keywords: generative
Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image x and the target class y target, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability-the three most challenging dimensions of patch-based attacks-our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.

Title: Adaptive Margin RLHF via Preference over Preferences

Authors: Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.22851
Pdf URL: https://arxiv.org/pdf/2509.22851
Copy Paste: [[2509.22851]] Adaptive Margin RLHF via Preference over Preferences(https://arxiv.org/abs/2509.22851)
Keywords: generative
Abstract: Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences, for example some preferences are associated with larger margins between responses, or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.

Title: Towards Generalizable Implicit In-Context Learning with Attention Routing

Authors: Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22854
Pdf URL: https://arxiv.org/pdf/2509.22854
Copy Paste: [[2509.22854]] Towards Generalizable Implicit In-Context Learning with Attention Routing(https://arxiv.org/abs/2509.22854)
Keywords: in-context
Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of Large Language Models (LLMs), aiming to attain few-shot performance at zero-shot cost. However, existing approaches largely rely on injecting shift vectors into residual flows, which are typically constructed from labeled demonstrations or task-specific alignment. Such designs fall short of utilizing the structural mechanisms underlying ICL and suffer from limited generalizability. To address this, we propose In-Context Routing (ICR), a novel implicit ICL method that internalizes generalizable ICL patterns at the attention logits level. It extracts reusable structural directions that emerge during ICL and employs a learnable input-conditioned router to modulate attention logits accordingly, enabling a train-once-and-reuse framework. We evaluate ICR on 12 real-world datasets spanning diverse domains and multiple LLMs. The results show that ICR consistently outperforms prior implicit ICL methods that require task-specific retrieval or training, while demonstrating robust generalization to out-of-domain tasks where existing methods struggle. These findings position ICR to push the boundary of ICL's practical value.

Title: ControlEvents: Controllable Synthesis of Event Camera Datawith Foundational Prior from Image Diffusion Models

Authors: Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22864
Pdf URL: https://arxiv.org/pdf/2509.22864
Copy Paste: [[2509.22864]] ControlEvents: Controllable Synthesis of Event Camera Datawith Foundational Prior from Image Diffusion Models(https://arxiv.org/abs/2509.22864)
Keywords: diffusion, foundation model, generative
Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

Title: From Noise to Knowledge: A Comparative Study of Acoustic Anomaly Detection Models in Pumped-storage Hydropower Plants

Authors: Karim Khamaisi, Nicolas Keller, Stefan Krummenacher, Valentin Huber, Bernhard Fässler, Bruno Rodrigues
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22881
Pdf URL: https://arxiv.org/pdf/2509.22881
Copy Paste: [[2509.22881]] From Noise to Knowledge: A Comparative Study of Acoustic Anomaly Detection Models in Pumped-storage Hydropower Plants(https://arxiv.org/abs/2509.22881)
Keywords: anomaly
Abstract: In the context of industrial factories and energy producers, unplanned outages are highly costly and difficult to service. However, existing acoustic-anomaly detection studies largely rely on generic industrial or synthetic datasets, with few focused on hydropower plants due to limited access. This paper presents a comparative analysis of acoustic-based anomaly detection methods, as a way to improve predictive maintenance in hydropower plants. We address key challenges in the acoustic preprocessing under highly noisy conditions before extracting time- and frequency-domain features. Then, we benchmark three machine learning models: LSTM AE, K-Means, and OC-SVM, which are tested on two real-world datasets from the Rodundwerk II pumped-storage plant in Austria, one with induced anomalies and one with real-world conditions. The One-Class SVM achieved the best trade-off of accuracy (ROC AUC 0.966-0.998) and minimal training time, while the LSTM autoencoder delivered strong detection (ROC AUC 0.889-0.997) at the expense of higher computational cost.

Title: Convolutional Set Transformer

Authors: Federico Chinello, Giacomo Boracchi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22889
Pdf URL: https://arxiv.org/pdf/2509.22889
Copy Paste: [[2509.22889]] Convolutional Set Transformer(https://arxiv.org/abs/2509.22889)
Keywords: anomaly
Abstract: We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (this https URL).

Title: Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22925
Pdf URL: https://arxiv.org/pdf/2509.22925
Copy Paste: [[2509.22925]] Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings(https://arxiv.org/abs/2509.22925)
Keywords: diffusion
Abstract: One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generator while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, along with higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.

Title: FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning

Authors: Chenghan Yang, Peng Zhou, Dong-Sheng Zhang, Yueyun Wang, Hong-Bin Shen, Xiaoyong Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22930
Pdf URL: https://arxiv.org/pdf/2509.22930
Copy Paste: [[2509.22930]] FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning(https://arxiv.org/abs/2509.22930)
Keywords: diffusion
Abstract: Traditional marine biological image recognition faces challenges of incomplete datasets and unsatisfactory model accuracy, particularly for few-shot conditions of rare species where data scarcity significantly hampers the performance. To address these issues, this study proposes an intelligent marine fish recognition framework, FishAI 2.0, integrating multimodal few-shot deep learning techniques with image generation for data augmentation. First, a hierarchical marine fish benchmark dataset, which provides a comprehensive data foundation for subsequent model training, is utilized to train the FishAI 2.0 model. To address the data scarcity of rare classes, the large language model DeepSeek was employed to generate high-quality textual descriptions, which are input into Stable Diffusion 2 for image augmentation through a hierarchical diffusion strategy that extracts latent encoding to construct a multimodal feature space. The enhanced visual-textual datasets were then fed into a Contrastive Language-Image Pre-Training (CLIP) based model, enabling robust few-shot image recognition. Experimental results demonstrate that FishAI 2.0 achieves a Top-1 accuracy of 91.67 percent and Top-5 accuracy of 97.97 percent at the family level, outperforming baseline CLIP and ViT models with a substantial margin for the minority classes with fewer than 10 training samples. To better apply FishAI 2.0 to real-world scenarios, at the genus and species level, FishAI 2.0 respectively achieves a Top-1 accuracy of 87.58 percent and 85.42 percent, demonstrating practical utility. In summary, FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, highlighting its scientific value and practical applicability.

Title: LLMs Behind the Scenes: Enabling Narrative Scene Illustration

Authors: Melissa Roemmele, John Joon Young Chung, Taewook Kim, Yuqian Sun, Alex Calderwood, Max Kreminski
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.22940
Pdf URL: https://arxiv.org/pdf/2509.22940
Copy Paste: [[2509.22940]] LLMs Behind the Scenes: Enabling Narrative Scene Illustration(https://arxiv.org/abs/2509.22940)
Keywords: generative
Abstract: Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.

Title: What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

Authors: Mohammed Sabry, Anya Belz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22947
Pdf URL: https://arxiv.org/pdf/2509.22947
Copy Paste: [[2509.22947]] What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?(https://arxiv.org/abs/2509.22947)
Keywords: in-context
Abstract: Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows larger natural-only models develop broader, earlier induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns with minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore mechanism-aware pretraining diagnostics and data mixtures that foster load-bearing, not merely present, structure.

Title: GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes

Authors: Valentyn Melnychuk, Stefan Feuerriegel
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.22953
Pdf URL: https://arxiv.org/pdf/2509.22953
Copy Paste: [[2509.22953]] GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes(https://arxiv.org/abs/2509.22953)
Keywords: diffusion, generative
Abstract: Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed GDR-learners are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.

Title: Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Authors: Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22957
Pdf URL: https://arxiv.org/pdf/2509.22957
Copy Paste: [[2509.22957]] Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas(https://arxiv.org/abs/2509.22957)
Keywords: generative
Abstract: As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.

Title: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Authors: Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Idan Szpektor, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22963
Pdf URL: https://arxiv.org/pdf/2509.22963
Copy Paste: [[2509.22963]] Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces(https://arxiv.org/abs/2509.22963)
Keywords: diffusion
Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

Title: Emergent morpho-phonological representations in self-supervised speech models

Authors: Jon Gauthier, Canaan Breiss, Matthew Leonard, Edward F. Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.22973
Pdf URL: https://arxiv.org/pdf/2509.22973
Copy Paste: [[2509.22973]] Emergent morpho-phonological representations in self-supervised speech models(https://arxiv.org/abs/2509.22973)
Keywords: self-supervised
Abstract: Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon -- often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.

Title: ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View

Authors: Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23008
Pdf URL: https://arxiv.org/pdf/2509.23008
Copy Paste: [[2509.23008]] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View(https://arxiv.org/abs/2509.23008)
Keywords: diffusion, generative
Abstract: Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce \textbf{ARSS}, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then to enhance generation quality while preserving the autoregressive structure, we propose a autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.

Title: Planning with Unified Multimodal Models

Authors: Yihao Sun, Zhilong Zhang, Yang Yu, Pierre-Luc Bacon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23014
Pdf URL: https://arxiv.org/pdf/2509.23014
Copy Paste: [[2509.23014]] Planning with Unified Multimodal Models(https://arxiv.org/abs/2509.23014)
Keywords: generative
Abstract: With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach self-discriminated filtering, where the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

Title: On the Sheafification of Higher-Order Message Passing

Authors: Jacob Hume, Pietro Liò
Subjects: cs.LG, math.AT
Abstract URL: https://arxiv.org/abs/2509.23020
Pdf URL: https://arxiv.org/pdf/2509.23020
Copy Paste: [[2509.23020]] On the Sheafification of Higher-Order Message Passing(https://arxiv.org/abs/2509.23020)
Keywords: diffusion
Abstract: Recent work in Topological Deep Learning (TDL) seeks to generalize graph learning's preeminent $message \ passing$ paradigm to more complex relational structures: simplicial complexes, cell complexes, hypergraphs, and combinations thereof. Many approaches to such ${higher\text{-}order \ message \ passing}$ (HOMP) admit formulation in terms of nonlinear diffusion with the Hodge (combinatorial) Laplacian, a graded operator which carries an inductive bias that dimension-$k$ data features correlate with dimension-$k$ topological features encoded in the (singular) cohomology of the underlying domain. For $k=0$ this recovers the graph Laplacian and its well-studied homophily bias. In higher gradings, however, the Hodge Laplacian's bias is more opaque and potentially even degenerate. In this essay, we position sheaf theory as a natural and principled formalism for modifying the Hodge Laplacian's diffusion-mediated interface between local and global descriptors toward more expressive message passing. The sheaf Laplacian's inductive bias correlates dimension-$k$ data features with dimension-$k$ $sheaf$ cohomology, a data-aware generalization of singular cohomology. We will contextualize and novelly extend prior theory on sheaf diffusion in graph learning ($k=0$) in such a light -- and explore how it fails to generalize to $k>0$ -- before developing novel theory and practice for the higher-order setting. Our exposition is accompanied by a self-contained introduction shepherding sheaves from the abstract to the applied.

Title: Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy

Authors: Xiafeng Man, Zhipeng Wei, Jingjing Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23022
Pdf URL: https://arxiv.org/pdf/2509.23022
Copy Paste: [[2509.23022]] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy(https://arxiv.org/abs/2509.23022)
Keywords: diffusion, generative
Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model's output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. Besides, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringement content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.

Title: Dynamics of Learning: Generative Schedules from Latent ODEs

Authors: Matt L. Sampson, Peter Melchior
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23052
Pdf URL: https://arxiv.org/pdf/2509.23052
Copy Paste: [[2509.23052]] Dynamics of Learning: Generative Schedules from Latent ODEs(https://arxiv.org/abs/2509.23052)
Keywords: generative
Abstract: The learning rate schedule is one of the most impactful aspects of neural network optimization, yet most schedules either follow simple parametric functions or react only to short-term training signals. None of them are supported by a comprehensive temporal view of how well neural networks actually train. We present a new learning rate scheduler that models the training performance of neural networks as a dynamical system. It leverages training runs from a hyperparameter search to learn a latent representation of the training process. Given current training metrics, it predicts the future learning rate schedule with the best long-term validation performance. Our scheduler generalizes beyond previously observed training dynamics and creates specialized schedules that deviate noticeably from common parametric functions. It achieves SOTA results for image classification with CNN and ResNet models as well as for next-token prediction with a transformer model. The trained models are located in flatter regions of the loss landscape and thus provide better generalization than those trained with other schedules. Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top of ML experiment-tracking platforms. An implementation of our scheduler will be made available after acceptance.

Title: Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis

Authors: Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23054
Pdf URL: https://arxiv.org/pdf/2509.23054
Copy Paste: [[2509.23054]] Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis(https://arxiv.org/abs/2509.23054)
Keywords: self-supervised
Abstract: The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40\% vs. 70\%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.

Title: CLAD-Net: Continual Activity Recognition in Multi-Sensor Wearable Systems

Authors: Reza Rahimi Azghan, Gautham Krishna Gudur, Mohit Malu, Edison Thomaz, Giulia Pedrielli, Pavan Turaga, Hassan Ghasemzadeh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23077
Pdf URL: https://arxiv.org/pdf/2509.23077
Copy Paste: [[2509.23077]] CLAD-Net: Continual Activity Recognition in Multi-Sensor Wearable Systems(https://arxiv.org/abs/2509.23077)
Keywords: self-supervised
Abstract: The rise of deep learning has greatly advanced human behavior monitoring using wearable sensors, particularly human activity recognition (HAR). While deep models have been widely studied, most assume stationary data distributions - an assumption often violated in real-world scenarios. For example, sensor data from one subject may differ significantly from another, leading to distribution shifts. In continual learning, this shift is framed as a sequence of tasks, each corresponding to a new subject. Such settings suffer from catastrophic forgetting, where prior knowledge deteriorates as new tasks are learned. This challenge is compounded by the scarcity and inconsistency of labeled data in human studies. To address these issues, we propose CLAD-Net (Continual Learning with Attention and Distillation), a framework enabling wearable-sensor models to be updated continuously without sacrificing performance on past tasks. CLAD-Net integrates a self-supervised transformer, acting as long-term memory, with a supervised Convolutional Neural Network (CNN) trained via knowledge distillation for activity classification. The transformer captures global activity patterns through cross-attention across body-mounted sensors, learning generalizable representations without labels. Meanwhile, the CNN leverages knowledge distillation to retain prior knowledge during subject-wise fine-tuning. On PAMAP2, CLAD-Net achieves 91.36 percent final accuracy with only 8.78 percent forgetting, surpassing memory-based and regularization-based baselines such as Experience Replay and Elastic Weight Consolidation. In semi-supervised settings with only 10-20 percent labeled data, CLAD-Net still delivers strong performance, demonstrating robustness to label scarcity. Ablation studies further validate each module's contribution.

Title: Follow-Your-Preference: Towards Preference-Aligned Image Inpainting

Authors: Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23082
Pdf URL: https://arxiv.org/pdf/2509.23082
Copy Paste: [[2509.23082]] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting(https://arxiv.org/abs/2509.23082)
Keywords: generative
Abstract: This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, render them susceptible to cause reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: this https URL.

Title: Demystifying Network Foundation Models

Authors: Sylee (Roman)Beltiukov, Satyandra Guthula, Wenbo Guo, Walter Willinger, Arpit Gupta
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2509.23089
Pdf URL: https://arxiv.org/pdf/2509.23089
Copy Paste: [[2509.23089]] Demystifying Network Foundation Models(https://arxiv.org/abs/2509.23089)
Keywords: foundation model
Abstract: This work presents a systematic investigation into the latent knowledge encoded within Network Foundation Models (NFMs) that focuses on hidden representations analysis rather than pure downstream task performance. Different from existing efforts, we analyze the models through a three-part evaluation: Embedding Geometry Analysis to assess representation space utilization, Metric Alignment Assessment to measure correspondence with domain-expert features, and Causal Sensitivity Testing to evaluate robustness to protocol perturbations. Using five diverse network datasets spanning controlled and real-world environments, we evaluate four state-of-the-art NFMs, revealing that they all exhibit significant anisotropy, inconsistent feature sensitivity patterns, an inability to separate the high-level context, payload dependency, and other properties. Our work identifies numerous limitations across all models and demonstrates that addressing them can significantly improve model performance (by up to +0.35 $F_1$ score without architectural changes).

Title: Sensitivity Analysis for Diffusion Models

Authors: Christopher Scarvelis, Justin Solomon
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23092
Pdf URL: https://arxiv.org/pdf/2509.23092
Copy Paste: [[2509.23092]] Sensitivity Analysis for Diffusion Models(https://arxiv.org/abs/2509.23092)
Keywords: diffusion
Abstract: Training a diffusion model approximates a map from a data distribution $\rho$ to the optimal score function $s_t$ for that distribution. Can we differentiate this map? If we could, then we could predict how the score, and ultimately the model's samples, would change under small perturbations to the training set before committing to costly retraining. We give a closed-form procedure for computing this map's directional derivatives, relying only on black-box access to a pre-trained score model and its derivatives with respect to its inputs. We extend this result to estimate the sensitivity of a diffusion model's samples to additive perturbations of its target measure, with runtime comparable to sampling from a diffusion model and computing log-likelihoods along the sample path. Our method is robust to numerical and approximation error, and the resulting sensitivities correlate with changes in an image diffusion model's samples after retraining and fine-tuning.

Title: d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23094
Pdf URL: https://arxiv.org/pdf/2509.23094
Copy Paste: [[2509.23094]] d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching(https://arxiv.org/abs/2509.23094)
Keywords: diffusion
Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce \textit{Dual aDaptive Cache} (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (\ie, LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at this https URL.

Title: Streamline pathology foundation model by cross-magnification distillation

Authors: Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Anil V. Parwani, Muhammad Khalid Khan Niazi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23097
Pdf URL: https://arxiv.org/pdf/2509.23097
Copy Paste: [[2509.23097]] Streamline pathology foundation model by cross-magnification distillation(https://arxiv.org/abs/2509.23097)
Keywords: foundation model
Abstract: Foundation models (FM) have transformed computational pathology but remain computationally prohibitive for clinical deployment due to their massive parameter counts and high-magnification processing requirements. Here, we introduce XMAG, a lightweight FM developed through corss-magnification distillation that transfers knowledge from state-of-the-art 20x magnification teacher to an efficient 5x magnification student architecture. XMAG employs a compact backbone and operates entirely at 5x, requiring 11.3 times fewer patches per whole slide image (WSI) compared to existing approaches. Our Novel distillation framework incorporates dual-level knowledge transfer, aligning both global image representations and local spatial token mapping. We trained XMAG on 3.49 million images curated from publicly available datasets and evaluated performance across six clinically relevant histopathology analysis tasks spanning multiple cancer types. XMAG achieved diagnostic accuracy within 1% of substantially larger foundation models while delivering 30-fold processing acceleration, reaching 8.8 WSIs per minute processing speed. Our cross-institutional validation confirmed robust generalization. Further, we developed an end-to-end training strategy to further boost our model's performance to approach the larger FMs' performance. These results establish cross-magnification distillation as a viable approach for deploying FM capabilities in resource-constrained clinical environments, potentially enabling real-time pathology AI integration.

Title: How to Make Large Language Models Generate 100% Valid Molecules?

Authors: Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, Yiwei Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23099
Pdf URL: https://arxiv.org/pdf/2509.23099
Copy Paste: [[2509.23099]] How to Make Large Language Models Generate 100% Valid Molecules?(https://arxiv.org/abs/2509.23099)
Keywords: generative
Abstract: Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at this https URL.

Title: Stochastic Interpolants via Conditional Dependent Coupling

Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23122
Pdf URL: https://arxiv.org/pdf/2509.23122
Copy Paste: [[2509.23122]] Stochastic Interpolants via Conditional Dependent Coupling(https://arxiv.org/abs/2509.23122)
Keywords: diffusion, generative
Abstract: Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.

Title: Impute-MACFM: Imputation based on Mask-Aware Flow Matching

Authors: Dengyi Liu, Honggang Wang, Hua Fang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23126
Pdf URL: https://arxiv.org/pdf/2509.23126
Copy Paste: [[2509.23126]] Impute-MACFM: Imputation based on Mask-Aware Flow Matching(https://arxiv.org/abs/2509.23126)
Keywords: generative
Abstract: Tabular data are central to many applications, especially longitudinal data in healthcare, where missing values are common, undermining model fidelity and reliability. Prior imputation methods either impose restrictive assumptions or struggle with complex cross-feature structure, while recent generative approaches suffer from instability and costly inference. We propose Impute-MACFM, a mask-aware conditional flow matching framework for tabular imputation that addresses missingness mechanisms, missing completely at random, missing at random, and missing not at random. Its mask-aware objective builds trajectories only on missing entries while constraining predicted velocity to remain near zero on observed entries, using flexible nonlinear schedules. Impute-MACFM combines: (i) stability penalties on observed positions, (ii) consistency regularization enforcing local invariance, and (iii) time-decayed noise injection for numeric features. Inference uses constraint-preserving ordinary differential equation integration with per-step projection to fix observed values, optionally aggregating multiple trajectories for robustness. Across diverse benchmarks, Impute-MACFM achieves state-of-the-art results while delivering more robust, efficient, and higher-quality imputation than competing approaches, establishing flow matching as a promising direction for tabular missing-data problems, including longitudinal data.

Title: Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT

Authors: Donghao Zhang, Yimin Chen, Kauê TN Duarte, Taha Aslan, Mohamed AlShamrani, Brij Karmur, Yan Wan, Shengcai Chen, Bo Hu, Bijoy K Menon, Wu Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23132
Pdf URL: https://arxiv.org/pdf/2509.23132
Copy Paste: [[2509.23132]] Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT(https://arxiv.org/abs/2509.23132)
Keywords: self-supervised, anomaly
Abstract: Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal to noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at this https URL.

Title: Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models

Authors: Zichao Yu, Ming Li, Wenyi Zhang, Weiguo Gao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23146
Pdf URL: https://arxiv.org/pdf/2509.23146
Copy Paste: [[2509.23146]] Tree Reward-Aligned Search for TReASURe in Masked Diffusion Language Models(https://arxiv.org/abs/2509.23146)
Keywords: diffusion, generative
Abstract: Tree search has recently emerged as a powerful framework for aligning generative models with task-specific rewards at test time. Applying tree search to Masked Diffusion Language Models, however, introduces two key challenges: (i) parallel unmasking yields highly correlated branches, limiting exploration, and (ii) reward evaluation via sampled completions produces high-variance estimates, making pruning unstable. We propose TReASURe, a tree-search test-time alignment method that addresses these issues. It introduces (i) UnmaskBranch, a branching strategy based on first-hitting unmasking that diversifies both token content and reveal order with a single model call per parent node, and (ii) ResubstituteScore, a pruning rule that uses deterministic resubstitution to score partially masked sequences with low-variance proxy completions. Theoretically, we quantify branching efficiency gains in NFEs (number of function evaluations), show that the scoring rule approximates the true reward with error bounded by predictive uncertainty, and prove improvements with larger tree widths. Empirically, TReASURe achieves state-of-the-art results on perplexity, linguistic acceptability, and control of sentiment and toxicity, outperforming prior methods under matched compute budgets, with especially strong gains in low-NFE regimes.

Title: CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Authors: Prashant Govindarajan, Mathieu Reymond, Antoine Clavaud, Mariano Phielipp, Santiago Miret, Sarath Chandar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23156
Pdf URL: https://arxiv.org/pdf/2509.23156
Copy Paste: [[2509.23156]] CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning(https://arxiv.org/abs/2509.23156)
Keywords: generative
Abstract: In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

Title: Dense associative memory on the Bures-Wasserstein space

Authors: Chandan Tankala, Krishnakumar Balasubramanian
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23162
Pdf URL: https://arxiv.org/pdf/2509.23162
Copy Paste: [[2509.23162]] Dense associative memory on the Bures-Wasserstein space(https://arxiv.org/abs/2509.23162)
Keywords: generative
Abstract: Dense associative memories (DAMs) store and retrieve patterns via energy-functional fixed points, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.

Title: Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction

Authors: Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin, Xinrui Ju, Shiqi Wang, Yibo Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23169
Pdf URL: https://arxiv.org/pdf/2509.23169
Copy Paste: [[2509.23169]] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction(https://arxiv.org/abs/2509.23169)
Keywords: generative
Abstract: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.

Title: From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs

Authors: Haonan Wang, Weida Liang, Zihang Fu, Nie Zheng, Yifan Zhang, Yao Tong, Tongyao Zhu, Hao Jiang, Chuang Li, Jiaying Wu, Kenji Kawaguchi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23196
Pdf URL: https://arxiv.org/pdf/2509.23196
Copy Paste: [[2509.23196]] From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs(https://arxiv.org/abs/2509.23196)
Keywords: in-context
Abstract: Recent reasoning LLMs (RLMs), especially those trained with verifier-based reinforcement learning, often perform worse with few-shot CoT than with direct answering. We revisit this paradox using high-quality reasoning traces from DeepSeek-R1 as demonstrations and find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal. A detailed analysis reveals two mechanisms behind this decline: (i) semantic misguidance, where high textual similarity leads the model to treat the target as the same as the exemplar and to copy intermediate steps verbatim; and (ii) strategy transfer failure, where the model struggles to extract useful reasoning strategies and apply them to target questions. Guided by these, we introduce Insight-to-Solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights and derives a target-specific reasoning trace; optionally, the reasoning is self-refined for coherence and correctness (I2S+). Extensive experiments on diverse benchmarks show that I2S and I2S+ consistently outperform both direct answering and test-time scaling baselines across open- and closed-source models. Even for GPT models, our method helps: on AIME'25, GPT-4.1 rises by +14.0%, and o1-mini improves by +2.7% on AIME and +1.7% on GPQA, indicating that in-context demonstrations can be harnessed effectively via insight-refine-solve framework.

Title: Towards Monotonic Improvement in In-Context Reinforcement Learning

Authors: Wenhao Zhang, Shao Zhang, Xihuai Wang, Yang Li, Ying Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23209
Pdf URL: https://arxiv.org/pdf/2509.23209
Copy Paste: [[2509.23209]] Towards Monotonic Improvement in In-Context Reinforcement Learning(https://arxiv.org/abs/2509.23209)
Keywords: in-context
Abstract: In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming to a continue improved testing time performance. However, our experimental analysis reveals a critical flaw: these models cannot show a continue improvement like the training data during testing time. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model's own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve the Contextual Ambiguity, we introduce Context Value into training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL use Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value could include more task-relevant information, and therefore the ideal performance should be non-decreasing. We prove that the Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We fruther propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and Minigrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at this https URL .

Title: More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression

Authors: Shayan Alahyari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23240
Pdf URL: https://arxiv.org/pdf/2509.23240
Copy Paste: [[2509.23240]] More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression(https://arxiv.org/abs/2509.23240)
Keywords: diffusion
Abstract: In many real-world regression tasks, the data distribution is heavily skewed, and models learn predominantly from abundant majority samples while failing to predict minority labels accurately. While imbalanced classification has been extensively studied, imbalanced regression remains relatively unexplored. Deep imbalanced regression (DIR) represents cases where the input data are high-dimensional and unstructured. Although several data-level approaches for tabular imbalanced regression exist, deep imbalanced regression currently lacks dedicated data-level solutions suitable for high-dimensional data and relies primarily on algorithmic modifications. To fill this gap, we propose LatentDiff, a novel framework that uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space. LatentDiff is computationally efficient and applicable across diverse data modalities, including images, text, and other high-dimensional inputs. Experiments on three DIR benchmarks demonstrate substantial improvements in minority regions while maintaining overall accuracy.

Title: OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Authors: Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23258
Pdf URL: https://arxiv.org/pdf/2509.23258
Copy Paste: [[2509.23258]] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting(https://arxiv.org/abs/2509.23258)
Keywords: diffusion, generative
Abstract: Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.

Title: CREPE: Controlling Diffusion with Replica Exchange

Authors: Jiajun He, Paul Jeha, Peter Potaptchik, Leo Zhang, José Miguel Hernández-Lobato, Yuanqi Du, Saifuddin Syed, Francisco Vargas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23265
Pdf URL: https://arxiv.org/pdf/2509.23265
Copy Paste: [[2509.23265]] CREPE: Controlling Diffusion with Replica Exchange(https://arxiv.org/abs/2509.23265)
Keywords: diffusion
Abstract: Inference-time control of diffusion models aims to steer model outputs to satisfy new constraints without retraining. Previous approaches have mostly relied on heuristic guidance or have been coupled with Sequential Monte Carlo (SMC) for bias correction. In this paper, we propose a flexible alternative based on replica exchange, an algorithm designed initially for sampling problems. We refer to this method as the CREPE (Controlling with REPlica Exchange). Unlike SMC, CREPE: (1) generates particles sequentially, (2) maintains high diversity in the generated samples after a burn-in period, and (3) enables online refinement or early termination. We demonstrate its versatility across various tasks, including temperature annealing, reward-tilting, model composition and classifier-free guidance debiasing, with competitive performance compared to prior SMC methods.

Title: SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Authors: Yihao Ding, Soyeon Caren Han, Yanbei Jiang, Yan Li, Zechuan Li, Yifan Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23273
Pdf URL: https://arxiv.org/pdf/2509.23273
Copy Paste: [[2509.23273]] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction(https://arxiv.org/abs/2509.23273)
Keywords: generative
Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.

Title: A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Authors: Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, Albert No
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23286
Pdf URL: https://arxiv.org/pdf/2509.23286
Copy Paste: [[2509.23286]] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models(https://arxiv.org/abs/2509.23286)
Keywords: diffusion
Abstract: Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination.

Title: Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection

Authors: Minsun Jeon, Simon S. Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23289
Pdf URL: https://arxiv.org/pdf/2509.23289
Copy Paste: [[2509.23289]] Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection(https://arxiv.org/abs/2509.23289)
Keywords: generative
Abstract: The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: this https URL

Title: Seeing the Unseen in Low-light Spike Streams

Authors: Liwen Hu, Yang Li, Mianzhi Liu, Yijia Guo, Shenghao Xie, Ziluo Ding, Tiejun Huang, Lei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23304
Pdf URL: https://arxiv.org/pdf/2509.23304
Copy Paste: [[2509.23304]] Seeing the Unseen in Low-light Spike Streams(https://arxiv.org/abs/2509.23304)
Keywords: diffusion, generative
Abstract: Spike camera, a type of neuromorphic sensor with high-temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, spike camera continuously accumulates photons and fires asynchronous spike streams. Due to unique data modality, spike streams require reconstruction methods to become perceptible to the human eye. However, lots of methods struggle to handle spike streams in low-light high-speed scenarios due to severe noise and sparse information. In this work, we propose Diff-SPK, the first diffusion-based reconstruction method for spike camera. Diff-SPK effectively leverages generative priors to supplement texture information in low-light conditions. Specifically, it first employs an \textbf{E}nhanced \textbf{T}exture \textbf{f}rom Inter-spike \textbf{I}nterval (ETFI) to aggregate sparse information from low-light spike streams. Then, ETFI serves as a conditioning input for ControlNet to generate the high-speed scenes. To improve the quality of results, we introduce an ETFI-based feature fusion module during the generation process. Moreover, we establish the first bona fide benchmark for the low-light spike stream reconstruction task. It significantly surpasses existing reconstruction datasets in scale and provides quantitative illumination information. The performance on real low-light spike streams demonstrates the superiority of Diff-SPK.

Title: Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification

Authors: Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23310
Pdf URL: https://arxiv.org/pdf/2509.23310
Copy Paste: [[2509.23310]] Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification(https://arxiv.org/abs/2509.23310)
Keywords: diffusion
Abstract: Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at this https URL.

Title: Learning to Reason in Structured In-context Environments with Reinforcement Learning

Authors: Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang, Ying Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23330
Pdf URL: https://arxiv.org/pdf/2509.23330
Copy Paste: [[2509.23330]] Learning to Reason in Structured In-context Environments with Reinforcement Learning(https://arxiv.org/abs/2509.23330)
Keywords: in-context
Abstract: Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays a important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the \textbf{S}tructured \textbf{I}n-context \textbf{E}nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.

Title: Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport

Authors: Xavier Aramayo Carrasco, Grigoriy Ksenofontov, Aleksei Leonov, Iaroslav Sergeevich Koshelev, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23348
Pdf URL: https://arxiv.org/pdf/2509.23348
Copy Paste: [[2509.23348]] Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport(https://arxiv.org/abs/2509.23348)
Keywords: diffusion, generative
Abstract: The Entropic Optimal Transport (EOT) problem and its dynamic counterpart, the Schrödinger bridge (SB) problem, play an important role in modern machine learning, linking generative modeling with optimal transport theory. While recent advances in discrete diffusion and flow models have sparked growing interest in applying SB methods to discrete domains, there is still no reliable way to evaluate how well these methods actually solve the underlying problem. We address this challenge by introducing a benchmark for SB on discrete spaces. Our construction yields pairs of probability distributions with analytically known SB solutions, enabling rigorous evaluation. As a byproduct of building this benchmark, we obtain two new SB algorithms, DLightSB and DLightSB-M, and additionally extend prior related work to construct the $\alpha$-CSBM algorithm. We demonstrate the utility of our benchmark by evaluating both existing and new solvers in high-dimensional discrete settings. This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies.

Title: Landing with the Score: Riemannian Optimization through Denoising

Authors: Andrey Kharitenko, Zebang Shen, Riccardo de Santi, Niao He, Florian Doerfler
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23357
Pdf URL: https://arxiv.org/pdf/2509.23357
Copy Paste: [[2509.23357]] Landing with the Score: Riemannian Optimization through Denoising(https://arxiv.org/abs/2509.23357)
Keywords: diffusion, generative
Abstract: Under the data manifold hypothesis, high-dimensional data are concentrated near a low-dimensional manifold. We study the problem of Riemannian optimization over such manifolds when they are given only implicitly through the data distribution, and the standard manifold operations required by classical algorithms are unavailable. This formulation captures a broad class of data-driven design problems that are central to modern generative AI. Our key idea is to introduce a link function that connects the data distribution to the geometric operations needed for optimization. We show that this function enables the recovery of essential manifold operations, such as retraction and Riemannian gradient computation. Moreover, we establish a direct connection between our construction and the score function in diffusion models of the data distribution. This connection allows us to leverage well-studied parameterizations, efficient training procedures, and even pretrained score networks from the diffusion model literature to perform optimization. Building on this foundation, we propose two efficient inference-time algorithms -- Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD) -- and provide theoretical guarantees for both feasibility (approximate manifold adherence) and optimality (small Riemannian gradient norm). Finally, we demonstrate the effectiveness of our approach on finite-horizon reference tracking tasks in data-driven control, highlighting its potential for practical generative and design applications.

Title: UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation

Authors: Jinghong Zheng, Changlong Jiang, Jiaqi Li, Haohong Kuang, Hang Xu, Tingbing Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23376
Pdf URL: https://arxiv.org/pdf/2509.23376
Copy Paste: [[2509.23376]] UniPose: Unified Cross-modality Pose Prior Propagation towards RGB-D data for Weakly Supervised 3D Human Pose Estimation(https://arxiv.org/abs/2509.23376)
Keywords: self-supervised
Abstract: In this paper, we present UniPose, a unified cross-modality pose prior propagation method for weakly supervised 3D human pose estimation (HPE) using unannotated single-view RGB-D sequences (RGB, depth, and point cloud data). UniPose transfers 2D HPE annotations from large-scale RGB datasets (e.g., MS COCO) to the 3D domain via self-supervised learning on easily acquired RGB-D sequences, eliminating the need for labor-intensive 3D keypoint annotations. This approach bridges the gap between 2D and 3D domains without suffering from issues related to multi-view camera calibration or synthetic-to-real data shifts. During training, UniPose leverages off-the-shelf 2D pose estimations as weak supervision for point cloud networks, incorporating spatial-temporal constraints like body symmetry and joint motion. The 2D-to-3D back-projection loss and cross-modality interaction further enhance this process. By treating the point cloud network's 3D HPE results as pseudo ground truth, our anchor-to-joint prediction method performs 3D lifting on RGB and depth networks, making it more robust against inaccuracies in 2D HPE results compared to state-of-the-art methods. Experiments on CMU Panoptic and ITOP datasets show that UniPose achieves comparable performance to fully supervised methods. Incorporating large-scale unlabeled data (e.g., NTU RGB+D 60) enhances its performance under challenging conditions, demonstrating its potential for practical applications. Our proposed 3D lifting method also achieves state-of-the-art results.

Title: Generative Modeling of Shape-Dependent Self-Contact Human Poses

Authors: Takehiko Ohkawa, Jihyun Lee, Shunsuke Saito, Jason Saragih, Fabian Prado, Yichen Xu, Shoou-I Yu, Ryosuke Furuta, Yoichi Sato, Takaaki Shiratori
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23393
Pdf URL: https://arxiv.org/pdf/2509.23393
Copy Paste: [[2509.23393]] Generative Modeling of Shape-Dependent Self-Contact Human Poses(https://arxiv.org/abs/2509.23393)
Keywords: diffusion, generative
Abstract: One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.

Title: WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Authors: Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23402
Pdf URL: https://arxiv.org/pdf/2509.23402
Copy Paste: [[2509.23402]] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving(https://arxiv.org/abs/2509.23402)
Keywords: diffusion, generative
Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose \textbf{WorldSplat}, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: ((i)) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. ((ii)) Subsequently, we refine the novel view videos rendered from these Gaussians using a enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that \textbf{WorldSplat} effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.

Title: Planner Aware Path Learning in Diffusion Language Models Training

Authors: Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Avishek Joey Bose, Alexander Tong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23405
Pdf URL: https://arxiv.org/pdf/2509.23405
Copy Paste: [[2509.23405]] Planner Aware Path Learning in Diffusion Language Models Training(https://arxiv.org/abs/2509.23405)
Keywords: diffusion
Abstract: Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation.

Title: Comparison of Scoring Rationales Between Large Language Models and Human Raters

Authors: Haowei Hua (1), Hong Jiao (2), Dan Song (3) ((1) Princeton University, (2) University of Maryland, (3) University of Iowa)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23412
Pdf URL: https://arxiv.org/pdf/2509.23412
Copy Paste: [[2509.23412]] Comparison of Scoring Rationales Between Large Language Models and Human Raters(https://arxiv.org/abs/2509.23412)
Keywords: generative
Abstract: Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and ``thinking'' of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.

Title: FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Authors: Tanawan Premsri, Parisa Kordjamshidi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23452
Pdf URL: https://arxiv.org/pdf/2509.23452
Copy Paste: [[2509.23452]] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing(https://arxiv.org/abs/2509.23452)
Keywords: diffusion
Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in Multimodal Language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. For-Sale evaluates the alignment between a given text and an initially generated image, and refines the image based on the Frame of Reference specified in the spatial expressions. It employs vision modules to extract the spatial configuration of the image, while simultaneously mapping the spatial expression to a corresponding camera perspective. This unified perspective enables direct evaluation of alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction.

Title: 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras

Authors: Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23455
Pdf URL: https://arxiv.org/pdf/2509.23455
Copy Paste: [[2509.23455]] 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras(https://arxiv.org/abs/2509.23455)
Keywords: self-supervised
Abstract: Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Learning

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23462
Pdf URL: https://arxiv.org/pdf/2509.23462
Copy Paste: [[2509.23462]] Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Learning(https://arxiv.org/abs/2509.23462)
Keywords: generative
Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, like Policy-Space Response Oracles, PSRO, require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of Two-player and Multi-Player games such as the Deceptive Messages Game, Kuhn Poker and Multi-Particle environment. We find that GEMS is up to ~6x faster, has 1.3x less memory usage than PSRO, while also reaps higher rewards simultaneously. These results demonstrate that GEMS retains the game theoretic guarantees of PSRO, while overcoming its fundamental inefficiencies, hence enabling scalable multi-agent learning in multiple domains.

Title: Memory-Efficient Fine-Tuning via Low-Rank Activation Compression

Authors: Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23472
Pdf URL: https://arxiv.org/pdf/2509.23472
Copy Paste: [[2509.23472]] Memory-Efficient Fine-Tuning via Low-Rank Activation Compression(https://arxiv.org/abs/2509.23472)
Keywords: foundation model
Abstract: The parameter-efficient fine-tuning paradigm has garnered significant attention with the advancement of foundation models. Although numerous methods have been proposed to reduce the number of trainable parameters, their substantial memory overhead remains a critical bottleneck that hinders practical deployment. In this paper, we observe that model activations constitute a major source of memory consumption, especially under large batch sizes and long context lengths; however, the rank of the activations remains consistently low. Motivated by this insight, we propose a memory-efficient fine-tuning approach Low-Rank Activation Compression (LoRAct). Unlike prior work, LoRAct provides a more flexible and versatile compressing strategy that can be applied online during the forward pass without the need for any calibration data. Moreover, LoRAct incorporates a novel sampling-based orthogonal decomposition algorithm specifically designed for low-rank matrices, offering improved computational efficiency and a tighter error bound compared to the widely used RSVD. Experiments on both vision and language tasks demonstrate the effectiveness of LoRAct. Notably, LoRAct further reduces activation memory by approximately 80% in comparison with the widely adopted LoRA method, while maintaining competitive performance. The source code is available at this https URL.

Title: RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation

Authors: Shourya Verma, Mengbo Wang, Nadia Atallah Lanman, Ananth Grama
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23480
Pdf URL: https://arxiv.org/pdf/2509.23480
Copy Paste: [[2509.23480]] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation(https://arxiv.org/abs/2509.23480)
Keywords: diffusion, generative
Abstract: Current approaches for restoration of degraded images face a critical trade-off: high-performance models are too slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose 'RestoRect', a novel Latent Rectified Flow Feature Distillation method for restoring degraded images. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints, and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability, and faster convergence and inference while preserving restoration quality. We demonstrate superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.

Title: The Impact of Role Design in In-Context Learning for Large Language Models

Authors: Hamidreza Rouzegar, Masoud Makrehchi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23501
Pdf URL: https://arxiv.org/pdf/2509.23501
Copy Paste: [[2509.23501]] The Impact of Role Design in In-Context Learning for Large Language Models(https://arxiv.org/abs/2509.23501)
Keywords: in-context
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models' performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.

Title: Disentanglement of Variations with Multimodal Generative Modeling

Authors: Yijie Zhang, Yiyang Shen, Weiran Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23548
Pdf URL: https://arxiv.org/pdf/2509.23548
Copy Paste: [[2509.23548]] Disentanglement of Variations with Multimodal Generative Modeling(https://arxiv.org/abs/2509.23548)
Keywords: diffusion, generative
Abstract: Multimodal data are prevalent across various domains, and learning robust representations of such data is paramount to enhancing generation quality and downstream task performance. To handle heterogeneity and interconnections among different modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle with challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue, with rigorous mutual information-based regularizations, including cross-view mutual information maximization for extracting shared variables, and a cycle-consistency style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.

Title: Towards Interpretable Visual Decoding with Attention to Brain Representations

Authors: Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, Nikolaus Kriegeskorte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23566
Pdf URL: https://arxiv.org/pdf/2509.23566
Copy Paste: [[2509.23566]] Towards Interpretable Visual Decoding with Attention to Brain Representations(https://arxiv.org/abs/2509.23566)
Keywords: diffusion, generative
Abstract: Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, helping brain science researchers interpret how the brain represents real-world scenes. However, most current approaches leverage mapping brain signals into intermediate image or text feature spaces before guiding the generative process, masking the effect of contributions from different brain areas on the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals shape the generation process. To this end, we contribute an Image-Brain BI-directional interpretability framework (IBBI) which investigates cross-attention mechanisms across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our results highlight the potential of end-to-end brain-to-image decoding and establish a path toward interpreting diffusion models through the lens of visual neuroscience.

Title: RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

Authors: Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23582
Pdf URL: https://arxiv.org/pdf/2509.23582
Copy Paste: [[2509.23582]] RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization(https://arxiv.org/abs/2509.23582)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at this https URL .

Title: VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Authors: Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23584
Pdf URL: https://arxiv.org/pdf/2509.23584
Copy Paste: [[2509.23584]] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement(https://arxiv.org/abs/2509.23584)
Keywords: diffusion, generative
Abstract: Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.

Title: Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models

Authors: Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23593
Pdf URL: https://arxiv.org/pdf/2509.23593
Copy Paste: [[2509.23593]] Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models(https://arxiv.org/abs/2509.23593)
Keywords: diffusion
Abstract: Catastrophic forgetting remains a central obstacle for continual learning in neural models. Popular approaches -- replay and elastic weight consolidation (EWC) -- have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation. In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data. We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient. Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction. We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift. On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is roughly halved on ImageNet-1k. These results suggest that diffusion models admit an approximately rank-1 Fisher. With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.

Title: MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising

Authors: Tangtangfang Fang, Jingxi Hu, Xiangjian He, Jiaqi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23603
Pdf URL: https://arxiv.org/pdf/2509.23603
Copy Paste: [[2509.23603]] MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising(https://arxiv.org/abs/2509.23603)
Keywords: diffusion, generative
Abstract: While diffusion models have set a new benchmark for quality in Low-Dose Computed Tomography (LDCT) denoising, their clinical adoption is critically hindered by extreme computational costs, with inference times often exceeding thousands of seconds per scan. To overcome this barrier, we introduce MAN, a Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising task. Our method operates in a compressed latent space via a perceptually-optimized autoencoder, enabling an attention-based conditional U-Net to perform the fast, deterministic conditional denoising diffusion process with drastically reduced overhead. On the LDCT and Projection dataset, our model achieves superior perceptual quality, surpassing CNN/GAN-based methods while rivaling the reconstruction fidelity of computationally heavy diffusion models like DDPM and Dn-Dp. Most critically, in the inference stage, our model is over 60x faster than representative pixel space diffusion denoisers, while remaining competitive on PSNR/SSIM scores. By bridging the gap between high fidelity and clinical viability, our work demonstrates a practical path forward for advanced generative models in medical imaging.

Title: VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Authors: Zeren Xiong, Yue Yu, Zedong Zhang, Shuo Chen, Jian Yang, Jun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23605
Pdf URL: https://arxiv.org/pdf/2509.23605
Copy Paste: [[2509.23605]] VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis(https://arxiv.org/abs/2509.23605)
Keywords: diffusion
Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.

Title: BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images

Authors: Cheng Huang, Weizheng Xie, Fan Gao, Yutong Liu, Ruoling Wu, Zeyu Han, Jingxi Qiu, Xiangxiang Wang, Zhenglin Yang, Hao Wang, Yongbin Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23617
Pdf URL: https://arxiv.org/pdf/2509.23617
Copy Paste: [[2509.23617]] BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images(https://arxiv.org/abs/2509.23617)
Keywords: generative
Abstract: Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: this https URL.

Title: DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Authors: Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23624
Pdf URL: https://arxiv.org/pdf/2509.23624
Copy Paste: [[2509.23624]] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation(https://arxiv.org/abs/2509.23624)
Keywords: diffusion, generative
Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.

Title: RIV: Recursive Introspection Mask Diffusion Vision Language Model

Authors: YuQian Li, Limeng Qiao, Lin Ma
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23625
Pdf URL: https://arxiv.org/pdf/2509.23625
Copy Paste: [[2509.23625]] RIV: Recursive Introspection Mask Diffusion Vision Language Model(https://arxiv.org/abs/2509.23625)
Keywords: diffusion
Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes, but more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps to identify errors in the output sequence and remask them. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.

Title: Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

Authors: Beomseok Kang, Niluthpol Chowdhury Mithun, Mikhail Sizintsev, Han-Pang Chiu, Supun Samarasekera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23626
Pdf URL: https://arxiv.org/pdf/2509.23626
Copy Paste: [[2509.23626]] Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models(https://arxiv.org/abs/2509.23626)
Keywords: foundation model
Abstract: Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that bridges this gap by leveraging Vision Foundation Models (VFMs) as powerful teachers. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10$\times$ smaller than foundation models, highlighting FAMDA's suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

Title: LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders

Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23639
Pdf URL: https://arxiv.org/pdf/2509.23639
Copy Paste: [[2509.23639]] LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders(https://arxiv.org/abs/2509.23639)
Keywords: diffusion
Abstract: This paper explores a novel lightweight approach LightFair to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To finetune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at this https URL.

Title: Griffin: Generative Reference and Layout Guided Image Composition

Authors: Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23643
Pdf URL: https://arxiv.org/pdf/2509.23643
Copy Paste: [[2509.23643]] Griffin: Generative Reference and Layout Guided Image Composition(https://arxiv.org/abs/2509.23643)
Keywords: generative
Abstract: Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

Title: Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Authors: Zemin Huang, Yuhang Wang, Zhiyang Chen, Guo-Jun Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23653
Pdf URL: https://arxiv.org/pdf/2509.23653
Copy Paste: [[2509.23653]] Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models(https://arxiv.org/abs/2509.23653)
Keywords: diffusion
Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose \emph{\underline{Rem}asking-\underline{e}nabled \underline{Di}ffusion Language Model (RemeDi}, a mask-based DLM that introduces \emph{remasking} as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens to be unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning which teaches the model to detect and remask incorrect tokens in addition to predict mask tokens, and reinforcement learning which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves the state-of-the-art results among open-source DLMs on multiple datasets.

Title: Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs

Authors: Shulin Huang, Yiran Ding, Junshu Pan, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.23657
Pdf URL: https://arxiv.org/pdf/2509.23657
Copy Paste: [[2509.23657]] Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs(https://arxiv.org/abs/2509.23657)
Keywords: foundation model
Abstract: Enhancing the complex reasoning capabilities of Large Language Models (LLMs) attracts widespread attention. While reinforcement learning (RL) has shown superior performance for improving complex reasoning, its impact on cross-lingual generalization compared to Supervised Fine-Tuning (SFT) remains unexplored. We present the first systematic investigation into cross-lingual reasoning generalization of RL and SFT. Using Qwen2.5-3B-Base as our foundation model, we conduct experiments on diverse multilingual reasoning benchmarks, including math reasoning, commonsense reasoning, and scientific reasoning. Our investigation yields two significant findings: (1) Tuning with RL not only achieves higher accuracy but also demonstrates substantially stronger cross-lingual generalization capabilities compared to SFT. (2) RL training on non-English data yields better overall performance and generalization than training on English data, which is not observed with SFT. Furthermore, through comprehensive mechanistic analyses, we explore the underlying factors of RL's superiority and generalization across languages. Our results provide compelling evidence that RL enables the model with more robust reasoning strategies, offering crucial guidance for more equitable and effective multilingual reasoning.

Title: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23681
Pdf URL: https://arxiv.org/pdf/2509.23681
Copy Paste: [[2509.23681]] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification(https://arxiv.org/abs/2509.23681)
Keywords: diffusion
Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released in this https URL.

Title: Estimating Time Series Foundation Model Transferability via In-Context Learning

Authors: Qingren Yao, Ming Jin, Chengqi Zhang, Chao-Han Huck Yang, Jun Qi, Shirui Pan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23695
Pdf URL: https://arxiv.org/pdf/2509.23695
Copy Paste: [[2509.23695]] Estimating Time Series Foundation Model Transferability via In-Context Learning(https://arxiv.org/abs/2509.23695)
Keywords: foundation model, in-context
Abstract: Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training, yet fine-tuning remains critical for boosting performance in domains with limited public data. With the growing number of TSFMs, efficiently identifying the best model for downstream fine-tuning becomes increasingly challenging. In this work, we introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem: given observations on known (source) datasets, it predicts how a TSFM will perform after fine-tuning on a downstream (target) dataset. TimeTic flexibly organizes the observed model-data relationships as contextual information, allowing it to adapt seamlessly to various test-time scenarios. Leveraging the natural tabular structure formed by dataset meta-features, model characteristics, and fine-tuned performance, we employ tabular foundation models to serve as in-context learners. We further introduce a novel model characterization based on entropy evolution across model layers, capturing embedding-space distinctions and enabling TimeTic to generalize across arbitrary model sets. We establish a comprehensive benchmark for transferability estimation including 10 datasets, 10 foundation models, and 3 forecasting tasks. On this benchmark, TimeTic's estimation demonstrates strong alignment with actual fine-tuned performance for previously unseen datasets, achieving a mean rank correlation of approximately 0.6 and a 30% improvement compared to using zero-shot performance as the transferability score.

Title: CrimEdit: Controllable Editing for Counterfactual Object Removal, Insertion, and Movement

Authors: Boseong Jeon, Junghyuk Lee, Jimin Park, Kwanyoung Kim, Jingi Jung, Sangwon Lee, Hyunbo Shim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23708
Pdf URL: https://arxiv.org/pdf/2509.23708
Copy Paste: [[2509.23708]] CrimEdit: Controllable Editing for Counterfactual Object Removal, Insertion, and Movement(https://arxiv.org/abs/2509.23708)
Keywords: diffusion
Abstract: Recent works on object removal and insertion have enhanced their performance by handling object effects such as shadows and reflections, using diffusion models trained on counterfactual datasets. However, the performance impact of applying classifier-free guidance to handle object effects across removal and insertion tasks within a unified model remains largely unexplored. To address this gap and improve efficiency in composite editing, we propose CrimEdit, which jointly trains the task embeddings for removal and insertion within a single model and leverages them in a classifier-free guidance scheme -- enhancing the removal of both objects and their effects, and enabling controllable synthesis of object effects during insertion. CrimEdit also extends these two task prompts to be applied to spatially distinct regions, enabling object movement (repositioning) within a single denoising step. By employing both guidance techniques, extensive experiments show that CrimEdit achieves superior object removal, controllable effect insertion, and efficient object movement without requiring additional training or separate removal and insertion stages.

Title: PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease

Authors: Shuai Shao, Shu Jiang, Shiyuan Zhao, Di Yang, Yan Wang, Yutong Bai, Jianguo Zhang, Jiangtao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23719
Pdf URL: https://arxiv.org/pdf/2509.23719
Copy Paste: [[2509.23719]] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease(https://arxiv.org/abs/2509.23719)
Keywords: generative
Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86\% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

Title: DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion

Authors: Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23723
Pdf URL: https://arxiv.org/pdf/2509.23723
Copy Paste: [[2509.23723]] DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion(https://arxiv.org/abs/2509.23723)
Keywords: diffusion, generative
Abstract: Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDM' s powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.

Title: M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Authors: Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23728
Pdf URL: https://arxiv.org/pdf/2509.23728
Copy Paste: [[2509.23728]] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation(https://arxiv.org/abs/2509.23728)
Keywords: diffusion
Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

Title: ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning

Authors: Xincheng Yao, Chao Shi, Muming Zhao, Guangtao Zhai, Chongyang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23741
Pdf URL: https://arxiv.org/pdf/2509.23741
Copy Paste: [[2509.23741]] ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning(https://arxiv.org/abs/2509.23741)
Keywords: anomaly
Abstract: This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied for new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, namely, feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere for enabling the feature scales of different classes as consistent as possible. Furthermore, we propose a novel logbarrier bidirectional contraction OCC loss and vector quantization based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at this https URL.

Title: UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

Authors: Xinyang Song, Libin Wang, Weining Wang, Shaozhen Liu, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23760
Pdf URL: https://arxiv.org/pdf/2509.23760
Copy Paste: [[2509.23760]] UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception(https://arxiv.org/abs/2509.23760)
Keywords: diffusion
Abstract: The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

Title: GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Authors: Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23770
Pdf URL: https://arxiv.org/pdf/2509.23770
Copy Paste: [[2509.23770]] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning(https://arxiv.org/abs/2509.23770)
Keywords: generative
Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at this https URL.

Title: Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution

Authors: Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xinyu Zhou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23774
Pdf URL: https://arxiv.org/pdf/2509.23774
Copy Paste: [[2509.23774]] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution(https://arxiv.org/abs/2509.23774)
Keywords: generative
Abstract: Vector-quantized based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with nearest codebook items and train index predictor with code-level supervision. Due to the richness of visual signal, VQ encoding often leads to large quantization error. Furthermore, training predictor with code-level supervision can not take the final reconstruction errors into consideration, result in sub-optimal prior modeling accuracy. In this paper we address the above two issues and propose a Texture Vector-Quantization and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the task character of super-resolution and only introduce codebook to model the prior of missing textures. While the reconstruction aware prediction strategy makes use of the straight-through estimator to directly train index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) is able to deliver photo-realistic SR results with small computational cost.

Title: Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression

Authors: Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23779
Pdf URL: https://arxiv.org/pdf/2509.23779
Copy Paste: [[2509.23779]] Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression(https://arxiv.org/abs/2509.23779)
Keywords: foundation model, in-context
Abstract: State-space models (SSMs), particularly Mamba, emerge as an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate Mamba's in-context learning (ICL) capabilities competitive with Transformers, a critical capacity for large foundation models. However, theoretical understanding of Mamba's ICL remains limited, restricting deeper insights into its underlying mechanisms. Even fundamental tasks such as linear regression ICL, widely studied as a standard theoretical benchmark for Transformers, have not been thoroughly analyzed in the context of Mamba. To address this gap, we study the training dynamics of Mamba on the linear regression ICL task. By developing novel techniques tackling non-convex optimization with gradient descent related to Mamba's structure, we establish an exponential convergence rate to ICL solution, and derive a loss bound that is comparable to Transformer's. Importantly, our results reveal that Mamba can perform a variant of \textit{online gradient descent} to learn the latent function in context. This mechanism is different from that of Transformer, which is typically understood to achieve ICL through gradient descent emulation. The theoretical results are verified by experimental simulation.

Title: Space Group Conditional Flow Matching

Authors: Omri Puny, Yaron Lipman, Benjamin Kurt Miller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23822
Pdf URL: https://arxiv.org/pdf/2509.23822
Copy Paste: [[2509.23822]] Space Group Conditional Flow Matching(https://arxiv.org/abs/2509.23822)
Keywords: generative
Abstract: Inorganic crystals are periodic, highly-symmetric arrangements of atoms in three-dimensional space. Their structures are constrained by the symmetry operations of a crystallographic \emph{space group} and restricted to lie in specific affine subspaces known as \emph{Wyckoff positions}. The frequency an atom appears in the crystal and its rough positioning are determined by its Wyckoff position. Most generative models that predict atomic coordinates overlook these symmetry constraints, leading to unrealistically high populations of proposed crystals exhibiting limited symmetry. We introduce Space Group Conditional Flow Matching, a novel generative framework that samples significantly closer to the target population of highly-symmetric, stable crystals. We achieve this by conditioning the entire generation process on a given space group and set of Wyckoff positions; specifically, we define a conditionally symmetric noise base distribution and a group-conditioned, equivariant, parametric vector field that restricts the motion of atoms to their initial Wyckoff position. Our form of group-conditioned equivariance is achieved using an efficient reformulation of \emph{group averaging} tailored for symmetric crystals. Importantly, it reduces the computational overhead of symmetrization to a negligible level. We achieve state of the art results on crystal structure prediction and de novo generation benchmarks. We also perform relevant ablations.

Title: Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation

Authors: Hanyu Zhou, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23828
Pdf URL: https://arxiv.org/pdf/2509.23828
Copy Paste: [[2509.23828]] Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation(https://arxiv.org/abs/2509.23828)
Keywords: diffusion
Abstract: Vision-language models (VLMs) have demonstrated strong performance in 2D scene understanding and generation, but extending this unification to the physical world remains an open challenge. Existing 3D and 4D approaches typically embed scene geometry into autoregressive model for semantic understanding and diffusion model for content generation. This paradigm gap prevents a single model from jointly handling both tasks, especially in dynamic 4D settings where spatiotemporal modeling is critical. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation. Our design is guided by two key insights: 1) Unification requires a shared representation. We extract semantic features for understanding and noisy-injected appearance features for generation, incorporate 4D geometric cues, and fuse them into a spatiotemporal-aware visual representation through adaptive cross-attention. 2) Unification requires a shared architecture. Both autoregression and diffusion are built on Transformer backbones, and this enables integration into a single LLM with task-specific heads. By aligning visual and linguistic representations, our Uni4D-LLM produces predictions for both understanding and generation within one Transformer-based framework. We further apply instruction fine-tuning on diverse 4D vision-language datasets to improve generalization across tasks. Extensive experiments on multiple benchmarks demonstrate that Uni4D-LLM achieves competitive or superior results compared to state-of-the-art models and offers the first true unification of 4D scene understanding and generation.

Title: Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Authors: Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23841
Pdf URL: https://arxiv.org/pdf/2509.23841
Copy Paste: [[2509.23841]] Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric(https://arxiv.org/abs/2509.23841)
Keywords: generative
Abstract: Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at this https URL.

Title: Adversarial Diffusion for Robust Reinforcement Learning

Authors: Daniele Foffano, Alessio Russo, Alexandre Proutiere
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23846
Pdf URL: https://arxiv.org/pdf/2509.23846
Copy Paste: [[2509.23846]] Adversarial Diffusion for Robust Reinforcement Learning(https://arxiv.org/abs/2509.23846)
Keywords: diffusion
Abstract: Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories "all at once", mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.

Title: Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction

Authors: Guoquan Wei, Zekun Zhou, Liu Shi, Wenzhe Shan, Qiegen Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23885
Pdf URL: https://arxiv.org/pdf/2509.23885
Copy Paste: [[2509.23885]] Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction(https://arxiv.org/abs/2509.23885)
Keywords: diffusion, self-supervised
Abstract: Current models based on deep learning for low-dose CT denoising rely heavily on paired data and generalize poorly. Even the more concerned diffusion models need to learn the distribution of clean data for reconstruction, which is difficult to satisfy in medical clinical applications. At the same time, self-supervised-based methods face the challenge of significant degradation of generalizability of models pre-trained for the current dose to expand to other doses. To address these issues, this paper proposes a novel method of tunable-generalization diffusion powered by self-supervised contextual sub-data for low-dose CT reconstruction, named SuperDiff. Firstly, a contextual subdata similarity adaptive sensing strategy is designed for denoising centered on the LDCT projection domain, which provides an initial prior for the subsequent progress. Subsequently, the initial prior is used to combine knowledge distillation with a deep combination of latent diffusion models for optimizing image details. The pre-trained model is used for inference reconstruction, and the pixel-level self-correcting fusion technique is proposed for fine-grained reconstruction of the image domain to enhance the image fidelity, using the initial prior and the LDCT image as a guide. In addition, the technique is flexibly applied to the generalization of upper and lower doses or even unseen doses. Dual-domain strategy cascade for self-supervised LDCT denoising, SuperDiff requires only LDCT projection domain data for training and testing. Full qualitative and quantitative evaluations on both datasets and real data show that SuperDiff consistently outperforms existing state-of-the-art methods in terms of reconstruction and generalization performance.

Title: EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging

Authors: Anoushka Harit, William Prew, Zhongtian Sun, Florian Markowetz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23906
Pdf URL: https://arxiv.org/pdf/2509.23906
Copy Paste: [[2509.23906]] EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging(https://arxiv.org/abs/2509.23906)
Keywords: diffusion, foundation model
Abstract: Medical imaging foundation models must adapt over time, yet full retraining is often blocked by privacy constraints and cost. We present a continual learning framework that avoids storing patient exemplars by pairing class conditional diffusion replay with Elastic Weight Consolidation. Using a compact Vision Transformer backbone, we evaluate across eight MedMNIST v2 tasks and CheXpert. On CheXpert our approach attains 0.851 AUROC, reduces forgetting by more than 30\% relative to DER\texttt{++}, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy preserving. Analyses connect forgetting to two measurable factors: fidelity of replay and Fisher weighted parameter drift, highlighting the complementary roles of replay diffusion and synaptic stability. The results indicate a practical route for scalable, privacy aware continual adaptation of clinical imaging models.

Title: EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23909
Pdf URL: https://arxiv.org/pdf/2509.23909
Copy Paste: [[2509.23909]] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling(https://arxiv.org/abs/2509.23909)
Keywords: generative
Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of learning proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.

Title: MoReact: Generating Reactive Motion from Textual Descriptions

Authors: Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23911
Pdf URL: https://arxiv.org/pdf/2509.23911
Copy Paste: [[2509.23911]] MoReact: Generating Reactive Motion from Textual Descriptions(https://arxiv.org/abs/2509.23911)
Keywords: diffusion
Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model specifically generates realistic motion sequences for individuals that responding to the other's actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at this https URL.

Title: Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis

Authors: Yihang Guo, Tianyuan Yu, Liang Bai, Yanming Guo, Yirun Ruan, William Li, Weishi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23915
Pdf URL: https://arxiv.org/pdf/2509.23915
Copy Paste: [[2509.23915]] Revisit the Imbalance Optimization in Multi-task Learning: An Experimental Analysis(https://arxiv.org/abs/2509.23915)
Keywords: foundation model
Abstract: Multi-task learning (MTL) aims to build general-purpose vision systems by training a single network to perform multiple tasks jointly. While promising, its potential is often hindered by "unbalanced optimization", where task interference leads to subpar performance compared to single-task models. To facilitate research in MTL, this paper presents a systematic experimental analysis to dissect the factors contributing to this persistent problem. Our investigation confirms that the performance of existing optimization methods varies inconsistently across datasets, and advanced architectures still rely on costly grid-searched loss weights. Furthermore, we show that while powerful Vision Foundation Models (VFMs) provide strong initialization, they do not inherently resolve the optimization imbalance, and merely increasing data quantity offers limited benefits. A crucial finding emerges from our analysis: a strong correlation exists between the optimization imbalance and the norm of task-specific gradients. We demonstrate that this insight is directly applicable, showing that a straightforward strategy of scaling task losses according to their gradient norms can achieve performance comparable to that of an extensive and computationally expensive grid search. Our comprehensive analysis suggests that understanding and controlling gradient dynamics is a more direct path to stable MTL than developing increasingly complex methods.

Title: Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Authors: Longtao Jiang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23919
Pdf URL: https://arxiv.org/pdf/2509.23919
Copy Paste: [[2509.23919]] Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models(https://arxiv.org/abs/2509.23919)
Keywords: diffusion
Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during the MAR generation, and leverage this to design \textbf{Token Painter}, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while keeping harmonious with background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the content visual quality. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results. Codes will be released.

Title: Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23924
Pdf URL: https://arxiv.org/pdf/2509.23924
Copy Paste: [[2509.23924]] Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step(https://arxiv.org/abs/2509.23924)
Keywords: diffusion
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding are non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: this https URL.

Title: SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing

Authors: Yi Yang, Xiaokun Zhang, Qingchen Fang, Ziqi Ye, Rui Li, Li Liu, Haipeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23927
Pdf URL: https://arxiv.org/pdf/2509.23927
Copy Paste: [[2509.23927]] SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing(https://arxiv.org/abs/2509.23927)
Keywords: self-supervised, foundation model
Abstract: Cross-modal artificial intelligence has garnered widespread attention in recent years, achieving significant progress in the study of natural images. However, existing methods are mostly designed for RGB imagery, leaving a significant gap in modeling synthetic aperture radar (SAR) imagery. SAR, with its all-day, all-weather imaging capabilities, plays an irreplaceable role in remote sensing scene understanding. To address this gap, this paper proposes SAR-KnowLIP, the first universal SAR multimodal foundational model, along with reusable data and evaluation baselines. Specifically: (1) This work introduces the critical yet long-overlooked attribute of geographic information into remote sensing research, constructing SAR-GEOVL-1M (the first large-scale SAR dataset with complete geographic projection properties), covering multiple satellite platforms, 120,000 images, and 135 cities. (2) Aligned structured text is generated through a hierarchical cognitive chain-of-thought (HCoT), providing more than one million multi-dimensional semantic annotations of landforms, regional functions, target attributes, and spatial relationships. (3) We design a Self-Consistent Iterative Optimization mechanism that continuously enhances cross-modal alignment through a self-supervised closed loop of contrastive, matching, and reconstruction learning on a transferable multimodal encoder. (4) A unified evaluation benchmark is established across 11 representative downstream vision and vision-language tasks, with comparisons against 14 leading foundation models, where SAR-KnowLIP demonstrates leading performance, particularly in object counting and land-cover classification. We expect that SAR-KnowLIP's large-scale multimodal data, transferable model architecture, and comprehensive experimental benchmark will significantly advance the development of SAR multimodal baseline models.

Title: Diffusion Models are Kelly Gamblers

Authors: Akhil Premkumar
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2509.23937
Pdf URL: https://arxiv.org/pdf/2509.23937
Copy Paste: [[2509.23937]] Diffusion Models are Kelly Gamblers(https://arxiv.org/abs/2509.23937)
Keywords: diffusion
Abstract: We draw a connection between diffusion models and the Kelly criterion for maximizing returns in betting games. We find that conditional diffusion models store additional information to bind the signal $X$ with the conditioning information $Y$, equal to the mutual information between them. Classifier-free guidance effectively boosts the mutual information between $X$ and $Y$ at sampling time. This is especially helpful in image models, since the mutual information between images and their labels is low, a fact which is intimately connected to the manifold hypothesis. Finally, we point out some nuances in the popular perspective that diffusion models are infinitely deep autoencoders. In doing so, we relate the denoising loss to the Fermi Golden Rule from quantum mechanics.

Title: Brain-language fusion enables interactive neural readout and in-silico experimentation

Authors: Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, Tim Christian Kietzmann
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2509.23941
Pdf URL: https://arxiv.org/pdf/2509.23941
Copy Paste: [[2509.23941]] Brain-language fusion enables interactive neural readout and in-silico experimentation(https://arxiv.org/abs/2509.23941)
Keywords: generative
Abstract: Large language models (LLMs) have revolutionized human-machine interaction, and have been extended by embedding diverse modalities such as images into a shared language space. Yet, neural decoding has remained constrained by static, non-interactive methods. We introduce CorText, a framework that integrates neural activity directly into the latent space of an LLM, enabling open-ended, natural language interaction with brain data. Trained on fMRI data recorded during viewing of natural scenes, CorText generates accurate image captions and can answer more detailed questions better than controls, while having access to neural data only. We showcase that CorText achieves zero-shot generalization beyond semantic categories seen during training. Furthermore, we present a counterfactual analysis that emulates in-silico cortical microstimulation. These advances mark a shift from passive decoding toward generative, flexible interfaces between brain activity and language.

Title: HunyuanImage 3.0 Technical Report

Authors: Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23951
Pdf URL: https://arxiv.org/pdf/2509.23951
Copy Paste: [[2509.23951]] HunyuanImage 3.0 Technical Report(https://arxiv.org/abs/2509.23951)
Keywords: foundation model, generative
Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at this https URL

Title: Reinforcement Learning with Inverse Rewards for World Model Post-training

Authors: Yang Ye, Tianyu He, Shuo Yang, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23958
Pdf URL: https://arxiv.org/pdf/2509.23958
Copy Paste: [[2509.23958]] Reinforcement Learning with Inverse Rewards for World Model Post-training(https://arxiv.org/abs/2509.23958)
Keywords: diffusion
Abstract: World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability of accurately modeling human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world model is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.

Title: VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion

Authors: Kargi Chauhan, Leilani H. Gilpin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23971
Pdf URL: https://arxiv.org/pdf/2509.23971
Copy Paste: [[2509.23971]] VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion(https://arxiv.org/abs/2509.23971)
Keywords: diffusion
Abstract: Modern diffusion models generate realistic traffic simulations but systematically violate physical constraints. In a large-scale evaluation of SceneDiffuser++, a state-of-the-art traffic simulator, we find that 50% of generated trajectories violate basic physical laws - vehicles collide, drive off roads, and spawn inside buildings. This reveals a fundamental limitation: current models treat physical validity as an emergent property rather than an architectural requirement. We propose Validity-First Spatial Intelligence (VFSI), which enforces constraints through energy-based guidance during diffusion sampling, without model retraining. By incorporating collision avoidance and kinematic constraints as energy functions, we guide the denoising process toward physically valid trajectories. Across 200 urban scenarios from the Waymo Open Motion Dataset, VFSI reduces collision rates by 67% (24.6% to 8.1%) and improves overall validity by 87% (50.3% to 94.2%), while simultaneously improving realism metrics (ADE: 1.34m to 1.21m). Our model-agnostic approach demonstrates that explicit constraint enforcement during inference is both necessary and sufficient for physically valid traffic simulation.

Title: Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution

Authors: Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Yulun Zhang, Jian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23980
Pdf URL: https://arxiv.org/pdf/2509.23980
Copy Paste: [[2509.23980]] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution(https://arxiv.org/abs/2509.23980)
Keywords: diffusion, generative
Abstract: Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient $\textbf{o}$ne-step diffusion model with $\textbf{a}$ttention $\textbf{s}$pecialization for real-world v$\textbf{i}$deo $\textbf{s}$uper-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a $\textbf{6.2$\times$}$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at \href{this https URL}{this https URL}.

Title: RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization

Authors: Dongki Jung, Jaehoon Choi, Yonghan Lee, Dinesh Manocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23991
Pdf URL: https://arxiv.org/pdf/2509.23991
Copy Paste: [[2509.23991]] RPG360: Robust 360 Depth Estimation with Perspective Foundation Models and Graph Optimization(https://arxiv.org/abs/2509.23991)
Keywords: foundation model
Abstract: The increasing use of 360 images across various domains has emphasized the need for robust depth estimation techniques tailored for omnidirectional images. However, obtaining large-scale labeled datasets for 360 depth estimation remains a significant challenge. In this paper, we propose RPG360, a training-free robust 360 monocular depth estimation method that leverages perspective foundation models and graph optimization. Our approach converts 360 images into six-face cubemap representations, where a perspective foundation model is employed to estimate depth and surface normals. To address depth scale inconsistencies across different faces of the cubemap, we introduce a novel depth scale alignment technique using graph-based optimization, which parameterizes the predicted depth and normal maps while incorporating an additional per-face scale parameter. This optimization ensures depth scale consistency across the six-face cubemap while preserving 3D structural integrity. Furthermore, as foundation models exhibit inherent robustness in zero-shot settings, our method achieves superior performance across diverse datasets, including Matterport3D, Stanford2D3D, and 360Loc. We also demonstrate the versatility of our depth estimation approach by validating its benefits in downstream tasks such as feature matching 3.2 ~ 5.4% and Structure from Motion 0.2 ~ 9.7% in AUC@5.

Title: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Authors: Muleilan Pei, Shaoshuai Shi, Shaojie Shen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.23993
Pdf URL: https://arxiv.org/pdf/2509.23993
Copy Paste: [[2509.23993]] Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning(https://arxiv.org/abs/2509.23993)
Keywords: foundation model
Abstract: Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2509.24006
Pdf URL: https://arxiv.org/pdf/2509.24006
Copy Paste: [[2509.24006]] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention(https://arxiv.org/abs/2509.24006)
Keywords: diffusion
Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.

Title: Sequential Diffusion Language Models

Authors: Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24007
Pdf URL: https://arxiv.org/pdf/2509.24007
Copy Paste: [[2509.24007]] Sequential Diffusion Language Models(https://arxiv.org/abs/2509.24007)
Keywords: diffusion
Abstract: Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1 higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: this https URL

Title: Pretraining Scaling Laws for Generative Evaluations of Language Models

Authors: Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24012
Pdf URL: https://arxiv.org/pdf/2509.24012
Copy Paste: [[2509.24012]] Pretraining Scaling Laws for Generative Evaluations of Language Models(https://arxiv.org/abs/2509.24012)
Keywords: generative
Abstract: Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling laws parameters and the predictability of performance. (2) In terms of scaling law parameters, we find that the compute scaling law and parameters\,+\,tokens scaling law stabilize for the last ~$1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the log likelihoods of gold reference solutions predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.

Title: SparseD: Sparse Attention for Diffusion Language Models

Authors: Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24014
Pdf URL: https://arxiv.org/pdf/2509.24014
Copy Paste: [[2509.24014]] SparseD: Sparse Attention for Diffusion Language Models(https://arxiv.org/abs/2509.24014)
Keywords: diffusion
Abstract: While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.

Title: GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning

Authors: Umang Garg, Bowen Zhang, Anantanjit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath
Subjects: cs.LG, cs.AI, cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2509.24031
Pdf URL: https://arxiv.org/pdf/2509.24031
Copy Paste: [[2509.24031]] GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning(https://arxiv.org/abs/2509.24031)
Keywords: self-supervised, foundation model
Abstract: Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPSMasked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.

Title: Automated Vulnerability Validation and Verification: A Large Language Model Approach

Authors: Alireza Lotfi, Charalampos Katsis, Elisa Bertino
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.24037
Pdf URL: https://arxiv.org/pdf/2509.24037
Copy Paste: [[2509.24037]] Automated Vulnerability Validation and Verification: A Large Language Model Approach(https://arxiv.org/abs/2509.24037)
Keywords: generative
Abstract: Software vulnerabilities remain a critical security challenge, providing entry points for attackers into enterprise networks. Despite advances in security practices, the lack of high-quality datasets capturing diverse exploit behavior limits effective vulnerability assessment and mitigation. This paper introduces an end-to-end multi-step pipeline leveraging generative AI, specifically large language models (LLMs), to address the challenges of orchestrating and reproducing attacks to known software vulnerabilities. Our approach extracts information from CVE disclosures in the National Vulnerability Database, augments it with external public knowledge (e.g., threat advisories, code snippets) using Retrieval-Augmented Generation (RAG), and automates the creation of containerized environments and exploit code for each vulnerability. The pipeline iteratively refines generated artifacts, validates attack success with test cases, and supports complex multi-container setups. Our methodology overcomes key obstacles, including noisy and incomplete vulnerability descriptions, by integrating LLMs and RAG to fill information gaps. We demonstrate the effectiveness of our pipeline across different vulnerability types, such as memory overflows, denial of service, and remote code execution, spanning diverse programming languages, libraries and years. In doing so, we uncover significant inconsistencies in CVE descriptions, emphasizing the need for more rigorous verification in the CVE disclosure process. Our approach is model-agnostic, working across multiple LLMs, and we open-source the artifacts to enable reproducibility and accelerate security research. To the best of our knowledge, this is the first system to systematically orchestrate and exploit known vulnerabilities in containerized environments by combining general-purpose LLM reasoning with CVE data and RAG-based context enrichment.

Title: In-Context Compositional Q-Learning for Offline Reinforcement Learning

Authors: Qiushui Xu, Yuhao Huang, Yushu Jiang, Lei Song, Jinyu Wang, Wenliang Zheng, Jiang Bian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24067
Pdf URL: https://arxiv.org/pdf/2509.24067
Copy Paste: [[2509.24067]] In-Context Compositional Q-Learning for Offline Reinforcement Learning(https://arxiv.org/abs/2509.24067)
Keywords: in-context
Abstract: Accurately estimating the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a single global Q-function, which struggles to capture the compositional nature of tasks involving diverse subtasks. We propose In-context Compositional Q-Learning (\texttt{ICQL}), the first offline RL framework that formulates Q-learning as a contextual inference problem, using linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that under two assumptions--linear approximability of the local Q-function and accurate weight inference from retrieved context--\texttt{ICQL} achieves bounded Q-function approximation error, and supports near-optimal policy extraction. Empirically, \texttt{ICQL} substantially improves performance in offline settings: improving performance in kitchen tasks by up to 16.4\%, and in Gym and Adroit tasks by up to 8.6\% and 6.3\%. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation, positioning \texttt{ICQL} as a principled and effective framework for offline RL.

Title: AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring

Authors: Youssef Sabiri, Walid Houmaidi, Ouail El Maadi, Yousra Chtouki
Subjects: cs.LG, cs.AI, cs.CV, stat.AP
Abstract URL: https://arxiv.org/abs/2509.24069
Pdf URL: https://arxiv.org/pdf/2509.24069
Copy Paste: [[2509.24069]] AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring(https://arxiv.org/abs/2509.24069)
Keywords: anomaly
Abstract: Smart aquaculture systems depend on rich environmental data streams to protect fish welfare, optimize feeding, and reduce energy use. Yet public datasets that describe the air surrounding indoor tanks remain scarce, limiting the development of forecasting and anomaly-detection tools that couple head-space conditions with water-quality dynamics. We therefore introduce AQUAIR, an open-access public dataset that logs six Indoor Environmental Quality (IEQ) variables--air temperature, relative humidity, carbon dioxide, total volatile organic compounds, PM2.5 and PM10--inside a fish aquaculture facility in Amghass, Azrou, Morocco. A single Awair HOME monitor sampled every five minutes from 14 October 2024 to 9 January 2025, producing more than 23,000 time-stamped observations that are fully quality-controlled and publicly archived on Figshare. We describe the sensor placement, ISO-compliant mounting height, calibration checks against reference instruments, and an open-source processing pipeline that normalizes timestamps, interpolates short gaps, and exports analysis-ready tables. Exploratory statistics show stable conditions (median CO2 = 758 ppm; PM2.5 = 12 micrograms/m3) with pronounced feeding-time peaks, offering rich structure for short-horizon forecasting, event detection, and sensor drift studies. AQUAIR thus fills a critical gap in smart aquaculture informatics and provides a reproducible benchmark for data-centric machine learning curricula and environmental sensing research focused on head-space dynamics in recirculating aquaculture systems.

Title: A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks

Authors: Bo Hu, José C. Príncipe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24076
Pdf URL: https://arxiv.org/pdf/2509.24076
Copy Paste: [[2509.24076]] A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks(https://arxiv.org/abs/2509.24076)
Keywords: self-supervised, generative
Abstract: Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.

Title: Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

Authors: Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24099
Pdf URL: https://arxiv.org/pdf/2509.24099
Copy Paste: [[2509.24099]] Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow(https://arxiv.org/abs/2509.24099)
Keywords: diffusion
Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive objective that further strengthens alignment with conditioning signals and introduce synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.

Title: GeoFunFlow: Geometric Function Flow Matching for Inverse Operator Learning over Complex Geometries

Authors: Sifan Wang, Zhikai Wu, David van Dijk, Lu Lu
Subjects: cs.LG, physics.comp-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24117
Pdf URL: https://arxiv.org/pdf/2509.24117
Copy Paste: [[2509.24117]] GeoFunFlow: Geometric Function Flow Matching for Inverse Operator Learning over Complex Geometries(https://arxiv.org/abs/2509.24117)
Keywords: diffusion
Abstract: Inverse problems governed by partial differential equations (PDEs) are crucial in science and engineering. They are particularly challenging due to ill-posedness, data sparsity, and the added complexity of irregular geometries. Classical PDE-constrained optimization methods are computationally expensive, especially when repeated posterior sampling is required. Learning-based approaches improve efficiency and scalability, yet most are designed for regular domains or focus on forward modeling. Here, we introduce {\em GeoFunFlow}, a geometric diffusion model framework for inverse problems on complex geometries. GeoFunFlow combines a novel geometric function autoencoder (GeoFAE) and a latent diffusion model trained via rectified flow. GeoFAE employs a Perceiver module to process unstructured meshes of varying sizes and produces continuous reconstructions of physical fields, while the diffusion model enables posterior sampling from sparse and noisy data. Across five benchmarks, GeoFunFlow achieves state-of-the-art reconstruction accuracy over complex geometries, provides calibrated uncertainty quantification, and delivers efficient inference compared to operator-learning and diffusion model baselines.

Title: The Impossibility of Inverse Permutation Learning in Transformer Models

Authors: Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24125
Pdf URL: https://arxiv.org/pdf/2509.24125
Copy Paste: [[2509.24125]] The Impossibility of Inverse Permutation Learning in Transformer Models(https://arxiv.org/abs/2509.24125)
Keywords: in-context
Abstract: In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (``canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with ``scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking'' tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

Title: GANji: A Framework for Introductory AI Image Generation

Authors: Chandon Hamel, Mike Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24128
Pdf URL: https://arxiv.org/pdf/2509.24128
Copy Paste: [[2509.24128]] GANji: A Framework for Introductory AI Image Generation(https://arxiv.org/abs/2509.24128)
Keywords: diffusion, generative
Abstract: The comparative study of generative models often requires significant computational resources, creating a barrier for researchers and practitioners. This paper introduces GANji, a lightweight framework for benchmarking foundational AI image generation techniques using a dataset of 10,314 Japanese Kanji characters. It systematically compares the performance of a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN), and a Denoising Diffusion Probabilistic Model (DDPM). The results demonstrate that while the DDPM achieves the highest image fidelity, with a Fréchet Inception Distance (FID) score of 26.2, its sampling time is over 2,000 times slower than the other models. The GANji framework is an effective and accessible tool for revealing the fundamental trade-offs between model architecture, computational cost, and visual quality, making it ideal for both educational and research purposes.

Title: Asymmetric VAE for One-Step Video Super-Resolution Acceleration

Authors: Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24142
Pdf URL: https://arxiv.org/pdf/2509.24142
Copy Paste: [[2509.24142]] Asymmetric VAE for One-Step Video Super-Resolution Acceleration(https://arxiv.org/abs/2509.24142)
Keywords: diffusion
Abstract: Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE's performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at this https URL.

Title: Your thoughts tell who you are: Characterize the reasoning patterns of LRMs

Authors: Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, Shaoliang Nie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24147
Pdf URL: https://arxiv.org/pdf/2509.24147
Copy Paste: [[2509.24147]] Your thoughts tell who you are: Characterize the reasoning patterns of LRMs(https://arxiv.org/abs/2509.24147)
Keywords: generative
Abstract: Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.

Title: Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis

Authors: Haolin Yang, Hakaze Cho, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24164
Pdf URL: https://arxiv.org/pdf/2509.24164
Copy Paste: [[2509.24164]] Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis(https://arxiv.org/abs/2509.24164)
Keywords: in-context
Abstract: We investigate the mechanistic underpinnings of in-context learning (ICL) in large language models by reconciling two dominant perspectives: the component-level analysis of attention heads and the holistic decomposition of ICL into Task Recognition (TR) and Task Learning (TL). We propose a novel framework based on Task Subspace Logit Attribution (TSLA) to identify attention heads specialized in TR and TL, and demonstrate their distinct yet complementary roles. Through correlation analysis, ablation studies, and input perturbations, we show that the identified TR and TL heads independently and effectively capture the TR and TL components of ICL. Using steering experiments with geometric analysis of hidden states, we reveal that TR heads promote task recognition by aligning hidden states with the task subspace, while TL heads rotate hidden states within the subspace toward the correct label to facilitate prediction. We further show how previous findings on ICL mechanisms, including induction heads and task vectors, can be reconciled with our attention-head-level analysis of the TR-TL decomposition. Our framework thus provides a unified and interpretable account of how large language models execute ICL across diverse tasks and settings.

Title: LatXGen: Towards Radiation-Free and Accurate Quantitative Analysis of Sagittal Spinal Alignment Via Cross-Modal Radiographic View Synthesis

Authors: Moxin Zhao, Nan Meng, Jason Pui Yin Cheung, Chris Yuk Kwan Tang, Chenxi Yu, Wenting Zhong, Pengyu Lu, Chang Shi, Yipeng Zhuang, Teng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24165
Pdf URL: https://arxiv.org/pdf/2509.24165
Copy Paste: [[2509.24165]] LatXGen: Towards Radiation-Free and Accurate Quantitative Analysis of Sagittal Spinal Alignment Via Cross-Modal Radiographic View Synthesis(https://arxiv.org/abs/2509.24165)
Keywords: generative
Abstract: Adolescent Idiopathic Scoliosis (AIS) is a complex three-dimensional spinal deformity, and accurate morphological assessment requires evaluating both coronal and sagittal alignment. While previous research has made significant progress in developing radiation-free methods for coronal plane assessment, reliable and accurate evaluation of sagittal alignment without ionizing radiation remains largely underexplored. To address this gap, we propose LatXGen, a novel generative framework that synthesizes realistic lateral spinal radiographs from posterior Red-Green-Blue and Depth (RGBD) images of unclothed backs. This enables accurate, radiation-free estimation of sagittal spinal alignment. LatXGen tackles two core challenges: (1) inferring sagittal spinal morphology changes from a lateral perspective based on posteroanterior surface geometry, and (2) performing cross-modality translation from RGBD input to the radiographic domain. The framework adopts a dual-stage architecture that progressively estimates lateral spinal structure and synthesizes corresponding radiographs. To enhance anatomical consistency, we introduce an attention-based Fast Fourier Convolution (FFC) module for integrating anatomical features from RGBD images and 3D landmarks, and a Spatial Deformation Network (SDN) to model morphological variations in the lateral view. Additionally, we construct the first large-scale paired dataset for this task, comprising 3,264 RGBD and lateral radiograph pairs. Experimental results demonstrate that LatXGen produces anatomically accurate radiographs and outperforms existing GAN-based methods in both visual fidelity and quantitative metrics. This study offers a promising, radiation-free solution for sagittal spine assessment and advances comprehensive AIS evaluation.

Title: Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight

Authors: Haolin Yang, Hakaze Cho, Kaize Ding, Naoya Inoue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24169
Pdf URL: https://arxiv.org/pdf/2509.24169
Copy Paste: [[2509.24169]] Task Vectors, Learned Not Extracted: Performance Gains and Mechanistic Insight(https://arxiv.org/abs/2509.24169)
Keywords: in-context
Abstract: Large Language Models (LLMs) can perform new tasks from in-context demonstrations, a phenomenon known as in-context learning (ICL). Recent work suggests that these demonstrations are compressed into task vectors (TVs), compact task representations that LLMs exploit for predictions. However, prior studies typically extract TVs from model outputs or hidden states using cumbersome and opaque methods, and they rarely elucidate the mechanisms by which TVs influence computation. In this work, we address both limitations. First, we propose directly training Learned Task Vectors (LTVs), which surpass extracted TVs in accuracy and exhibit superior flexibility-acting effectively at arbitrary layers, positions, and even with ICL prompts. Second, through systematic analysis, we investigate the mechanistic role of TVs, showing that at the low level they steer predictions primarily through attention-head OV circuits, with a small subset of "key heads" most decisive. At a higher level, we find that despite Transformer nonlinearities, TV propagation is largely linear: early TVs are rotated toward task-relevant subspaces to improve logits of relevant labels, while later TVs are predominantly scaled in magnitude. Taken together, LTVs not only provide a practical approach for obtaining effective TVs but also offer a principled lens into the mechanistic foundations of ICL.

Title: FM-FoG: A Real-Time Foundation Model-based Wearable System for Freezing-of-Gait Mitigation

Authors: Chuntian Chi, John Clapham, Leslie Cloud, Ingrid Pretzer-Aboff, GinaMari Blackwell, Huajie Shao, Gang Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24176
Pdf URL: https://arxiv.org/pdf/2509.24176
Copy Paste: [[2509.24176]] FM-FoG: A Real-Time Foundation Model-based Wearable System for Freezing-of-Gait Mitigation(https://arxiv.org/abs/2509.24176)
Keywords: self-supervised, foundation model
Abstract: Freezing-of-Gait (FoG) affects over 50% of mid-to-late stage Parkinson's disease (PD) patients, significantly impairing patients' mobility independence and reducing quality of life. FoG is characterized by sudden episodes where walking cannot start or is interrupted, occurring exclusively during standing or walking, and never while sitting or lying down. Current FoG detection systems require extensive patient-specific training data and lack generalization, limiting clinical deployment. To address these issues, we introduce FM-FoG, a real-time foundation model-based wearable system achieving FoG detection in unseen patients without patient-specific training. Our approach combines self-supervised pretraining on diverse Inertial Measurement Unit (IMU) datasets with sensor context integration. Since FoG occurs only during ambulatory activities, a lightweight CNN-LSTM activity classifier selectively activates the foundation model only during walking or standing, avoiding unnecessary computation. Evaluated on the VCU FoG-IMU dataset with 23 PD patients, FM-FoG achieves a 98.5% F1-score when tested on previously unseen patients, substantially outperforming competitive baseline methods. Deployed on a Google Pixel 8a smartphone, the system extends battery life by up to 72% while maintaining sub-20ms intervention latency. The results indicate that our FM-FoG can enable practical, energy-efficient healthcare applications that generalize across patients without individual training requirements.

Title: Tumor Synthesis conditioned on Radiomics

Authors: Jonghun Kim, Inye Na, Eun Sook Ko, Hyunjin Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24182
Pdf URL: https://arxiv.org/pdf/2509.24182
Copy Paste: [[2509.24182]] Tumor Synthesis conditioned on Radiomics(https://arxiv.org/abs/2509.24182)
Keywords: diffusion, generative
Abstract: Due to privacy concerns, obtaining large datasets is challenging in medical image analysis, especially with 3D modalities like Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing generative models, developed to address this issue, often face limitations in output diversity and thus cannot accurately represent 3D medical images. We propose a tumor-generation model that utilizes radiomics features as generative conditions. Radiomics features are high-dimensional handcrafted semantic features that are biologically well-grounded and thus are good candidates for conditioning. Our model employs a GAN-based model to generate tumor masks and a diffusion-based approach to generate tumor texture conditioned on radiomics features. Our method allows the user to generate tumor images according to user-specified radiomics features such as size, shape, and texture at an arbitrary location. This enables the physicians to easily visualize tumor images to better understand tumors according to changing radiomics features. Our approach allows for the removal, manipulation, and repositioning of tumors, generating various tumor types in different scenarios. The model has been tested on tumors in four different organs (kidney, lung, breast, and brain) across CT and MRI. The synthesized images are shown to effectively aid in training for downstream tasks and their authenticity was also evaluated through expert evaluations. Our method has potential usage in treatment planning with diverse synthesized tumors.

Title: Retrieval-augmented GUI Agents with Generative Guidelines

Authors: Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24183
Pdf URL: https://arxiv.org/pdf/2509.24183
Copy Paste: [[2509.24183]] Retrieval-augmented GUI Agents with Generative Guidelines(https://arxiv.org/abs/2509.24183)
Keywords: generative
Abstract: GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.

Title: Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning

Authors: Jonghun Kim, Hyunjin Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24185
Pdf URL: https://arxiv.org/pdf/2509.24185
Copy Paste: [[2509.24185]] Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning(https://arxiv.org/abs/2509.24185)
Keywords: diffusion, generative
Abstract: Neoadjuvant chemotherapy (NAC) is a common therapy option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we adopt maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images leveraging the emerging diffusion model. We introduce prompt tuning to account for the known clinical factors affecting response to NAC. Our model performed better than other generative models in image quality metrics. Our model was better at generating images that reflected changes in tumor size according to pCR compared to other models. Ablation study confirmed the design choices of our method. Our study has the potential to help with precision medicine.

Title: PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Authors: Luyang Zhang, Siyuan Peng, Jialu Wang, Shichao Zhu, Beibei Li, Zhongcun Wang, Guangmou Pan, Yan Li, Song Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24189
Pdf URL: https://arxiv.org/pdf/2509.24189
Copy Paste: [[2509.24189]] PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution(https://arxiv.org/abs/2509.24189)
Keywords: generative
Abstract: Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user's next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user's preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.

Title: An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation

Authors: Zach Eidex, Mojtaba Safari, Jie Ding, Richard Qiu, Justin Roper, David Yu, Hui-Kuo Shu, Zhen Tian, Hui Mao, Xiaofeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24194
Pdf URL: https://arxiv.org/pdf/2509.24194
Copy Paste: [[2509.24194]] An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation(https://arxiv.org/abs/2509.24194)
Keywords: diffusion
Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly employed with T1w MRI to enhance lesion visualization but are restricted in patients at risk of nephrogenic systemic fibrosis and variations in GBCA administration can introduce imaging inconsistencies. This study develops an efficient 3D deep-learning framework to generate T1-contrast enhanced images (T1C) from pre-contrast multiparametric MRI. Approach: We propose the 3D latent rectified flow (T1C-RFlow) model for generating high-quality T1C images. First, T1w and T2-FLAIR images are input into a pretrained autoencoder to acquire an efficient latent space representation. A rectified flow diffusion model is then trained in this latent space representation. The T1C-RFlow model was trained on a curated dataset comprised of the BraTS 2024 glioma (GLI; 1480 patients), meningioma (MEN; 1141 patients), and metastases (MET; 1475 patients) datasets. Selected patients were split into train (N=2860), validation (N=612), and test (N=614) sets. Results: Both qualitative and quantitative results demonstrate that the T1C-RFlow model outperforms benchmark 3D models (pix2pix, DDPM, Diffusion Transformers (DiT-3D)) trained in the same latent space. T1C-RFlow achieved the following metrics - GLI: NMSE 0.044 +/- 0.047, SSIM 0.935 +/- 0.025; MEN: NMSE 0.046 +/- 0.029, SSIM 0.937 +/- 0.021; MET: NMSE 0.098 +/- 0.088, SSIM 0.905 +/- 0.082. T1C-RFlow had the best tumor reconstruction performance and significantly faster denoising times (6.9 s/volume, 200 steps) than conventional DDPM models in both latent space (37.7s, 1000 steps) and patch-based in image space (4.3 hr/volume). Significance: Our proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models. Further development may permit a practical method for contrast-agent-free MRI for brain tumors.

Title: UniVid: The Open-Source Unified Video Model

Authors: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24200
Pdf URL: https://arxiv.org/pdf/2509.24200
Copy Paste: [[2509.24200]] UniVid: The Open-Source Unified Video Model(https://arxiv.org/abs/2509.24200)
Keywords: diffusion
Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

Title: BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation

Authors: Zelin Liu, Sicheng Dong, Bocheng Li, Yixuan Yang, Jiacheng Ruan, Chenxu Zhou, Suncheng Xiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24204
Pdf URL: https://arxiv.org/pdf/2509.24204
Copy Paste: [[2509.24204]] BALR-SAM: Boundary-Aware Low-Rank Adaptation of SAM for Resource-Efficient Medical Image Segmentation(https://arxiv.org/abs/2509.24204)
Keywords: foundation model
Abstract: Vision foundation models like the Segment Anything Model (SAM), pretrained on large-scale natural image datasets, often struggle in medical image segmentation due to a lack of domain-specific adaptation. In clinical practice, fine-tuning such models efficiently for medical downstream tasks with minimal resource demands, while maintaining strong performance, is challenging. To address these issues, we propose BALR-SAM, a boundary-aware low-rank adaptation framework that enhances SAM for medical imaging. It combines three tailored components: (1) a Complementary Detail Enhancement Network (CDEN) using depthwise separable convolutions and multi-scale fusion to capture boundary-sensitive features essential for accurate segmentation; (2) low-rank adapters integrated into SAM's Vision Transformer blocks to optimize feature representation and attention for medical contexts, while simultaneously significantly reducing the parameter space; and (3) a low-rank tensor attention mechanism in the mask decoder, cutting memory usage by 75% and boosting inference speed. Experiments on standard medical segmentation datasets show that BALR-SAM, without requiring prompts, outperforms several state-of-the-art (SOTA) methods, including fully fine-tuned MedSAM, while updating just 1.8% (11.7M) of its parameters.

Title: Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos

Authors: Yingdong Hu, Yisheng He, Jinnan Chen, Weihao Yuan, Kejie Qiu, Zehong Lin, Siyu Zhu, Zilong Dong, Jun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24209
Pdf URL: https://arxiv.org/pdf/2509.24209
Copy Paste: [[2509.24209]] Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos(https://arxiv.org/abs/2509.24209)
Keywords: self-supervised
Abstract: Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by the slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of the ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: this https URL.

Title: Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis

Authors: Xuecheng Wu, Junxiao Xue, Xinyi Yin, Yunyun Shi, Liangyu Fu, Danlei Huang, Yifan Wang, Jia Zhang, Jiayu Nie, Jun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24214
Pdf URL: https://arxiv.org/pdf/2509.24214
Copy Paste: [[2509.24214]] Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis(https://arxiv.org/abs/2509.24214)
Keywords: self-supervised
Abstract: Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations in its audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm, bridging the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy, organizing the model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released at Github.

Title: Semantic Editing with Coupled Stochastic Differential Equations

Authors: Jianxin Zhang, Clayton Scott
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24223
Pdf URL: https://arxiv.org/pdf/2509.24223
Copy Paste: [[2509.24223]] Semantic Editing with Coupled Stochastic Differential Equations(https://arxiv.org/abs/2509.24223)
Keywords: diffusion, generative
Abstract: Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box-without retraining or auxiliary networks-and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.

Title: EVLF-FM: Explainable Vision Language Foundation Model for Medicine

Authors: Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, Yuhe Ke, Jie Yao, Kanae Fukutsu, Chrystie Wan Ning Quek, Ashley Hong, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Hiok Hong Chan, Victor Koh, Marcus Tan, Kelvin Z. Li, Leonard Yip, Ching Yu Cheng, Yih Chung Tham, Gavin Siew Wei Tan, Leopold Schmetterer, Marcus Ang, Rahat Hussain, Jod Mehta, Tin Aung, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Soon Thye Lim, Eyal Klang, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24231
Pdf URL: https://arxiv.org/pdf/2509.24231
Copy Paste: [[2509.24231]] EVLF-FM: Explainable Vision Language Foundation Model for Medicine(https://arxiv.org/abs/2509.24231)
Keywords: foundation model
Abstract: Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.

Title: FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation

Authors: Seungwook Kim, Seunghyeon Lee, Minsu Cho
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.24241
Pdf URL: https://arxiv.org/pdf/2509.24241
Copy Paste: [[2509.24241]] FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation(https://arxiv.org/abs/2509.24241)
Keywords: diffusion, foundation model
Abstract: Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.

Title: Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization

Authors: Yunhao Liang, Pujun Zhang, Yuan Qu, Shaochong Lin, Zuo-jun Max Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24256
Pdf URL: https://arxiv.org/pdf/2509.24256
Copy Paste: [[2509.24256]] Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization(https://arxiv.org/abs/2509.24256)
Keywords: self-supervised, foundation model, generative
Abstract: The pretrain-transfer paradigm, which underpins the success of large language models (LLMs), has demonstrated the immense power of creating foundation models that learn generalizable representations from vast datasets. However, extending this paradigm to Operations Research (OR) problems on graph structures remains challenging due to the fundamental conflict between the statistical flexibility of language and the strict combinatorial constraints of graphs. To bridge this gap, we introduce the Graph Foundation Model (GFM), the first framework capable of solving all distance-based optimization problems on graph structures. By introducing the LLM-like self-supervised pre-training paradigm on the paths generated from random walks in the graph, GFM is compelled to internalize the graph's complex topological and combinatorial rules, where the connectivity of the structure itself can be treated as the supervisory signal. Unlike existing neural methods that learn complex and task-specific solving policies, our approach leverages the pre-trained GFM as a foundational model of the graph's intrinsic structure, which in turn enables a simple generative heuristic to tackle a diverse range of optimization challenges effectively. Comprehensive experiments on networks ranging from 20 to 893 nodes demonstrate that GFM achieves competitive performance against specialized solvers across a variety of distinct optimization task classes, while maintaining significantly faster inference times. Our work establishes a new paradigm of adapting the pretrain-transfer framework to graph optimization, opening the door for applying foundation model innovations to OR.

Title: Cycle Diffusion Model for Counterfactual Image Generation

Authors: Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24267
Pdf URL: https://arxiv.org/pdf/2509.24267
Copy Paste: [[2509.24267]] Cycle Diffusion Model for Counterfactual Image Generation(https://arxiv.org/abs/2509.24267)
Keywords: diffusion, generative
Abstract: Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.

Title: ASIA: Adaptive 3D Segmentation using Few Image Annotations

Authors: Sai Raj Kishore Perla, Aditya Vora, Sauradip Nag, Ali Mahdavi-Amiri, Hao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24288
Pdf URL: https://arxiv.org/pdf/2509.24288
Copy Paste: [[2509.24288]] ASIA: Adaptive 3D Segmentation using Few Image Annotations(https://arxiv.org/abs/2509.24288)
Keywords: diffusion
Abstract: We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text-describable "parts" in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.

Title: Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24291
Pdf URL: https://arxiv.org/pdf/2509.24291
Copy Paste: [[2509.24291]] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement(https://arxiv.org/abs/2509.24291)
Keywords: generative
Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

Title: DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Authors: Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24296
Pdf URL: https://arxiv.org/pdf/2509.24296
Copy Paste: [[2509.24296]] DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models(https://arxiv.org/abs/2509.24296)
Keywords: diffusion
Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: this https URL.

Title: ELASTIQ: EEG-Language Alignment with Semantic Task Instruction and Querying

Authors: Muyun Jiang, Shuailei Zhang, Zhenjie Yang, Mengjun Wu, Weibang Jiang, Zhiwei Guo, Wei Zhang, Rui Liu, Shangen Zhang, Yong Li, Yi Ding, Cuntai Guan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24302
Pdf URL: https://arxiv.org/pdf/2509.24302
Copy Paste: [[2509.24302]] ELASTIQ: EEG-Language Alignment with Semantic Task Instruction and Querying(https://arxiv.org/abs/2509.24302)
Keywords: foundation model
Abstract: Recent advances in electroencephalography (EEG) foundation models, which capture transferable EEG representations, have greatly accelerated the development of brain-computer interfaces (BCI). However, existing approaches still struggle to incorporate language instructions as prior constraints for EEG representation learning, limiting their ability to leverage the semantic knowledge inherent in language to unify different labels and tasks. To address this challenge, we present ELASTIQ, a foundation model for EEG-Language Alignment with Semantic Task Instruction and Querying. ELASTIQ integrates task-aware semantic guidance to produce structured and linguistically aligned EEG embeddings, thereby enhancing decoding robustness and transferability. In the pretraining stage, we introduce a joint Spectral-Temporal Reconstruction (STR) module, which combines frequency masking as a global spectral perturbation with two complementary temporal objectives: random masking to capture contextual dependencies and causal masking to model sequential dynamics. In the instruction tuning stage, we propose the Instruction-conditioned Q-Former (IQF), a query-based cross-attention transformer that injects instruction embeddings into EEG tokens and aligns them with textual label embeddings through learnable queries. We evaluate ELASTIQ on 20 datasets spanning motor imagery, emotion recognition, steady-state visual evoked potentials, covert speech, and healthcare tasks. ELASTIQ achieves state-of-the-art performance on 14 of the 20 datasets and obtains the best average results across all five task categories. Importantly, our analyses reveal for the first time that explicit task instructions serve as semantic priors guiding EEG embeddings into coherent and linguistically grounded spaces. The code and pre-trained weights will be released.

Title: A study of Universal ODE approaches to predicting soil organic carbon

Authors: Satyanarayana Raju G.V.V, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24306
Pdf URL: https://arxiv.org/pdf/2509.24306
Copy Paste: [[2509.24306]] A study of Universal ODE approaches to predicting soil organic carbon(https://arxiv.org/abs/2509.24306)
Keywords: diffusion
Abstract: Soil Organic Carbon (SOC) is a foundation of soil health and global climate resilience, yet its prediction remains difficult because of intricate physical, chemical, and biological processes. In this study, we explore a Scientific Machine Learning (SciML) framework built on Universal Differential Equations (UDEs) to forecast SOC dynamics across soil depth and time. UDEs blend mechanistic physics, such as advection diffusion transport, with neural networks that learn nonlinear microbial production and respiration. Using synthetic datasets, we systematically evaluated six experimental cases, progressing from clean, noise free benchmarks to stress tests with high (35%) multiplicative, spatially correlated noise. Our results highlight both the potential and limitations of the approach. In noise free and moderate noise settings, the UDE accurately reconstructed SOC dynamics. In clean terminal profile at 50 years (Case 4) achieved near perfect fidelity, with MSE = 1.6e-5, and R2 = 0.9999. Case 5, with 7% noise, remained robust (MSE = 3.4e-6, R2 = 0.99998), capturing depth wise SOC trends while tolerating realistic measurement uncertainty. In contrast, Case 3 (35% noise at t = 0) showed clear evidence of overfitting: the model reproduced noisy inputs with high accuracy but lost generalization against the clean truth (R2 = 0.94). Case 6 (35% noise at t = 50) collapsed toward overly smooth mean profiles, failing to capture depth wise variability and yielding negative R2, underscoring the limits of standard training under severe uncertainty. These findings suggest that UDEs are well suited for scalable, noise tolerant SOC forecasting, though advancing toward field deployment will require noise aware loss functions, probabilistic modelling, and tighter integration of microbial dynamics.

Title: Towards Foundation Models for Cryo-ET Subtomogram Analysis

Authors: Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang, Xi Xiao, Xiaoyu Cao, Genpei Zhang, Xiao Wang, Xiaolong Wu, Tianyang Wang, Yang Liu, Xingjian Li, Min Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24311
Pdf URL: https://arxiv.org/pdf/2509.24311
Copy Paste: [[2509.24311]] Towards Foundation Models for Cryo-ET Subtomogram Analysis(https://arxiv.org/abs/2509.24311)
Keywords: foundation model
Abstract: Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.

Title: Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Authors: Guolin Ke, Hui Xue
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24335
Pdf URL: https://arxiv.org/pdf/2509.24335
Copy Paste: [[2509.24335]] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation(https://arxiv.org/abs/2509.24335)
Keywords: diffusion
Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.

Title: Expanding Horizons of Level Diversity via Multi-objective Evolutionary Learning

Authors: Qingquan Zhang, Ziqi Wang, Yuchen Li, Keyuan Zhang, Bo Yuan, Jialin Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24341
Pdf URL: https://arxiv.org/pdf/2509.24341
Copy Paste: [[2509.24341]] Expanding Horizons of Level Diversity via Multi-objective Evolutionary Learning(https://arxiv.org/abs/2509.24341)
Keywords: generative
Abstract: In recent years, the generation of diverse game levels has gained increasing interest, contributing to a richer and more engaging gaming experience. A number of level diversity metrics have been proposed in literature, which are naturally multi-dimensional, leading to conflicted, complementary, or both relationships among these dimensions. However, existing level generation approaches often fail to comprehensively assess diversity across those dimensions. This paper aims to expand horizons of level diversity by considering multi-dimensional diversity when training generative models. We formulate the model training as a multi-objective learning problem, where each diversity metric is treated as a distinct objective. Furthermore, a multi-objective evolutionary learning framework that optimises multiple diversity metrics simultaneously throughout the model training process is proposed. Our case study on the commonly used benchmark Super Mario Bros. demonstrates that our proposed framework can enhance multi-dimensional diversity and identify a Pareto front of generative models, which provides a range of tradeoffs among playability and two representative diversity metrics, including a content-based one and a player-centered one. Such capability enables decision-makers to make informed choices when selecting generators accommodating a variety of scenarios and the diverse needs of players and designers.

Title: NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

Authors: Yixuan Ren, Hanyu Wang, Hao Chen, Bo He, Abhinav Shrivastava
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24353
Pdf URL: https://arxiv.org/pdf/2509.24353
Copy Paste: [[2509.24353]] NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis(https://arxiv.org/abs/2509.24353)
Keywords: diffusion
Abstract: We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetworkbased tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.

Title: DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense

Authors: Amira Guesmi, Muhammad Shafique
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24359
Pdf URL: https://arxiv.org/pdf/2509.24359
Copy Paste: [[2509.24359]] DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense(https://arxiv.org/abs/2509.24359)
Keywords: diffusion
Abstract: Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify \emph{gradient consensus}--the tendency of randomized transformations to yield aligned gradients--as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce \textbf{DRIFT} (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces \emph{gradient dissonance} by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.

Title: Watermarking Diffusion Language Models

Authors: Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2509.24368
Pdf URL: https://arxiv.org/pdf/2509.24368
Copy Paste: [[2509.24368]] Watermarking Diffusion Language Models(https://arxiv.org/abs/2509.24368)
Keywords: diffusion
Abstract: We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.

Title: From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis

Authors: Khawlah Bajbaa, Abbas Anwar, Muhammad Saqib, Hafeez Anwar, Nabin Sharma, Muhammad Usman
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2509.24369
Pdf URL: https://arxiv.org/pdf/2509.24369
Copy Paste: [[2509.24369]] From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis(https://arxiv.org/abs/2509.24369)
Keywords: diffusion, generative
Abstract: Street view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of valuable insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery presents significant challenges due to substantial differences in appearance and viewing perspective between these two domains. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework's capabilities, we integrate a conditional Generative Adversarial Network (GAN) that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods. The framework successfully generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.

Title: DINOReg: Strong Point Cloud Registration with Vision Foundation Model

Authors: Congjia Chen, Yufu Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24370
Pdf URL: https://arxiv.org/pdf/2509.24370
Copy Paste: [[2509.24370]] DINOReg: Strong Point Cloud Registration with Vision Foundation Model(https://arxiv.org/abs/2509.24370)
Keywords: foundation model
Abstract: Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limit their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model's ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at this https URL.

Title: AXIS: Explainable Time Series Anomaly Detection with Large Language Models

Authors: Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24378
Pdf URL: https://arxiv.org/pdf/2509.24378
Copy Paste: [[2509.24378]] AXIS: Explainable Time Series Anomaly Detection with Large Language Models(https://arxiv.org/abs/2509.24378)
Keywords: anomaly
Abstract: Time-series anomaly detection (TSAD) increasingly demands explanations that articulate not only if an anomaly occurred, but also what pattern it exhibits and why it is anomalous. Leveraging the impressive explanatory capabilities of Large Language Models (LLMs), recent works have attempted to treat time series as text for explainable TSAD. However, this approach faces a fundamental challenge: LLMs operate on discrete tokens and struggle to directly process long, continuous signals. Consequently, naive time-to-text serialization suffers from a lack of contextual grounding and representation alignment between the two modalities. To address this gap, we introduce AXIS, a framework that conditions a frozen LLM for nuanced time-series understanding. Instead of direct serialization, AXIS enriches the LLM's input with three complementary hints derived from the series: (i) a symbolic numeric hint for numerical grounding, (ii) a context-integrated, step-aligned hint distilled from a pretrained time-series encoder to capture fine-grained dynamics, and (iii) a task-prior hint that encodes global anomaly characteristics. Furthermore, to facilitate robust evaluation of explainability, we introduce a new benchmark featuring multi-format questions and rationales that supervise contextual grounding and pattern-level semantics. Extensive experiments, including both LLM-based and human evaluations, demonstrate that AXIS yields explanations of significantly higher quality and achieves competitive detection accuracy compared to general-purpose LLMs, specialized time-series LLMs, and time-series Vision Language Models.

Title: REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport

Authors: Soumyadeep Chandra, Kaushik Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24382
Pdf URL: https://arxiv.org/pdf/2509.24382
Copy Paste: [[2509.24382]] REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport(https://arxiv.org/abs/2509.24382)
Keywords: self-supervised
Abstract: Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL, leverage Kantorovich Optimal Transport (KOT) to build frame-to-frame correspondences, but rely solely on feature similarity and fail to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, a self-supervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to KOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% average F1-score improvements and over 30% temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.

Title: LLaDA-MoE: A Sparse MoE Diffusion Language Model

Authors: Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24389
Pdf URL: https://arxiv.org/pdf/2509.24389
Copy Paste: [[2509.24389]] LLaDA-MoE: A Sparse MoE Diffusion Language Model(https://arxiv.org/abs/2509.24389)
Keywords: diffusion
Abstract: We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

Title: RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Authors: Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24410
Pdf URL: https://arxiv.org/pdf/2509.24410
Copy Paste: [[2509.24410]] RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis(https://arxiv.org/abs/2509.24410)
Keywords: generative
Abstract: Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

Title: ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection

Authors: Tao Yin, Xiaohong Zhang, Shaochen Fu, Zhibin Zhang, Li Huang, Yiyuan Yang, Kaixiang Yang, Meng Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24414
Pdf URL: https://arxiv.org/pdf/2509.24414
Copy Paste: [[2509.24414]] ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection(https://arxiv.org/abs/2509.24414)
Keywords: anomaly
Abstract: One main challenge in time series anomaly detection for industrial IoT lies in the complex spatio-temporal couplings within multivariate data. However, traditional anomaly detection methods focus on modeling spatial or temporal dependencies independently, resulting in suboptimal representation learning and limited sensitivity to anomalous dispersion in high-dimensional spaces. In this work, we conduct an empirical analysis showing that both normal and anomalous samples tend to scatter in high-dimensional space, especially anomalous samples are markedly more dispersed. We formalize this dispersion phenomenon as scattering, quantified by the mean pairwise distance among sample representations, and leverage it as an inductive signal to enhance spatio-temporal anomaly detection. Technically, we propose ScatterAD to model representation scattering across temporal and topological dimensions. ScatterAD incorporates a topological encoder for capturing graph-structured scattering and a temporal encoder for constraining over-scattering through mean squared error minimization between neighboring time steps. We introduce a contrastive fusion mechanism to ensure the complementarity of the learned temporal and topological representations. Additionally, we theoretically show that maximizing the conditional mutual information between temporal and topological views improves cross-view consistency and enhances more discriminative representations. Extensive experiments on multiple public benchmarks show that ScatterAD achieves state-of-the-art performance on multivariate time series anomaly detection. Code is available at this repository: this https URL.

Title: CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Authors: Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24416
Pdf URL: https://arxiv.org/pdf/2509.24416
Copy Paste: [[2509.24416]] CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers(https://arxiv.org/abs/2509.24416)
Keywords: diffusion
Abstract: Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these attributions also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serve as an efficient model compression technique, model post-training quantization (PTQ) can reduce the memory consumption and speed up the inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most of the PTQ methods can not honestly represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at \hyperlink{this https URL}{this https URL}.

Title: UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24427
Pdf URL: https://arxiv.org/pdf/2509.24427
Copy Paste: [[2509.24427]] UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark(https://arxiv.org/abs/2509.24427)
Keywords: diffusion, generative
Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

Title: Alternatives To Next Token Prediction In Text Generation - A Survey

Authors: Charlie Wyatt, Aditya Joshi, Flora Salim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24435
Pdf URL: https://arxiv.org/pdf/2509.24435
Copy Paste: [[2509.24435]] Alternatives To Next Token Prediction In Text Generation - A Survey(https://arxiv.org/abs/2509.24435)
Keywords: diffusion
Abstract: The paradigm of Next Token Prediction (NTP) has driven the unprecedented success of Large Language Models (LLMs), but is also the source of their most persistent weaknesses such as poor long-term planning, error accumulation, and computational inefficiency. Acknowledging the growing interest in exploring alternatives to NTP, the survey describes the emerging ecosystem of alternatives to NTP. We categorise these approaches into five main families: (1) Multi-Token Prediction, which targets a block of future tokens instead of a single one; (2) Plan-then-Generate, where a global, high-level plan is created upfront to guide token-level decoding; (3) Latent Reasoning, which shifts the autoregressive process itself into a continuous latent space; (4) Continuous Generation Approaches, which replace sequential generation with iterative, parallel refinement through diffusion, flow matching, or energy-based methods; and (5) Non-Transformer Architectures, which sidestep NTP through their inherent model structure. By synthesizing insights across these methods, this survey offers a taxonomy to guide research into models that address the known limitations of token-level generation to develop new transformative models for natural language processing.

Title: Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

Authors: Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.24445
Pdf URL: https://arxiv.org/pdf/2509.24445
Copy Paste: [[2509.24445]] Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA(https://arxiv.org/abs/2509.24445)
Keywords: generative
Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5\% on STAR (+4.9\%) and a 7B model to 80.8\% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.

Title: Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks

Authors: Hangil Park, Yongmin Seo, Tae-Kyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24448
Pdf URL: https://arxiv.org/pdf/2509.24448
Copy Paste: [[2509.24448]] Generalist Multi-Class Anomaly Detection via Distillation to Two Heterogeneous Student Networks(https://arxiv.org/abs/2509.24448)
Keywords: anomaly
Abstract: Anomaly detection (AD) plays an important role in various real-world applications. Recent advancements in AD, however, are often biased towards industrial inspection, struggle to generalize to broader tasks like semantic anomaly detection and vice versa. Although recent methods have attempted to address general anomaly detection, their performance remains sensitive to dataset-specific settings and single-class tasks. In this paper, we propose a novel dual-model ensemble approach based on knowledge distillation (KD) to bridge this gap. Our framework consists of a teacher and two student models: an Encoder-Decoder model, specialized in detecting patch-level minor defects for industrial AD and an Encoder-Encoder model, optimized for semantic AD. Both models leverage a shared pre-trained encoder (DINOv2) to extract high-quality feature representations. The dual models are jointly learned using the Noisy-OR objective, and the final anomaly score is obtained using the joint probability via local and semantic anomaly scores derived from the respective models. We evaluate our method on eight public benchmarks under both single-class and multi-class settings: MVTec-AD, MVTec-LOCO, VisA and Real-IAD for industrial inspection and CIFAR-10/100, FMNIST and View for semantic anomaly detection. The proposed method achieved state-of-the-art accuracies in both domains, in multi-class as well as single-class settings, demonstrating generalization across multiple domains of anomaly detection. Our model achieved an image-level AUROC of 99.7% on MVTec-AD and 97.8% on CIFAR-10, which is significantly better than the prior general AD models in multi-class settings and even higher than the best specialist models on individual benchmarks.

Title: Interpretable Kernel Representation Learning at Scale: A Unified Framework Utilizing Nyström Approximation

Authors: Maedeh Zarvandi, Michael Timothy, Theresa Wasserer, Debarghya Ghoshdastidar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24467
Pdf URL: https://arxiv.org/pdf/2509.24467
Copy Paste: [[2509.24467]] Interpretable Kernel Representation Learning at Scale: A Unified Framework Utilizing Nyström Approximation(https://arxiv.org/abs/2509.24467)
Keywords: self-supervised, foundation model
Abstract: Kernel methods provide a theoretically grounded framework for non-linear and non-parametric learning, with strong analytic foundations and statistical guarantees. Yet, their scalability has long been limited by prohibitive time and memory costs. While progress has been made in scaling kernel regression, no framework exists for scalable kernel-based representation learning, restricting their use in the era of foundation models where representations are learned from massive unlabeled data. We introduce KREPES -- a unified, scalable framework for kernel-based representation learning via Nyström approximation. KREPES accommodates a wide range of unsupervised and self-supervised losses, and experiments on large image and tabular datasets demonstrate its efficiency. Crucially, KREPES enables principled interpretability of the learned representations, an immediate benefit over deep models, which we substantiate through dedicated analysis.

Title: LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation

Authors: Heechang Kim, Gwanghyun Kim, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24469
Pdf URL: https://arxiv.org/pdf/2509.24469
Copy Paste: [[2509.24469]] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation(https://arxiv.org/abs/2509.24469)
Keywords: diffusion
Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion including motion quality as consistent as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

Title: Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

Authors: Jonas Hübotter, Patrik Wolf, Alexander Shevchenko, Dennis Jüni, Andreas Krause, Gil Kur
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24510
Pdf URL: https://arxiv.org/pdf/2509.24510
Copy Paste: [[2509.24510]] Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models(https://arxiv.org/abs/2509.24510)
Keywords: foundation model
Abstract: Recent empirical studies have explored the idea of continuing to train a model at test-time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help when applied to out-of-distribution adaptation or used with privileged data. However, the growing scale of foundation models with most test data being in-distribution questions these explanations. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization, focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

Title: CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Authors: Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24526
Pdf URL: https://arxiv.org/pdf/2509.24526
Copy Paste: [[2509.24526]] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models(https://arxiv.org/abs/2509.24526)
Keywords: diffusion
Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state of the art two step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

Title: Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

Authors: Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24531
Pdf URL: https://arxiv.org/pdf/2509.24531
Copy Paste: [[2509.24531]] Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis(https://arxiv.org/abs/2509.24531)
Keywords: diffusion
Abstract: Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at this https URL.

Title: Training-Free Multimodal Guidance for Video to Audio Generation

Authors: Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2509.24550
Pdf URL: https://arxiv.org/pdf/2509.24550
Copy Paste: [[2509.24550]] Training-Free Multimodal Guidance for Video to Audio Generation(https://arxiv.org/abs/2509.24550)
Keywords: diffusion
Abstract: Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Although the excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.

Title: SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics

Authors: Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.24572
Pdf URL: https://arxiv.org/pdf/2509.24572
Copy Paste: [[2509.24572]] SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics(https://arxiv.org/abs/2509.24572)
Keywords: diffusion
Abstract: Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9\% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100\%. Code available: this https URL.

Title: SAIP: A Plug-and-Play Scale-adaptive Module in Diffusion-based Inverse Problems

Authors: Lingyu Wang, Xiangming Meng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.24580
Pdf URL: https://arxiv.org/pdf/2509.24580
Copy Paste: [[2509.24580]] SAIP: A Plug-and-Play Scale-adaptive Module in Diffusion-based Inverse Problems(https://arxiv.org/abs/2509.24580)
Keywords: diffusion
Abstract: Solving inverse problems with diffusion models has shown promise in tasks such as image restoration. A common approach is to formulate the problem in a Bayesian framework and sample from the posterior by combining the prior score with the likelihood score. Since the likelihood term is often intractable, estimators like DPS, DMPS, and $\pi$GDM are widely adopted. However, these methods rely on a fixed, manually tuned scale to balance prior and likelihood contributions. Such a static design is suboptimal, as the ideal balance varies across timesteps and tasks, limiting performance and generalization. To address this issue, we propose SAIP, a plug-and-play module that adaptively refines the scale at each timestep without retraining or altering the diffusion backbone. SAIP integrates seamlessly into existing samplers and consistently improves reconstruction quality across diverse image restoration tasks, including challenging scenarios.

Title: FreeRet: MLLMs as Training-Free Retrievers

Authors: Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24621
Pdf URL: https://arxiv.org/pdf/2509.24621
Copy Paste: [[2509.24621]] FreeRet: MLLMs as Training-Free Retrievers(https://arxiv.org/abs/2509.24621)
Keywords: generative
Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

Title: RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement

Authors: Zhu, Libo, Zhou, Zihan, Liu, Xiaoyang, Zhang, Weihang, Shi, Keyu, Fu, Yifan, Zhang, Yulun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24644
Pdf URL: https://arxiv.org/pdf/2509.24644
Copy Paste: [[2509.24644]] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement(https://arxiv.org/abs/2509.24644)
Keywords: diffusion
Abstract: Capturing screens is now routine in our everyday lives. But the photographs of emissive displays are often influenced by the flicker-banding (FB), which is alternating bright%u2013dark stripes that arise from temporal aliasing between a camera's rolling-shutter readout and the display's brightness modulation. Unlike moire degradation, which has been extensively studied, the FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement, RIFLE, a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE) that predicts key banding attributes and injects it into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to research the simulation and removal of FB. Our work establishes a great foundation for subsequent research in both the dataset construction and the removal model design. Our dataset and code will be released soon.

Title: Learning Object-Centric Representations Based on Slots in Real World Scenarios

Authors: Adil Kaan Akan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24652
Pdf URL: https://arxiv.org/pdf/2509.24652
Copy Paste: [[2509.24652]] Learning Object-Centric Representations Based on Slots in Real World Scenarios(https://arxiv.org/abs/2509.24652)
Keywords: diffusion, generative
Abstract: A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.

Title: SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24695
Pdf URL: https://arxiv.org/pdf/2509.24695
Copy Paste: [[2509.24695]] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer(https://arxiv.org/abs/2509.24695)
Keywords: diffusion
Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.

Title: Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Authors: Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24702
Pdf URL: https://arxiv.org/pdf/2509.24702
Copy Paste: [[2509.24702]] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility(https://arxiv.org/abs/2509.24702)
Keywords: diffusion
Abstract: Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.

Title: MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Authors: Guibin Zhang, Muxin Fu, Shuicheng Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24704
Pdf URL: https://arxiv.org/pdf/2509.24704
Copy Paste: [[2509.24704]] MemGen: Weaving Generative Latent Memory for Self-Evolving Agents(https://arxiv.org/abs/2509.24704)
Keywords: generative
Abstract: Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a \textit{memory trigger}, which monitors the agent's reasoning state to decide explicit memory invocation, and a \textit{memory weaver}, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to $38.22\%$, exceeds GRPO by up to $13.44\%$, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

Title: Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Authors: Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24739
Pdf URL: https://arxiv.org/pdf/2509.24739
Copy Paste: [[2509.24739]] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation(https://arxiv.org/abs/2509.24739)
Keywords: foundation model
Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.

Title: ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Authors: Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24758
Pdf URL: https://arxiv.org/pdf/2509.24758
Copy Paste: [[2509.24758]] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors(https://arxiv.org/abs/2509.24758)
Keywords: diffusion
Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce \textbf{ExGS}, a novel feed-forward framework that unifies \textbf{Universal Gaussian Compression} (UGC) with \textbf{GaussPainter} for \textbf{Ex}treme 3D\textbf{GS} compression. \textbf{UGC} performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas \textbf{GaussPainter} leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over $100\times$ compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at \href{this https URL}{here}.

Title: In-Context Learning of Temporal Point Processes with Foundation Inference Models

Authors: David Berghaus, Patrick Seifner, Kostadin Cvejoski, César Ojeda, Ramsés J. Sánchez
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24762
Pdf URL: https://arxiv.org/pdf/2509.24762
Copy Paste: [[2509.24762]] In-Context Learning of Temporal Point Processes with Foundation Inference Models(https://arxiv.org/abs/2509.24762)
Keywords: in-context
Abstract: Modeling event sequences of multiple event types with marked temporal point processes (MTPPs) provides a principled way to uncover governing dynamical rules and predict future events. Current neural network approaches to MTPP inference rely on training separate, specialized models for each target system. We pursue a radically different approach: drawing on amortized inference and in-context learning, we pretrain a deep neural network to infer, in-context, the conditional intensity functions of event histories from a context defined by sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution of Hawkes processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without any additional training, or be rapidly finetuned to target systems. Experiments show that this amortized approach matches the performance of specialized models on next-event prediction across common benchmark datasets. Our pretrained model, repository and tutorials will soon be available online

Title: MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Authors: Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2509.24779
Pdf URL: https://arxiv.org/pdf/2509.24779
Copy Paste: [[2509.24779]] MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models(https://arxiv.org/abs/2509.24779)
Keywords: generative
Abstract: Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark Mars-FM ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.

Title: SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment

Authors: Hongyang Zhang, Yinhao Liu, Zhenyu Kuang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2509.24783
Pdf URL: https://arxiv.org/pdf/2509.24783
Copy Paste: [[2509.24783]] SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment(https://arxiv.org/abs/2509.24783)
Keywords: self-supervised
Abstract: Cross-view geo-localization aims at establishing location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking semantic degradation caused by extreme viewpoint disparities. To address this unique problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We firstly utilize the Google Retrieval Enhancement Module to perform data enhancement on street images, which mitigates the occlusion of the key target due to restricted street viewpoints. The Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations to ensure the consistent feature extraction across viewpoints. Meanwhile, we integrate the 3D scene information constructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, which achieve 25.75$\%$ Recall@1 accuracy on University-1652 in the UAVM2025 Challenge. Code will be released at this https URL.

Title: Assessing the risk of future Dunkelflaute events for Germany using generative deep learning

Authors: Felix Strnad, Jonathan Schmidt, Fabian Mockert, Philipp Hennig, Nicole Ludwig
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2509.24788
Pdf URL: https://arxiv.org/pdf/2509.24788
Copy Paste: [[2509.24788]] Assessing the risk of future Dunkelflaute events for Germany using generative deep learning(https://arxiv.org/abs/2509.24788)
Keywords: generative
Abstract: The European electricity power grid is transitioning towards renewable energy sources, characterized by an increasing share of off- and onshore wind and solar power. However, the weather dependency of these energy sources poses a challenge to grid stability, with so-called Dunkelflaute events -- periods of low wind and solar power generation -- being of particular concern due to their potential to cause electricity supply shortages. In this study, we investigate the impact of these events on the German electricity production in the years and decades to come. For this purpose, we adapt a recently developed generative deep learning framework to downscale climate simulations from the CMIP6 ensemble. We first compare their statistics to the historical record taken from ERA5 data. Next, we use these downscaled simulations to assess plausible future occurrences of Dunkelflaute events in Germany under the optimistic low (SSP2-4.5) and high (SSP5-8.5) emission scenarios. Our analysis indicates that both the frequency and duration of Dunkelflaute events in Germany in the ensemble mean are projected to remain largely unchanged compared to the historical period. This suggests that, under the considered climate scenarios, the associated risk is expected to remain stable throughout the century.

Title: Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Authors: Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24798
Pdf URL: https://arxiv.org/pdf/2509.24798
Copy Paste: [[2509.24798]] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation(https://arxiv.org/abs/2509.24798)
Keywords: diffusion
Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91\% MAE reduction on Pendulum for accurate attribute control and 87\% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.

Title: Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data

Authors: Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2509.24840
Pdf URL: https://arxiv.org/pdf/2509.24840
Copy Paste: [[2509.24840]] Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data(https://arxiv.org/abs/2509.24840)
Keywords: foundation model, generative
Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.

Title: DRIFT-Net: A Spectral--Coupled Neural Operator for PDEs Learning

Authors: Jiayi Li, Flora D. Salim
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2509.24868
Pdf URL: https://arxiv.org/pdf/2509.24868
Copy Paste: [[2509.24868]] DRIFT-Net: A Spectral--Coupled Neural Operator for PDEs Learning(https://arxiv.org/abs/2509.24868)
Keywords: foundation model
Abstract: Learning PDE dynamics with neural solvers can significantly improve wall-clock efficiency and accuracy compared with classical numerical solvers. In recent years, foundation models for PDEs have largely adopted multi-scale windowed self-attention, with the scOT backbone in \textsc{Poseidon} serving as a representative example. However, because of their locality, truly globally consistent spectral coupling can only be propagated gradually through deep stacking and window shifting. This weakens global coupling and leads to error accumulation and drift during closed-loop rollouts. To address this, we propose \textbf{DRIFT-Net}. It employs a dual-branch design comprising a spectral branch and an image branch. The spectral branch is responsible for capturing global, large-scale low-frequency information, whereas the image branch focuses on local details and nonstationary structures. Specifically, we first perform controlled, lightweight mixing within the low-frequency range. Then we fuse the spectral and image paths at each layer via bandwise weighting, which avoids the width inflation and training instability caused by naive concatenation. The fused result is transformed back into the spatial domain and added to the image branch, thereby preserving both global structure and high-frequency details across scales. Compared with strong attention-based baselines, DRIFT-Net achieves lower error and higher throughput with fewer parameters under identical training settings and budget. On Navier--Stokes benchmarks, the relative $L_{1}$ error is reduced by 7\%--54\%, the parameter count decreases by about 15\%, and the throughput remains higher than scOT. Ablation studies and theoretical analyses further demonstrate the stability and effectiveness of this design. The code is available at this https URL.

Title: Environment-Aware Satellite Image Generation with Diffusion Models

Authors: Nikos Kostagiolas, Pantelis Georgiades, Yannis Panagakis, Mihalis A. Nicolaou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24875
Pdf URL: https://arxiv.org/pdf/2509.24875
Copy Paste: [[2509.24875]] Environment-Aware Satellite Image Generation with Diffusion Models(https://arxiv.org/abs/2509.24875)
Keywords: diffusion, foundation model, generative
Abstract: Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context, that is able to generate satellite images by conditioning from any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method is i) to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporating a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for usage in downstream tasks. The collected 3-modal dataset is to our knowledge, the first publicly-available dataset to combine data from these three different mediums.

Title: ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Authors: Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.24878
Pdf URL: https://arxiv.org/pdf/2509.24878
Copy Paste: [[2509.24878]] ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation(https://arxiv.org/abs/2509.24878)
Keywords: diffusion, generative
Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets--DJI-day, Bosonplus-day, and Bosonplus-night--captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: this http URL

Title: VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines

Authors: Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24891
Pdf URL: https://arxiv.org/pdf/2509.24891
Copy Paste: [[2509.24891]] VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines(https://arxiv.org/abs/2509.24891)
Keywords: diffusion, generative
Abstract: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and to support downstream creative and editing tasks. While adversarial attacks on discriminative models are well studied, attacks targeting generative pipelines where small, stealthy perturbations in inputs lead to controlled changes in outputs are less explored. This study introduces VagueGAN, an attack pipeline combining a modular perturbation network PoisonerNet with a Generator Discriminator pair to craft stealthy triggers that cause targeted changes in generated images. Attack efficacy is evaluated using a custom proxy metric, while stealth is analyzed through perceptual and frequency domain measures. The transferability of the method to a modern diffusion based pipeline is further examined through ControlNet guided editing. Interestingly, the experiments show that poisoned outputs can display higher visual quality compared to clean counterparts, challenging the assumption that poisoning necessarily reduces fidelity. Unlike conventional pixel level perturbations, latent space poisoning in GANs and diffusion pipelines can retain or even enhance output aesthetics, exposing a blind spot in pixel level defenses. Moreover, carefully optimized perturbations can produce consistent, stealthy effects on generator outputs while remaining visually inconspicuous, raising concerns for the integrity of image generation pipelines.

Title: DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation

Authors: Xi Chen, Hongxun Yao, Zhaopan Xu, Kui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24896
Pdf URL: https://arxiv.org/pdf/2509.24896
Copy Paste: [[2509.24896]] DAM: Dual Active Learning with Multimodal Foundation Model for Source-Free Domain Adaptation(https://arxiv.org/abs/2509.24896)
Keywords: foundation model
Abstract: Source-free active domain adaptation (SFADA) enhances knowledge transfer from a source model to an unlabeled target domain using limited manual labels selected via active learning. While recent domain adaptation studies have introduced Vision-and-Language (ViL) models to improve pseudo-label quality or feature alignment, they often treat ViL-based and data supervision as separate sources, lacking effective fusion. To overcome this limitation, we propose Dual Active learning with Multimodal (DAM) foundation model, a novel framework that integrates multimodal supervision from a ViL model to complement sparse human annotations, thereby forming a dual supervisory signal. DAM initializes stable ViL-guided targets and employs a bidirectional distillation mechanism to foster mutual knowledge exchange between the target model and the dual supervisions during iterative adaptation. Extensive experiments demonstrate that DAM consistently outperforms existing methods and sets a new state-of-the-art across multiple SFADA benchmarks and active learning strategies.

Title: Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Authors: Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24899
Pdf URL: https://arxiv.org/pdf/2509.24899
Copy Paste: [[2509.24899]] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer(https://arxiv.org/abs/2509.24899)
Keywords: diffusion
Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce \textit{Attention Surgery}, an efficient framework for \textit{linearizing} or \textit{hybridizing} attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism-mixing softmax and linear tokens-with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40\% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.

Title: BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Authors: Andrés Fernández García, Javier de la Rosa, Julio Gonzalo, Roser Morante, Enrique Amigó, Alejandro Benito-Santos, Jorge Carrillo-de-Albornoz, Víctor Fresno, Adrian Ghajari, Guillermo Marco, Laura Plaza, Eva Sánchez Salido
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.24908
Pdf URL: https://arxiv.org/pdf/2509.24908
Copy Paste: [[2509.24908]] BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications(https://arxiv.org/abs/2509.24908)
Keywords: generative
Abstract: The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain's ``Bolet\'ın Oficial del Estado'' (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model -- BERTIN GPT-J 6B (32-bit precision) -- achieves a 24\% performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6\% vs.\ 33.5\%).

Title: Scalable GANs with Transformers

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24935
Pdf URL: https://arxiv.org/pdf/2509.24935
Copy Paste: [[2509.24935]] Scalable GANs with Transformers(https://arxiv.org/abs/2509.24935)
Keywords: generative
Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based and latent-space GANs, can be easily trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.

Title: OAT-FM: Optimal Acceleration Transport for Improved Flow Matching

Authors: Angxiao Yue, Anqi Dong, Hongteng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24936
Pdf URL: https://arxiv.org/pdf/2509.24936
Copy Paste: [[2509.24936]] OAT-FM: Optimal Acceleration Transport for Improved Flow Matching(https://arxiv.org/abs/2509.24936)
Keywords: generative
Abstract: As a powerful technique in generative modeling, Flow Matching (FM) aims to learn velocity fields from noise to data, which is often explained and implemented as solving Optimal Transport (OT) problems. In this study, we bridge FM and the recent theory of Optimal Acceleration Transport (OAT), developing an improved FM method called OAT-FM and exploring its benefits in both theory and practice. In particular, we demonstrate that the straightening objective hidden in existing OT-based FM methods is mathematically equivalent to minimizing the physical action associated with acceleration defined by OAT. Accordingly, instead of enforcing constant velocity, OAT-FM optimizes the acceleration transport in the product space of sample and velocity, whose objective corresponds to a necessary and sufficient condition of flow straightness. An efficient algorithm is designed to achieve OAT-FM with low complexity. OAT-FM motivates a new two-phase FM paradigm: Given a generative model trained by an arbitrary FM method, whose velocity information has been relatively reliable, we can fine-tune and improve it via OAT-FM. This paradigm eliminates the risk of data distribution drift and the need to generate a large number of noise data pairs, which consistently improves model performance in various generative tasks. Code is available at: this https URL

Title: Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning

Authors: Donghwa Kang, Junho Kim, Dongwoo Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24968
Pdf URL: https://arxiv.org/pdf/2509.24968
Copy Paste: [[2509.24968]] Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning(https://arxiv.org/abs/2509.24968)
Keywords: self-supervised
Abstract: Event cameras offer unique advantages for facial keypoint alignment under challenging conditions, such as low light and rapid motion, due to their high temporal resolution and robustness to varying illumination. However, existing RGB facial keypoint alignment methods do not perform well on event data, and training solely on event data often leads to suboptimal performance because of its limited spatial information. Moreover, the lack of comprehensive labeled event datasets further hinders progress in this area. To address these issues, we propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment. Our framework employs CMFA to integrate corresponding RGB data, guiding the model to extract robust facial features from event input images. In parallel, SSMER enables effective feature learning from unlabeled event data, overcoming spatial limitations. Extensive experiments on our real-event E-SIE dataset and a synthetic-event version of the public WFLW-V benchmark show that our approach consistently surpasses state-of-the-art methods across multiple evaluation metrics.

Title: On-the-Fly Data Augmentation for Brain Tumor Segmentation

Authors: Ishika Jain, Siri Willems, Steven Latre, Tom De Schepper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24973
Pdf URL: https://arxiv.org/pdf/2509.24973
Copy Paste: [[2509.24973]] On-the-Fly Data Augmentation for Brain Tumor Segmentation(https://arxiv.org/abs/2509.24973)
Keywords: generative
Abstract: Robust segmentation across both pre-treatment and post-treatment glioma scans can be helpful for consistent tumor monitoring and treatment planning. BraTS 2025 Task 1 addresses this by challenging models to generalize across varying tumor appearances throughout the treatment timeline. However, training such generalized models requires access to diverse, high-quality annotated data, which is often limited. While data augmentation can alleviate this, storing large volumes of augmented 3D data is computationally expensive. To address these challenges, we propose an on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training. We evaluate three nnU-Net-based models and their ensembles: (1) a baseline without external augmentation, (2) a regular on-the-fly augmented model, and (3) a model with customized on-the-fly augmentation. Built upon the nnU-Net framework, our pipeline leverages pretrained GliGAN weights and tumor insertion methods from prior challenge-winning solutions. An ensemble of the three models achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on the online BraTS 2025 validation platform. This work ranked first in the BraTS Lighthouse Challenge 2025 Task 1- Adult Glioma Segmentation.

Title: Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models

Authors: Ahmad Fraij, Sam Dauncey
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24974
Pdf URL: https://arxiv.org/pdf/2509.24974
Copy Paste: [[2509.24974]] Double Descent as a Lens for Sample Efficiency in Autoregressive vs. Discrete Diffusion Models(https://arxiv.org/abs/2509.24974)
Keywords: diffusion
Abstract: Data scarcity drives the need for more sample-efficient large language models. In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. We show that discrete diffusion models require larger capacity and more training epochs to escape their underparameterized regime and reach the interpolation threshold. In the strongly overparameterized regime, both models exhibit similar behavior, with neither exhibiting a pronounced second descent in test loss across a large range of model sizes. Overall, our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.

Title: Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

Authors: Haotian Dong, Wenjing Wang, Chen Li, Di Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24979
Pdf URL: https://arxiv.org/pdf/2509.24979
Copy Paste: [[2509.24979]] Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel(https://arxiv.org/abs/2509.24979)
Keywords: diffusion
Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose \textit{Wan-Alpha}, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: \href{this https URL}{this https URL}.

Title: SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Authors: Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24980
Pdf URL: https://arxiv.org/pdf/2509.24980
Copy Paste: [[2509.24980]] SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation(https://arxiv.org/abs/2509.24980)
Keywords: diffusion, generative
Abstract: Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold~\citep{ke2024repurposing} and Lotus~\citep{he2024lotus} adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose \textbf{SDPose}, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct \textbf{COCO-OOD}, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.

Title: Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24988
Pdf URL: https://arxiv.org/pdf/2509.24988
Copy Paste: [[2509.24988]] Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns(https://arxiv.org/abs/2509.24988)
Keywords: in-context
Abstract: Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model's "self-knowledge", i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer's correctness that is accessible to the model itself. However, our experiments reveal that an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM. Moreover, we hypothesize that a key factor in building a "Correctness Model" (CM) is exposure to a target model's historical predictions. We propose multiple methods to inject this historical correctness information, creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness data from many LLMs and learn patterns for correctness prediction applicable across datasets and models. We then use CMs as a lens for studying the source of correctness prediction ability and its generalization, systematically controlling their training data and finding that answer phrasing is a strong predictor for correctness. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can help improve correctness prediction, and post-hoc calibration can provide complementary reductions in calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.

Title: PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Authors: Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24997
Pdf URL: https://arxiv.org/pdf/2509.24997
Copy Paste: [[2509.24997]] PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion(https://arxiv.org/abs/2509.24997)
Keywords: diffusion
Abstract: Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.

Title: Score-based Membership Inference on Diffusion Models

Authors: Mingxing Rao, Bowen Qu, Daniel Moyer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.25003
Pdf URL: https://arxiv.org/pdf/2509.25003
Copy Paste: [[2509.25003]] Score-based Membership Inference on Diffusion Models(https://arxiv.org/abs/2509.25003)
Keywords: diffusion
Abstract: Membership inference attacks (MIAs) against diffusion models have emerged as a pressing privacy concern, as these models may inadvertently reveal whether a given sample was part of their training set. We present a theoretical and empirical study of score-based MIAs, focusing on the predicted noise vectors that diffusion models learn to approximate. We show that the expected denoiser output points toward a kernel-weighted local mean of nearby training samples, such that its norm encodes proximity to the training set and thereby reveals membership. Building on this observation, we propose SimA, a single-query attack that provides a principled, efficient alternative to existing multi-query methods. SimA achieves consistently strong performance across variants of DDPM, Latent Diffusion Model (LDM). Notably, we find that Latent Diffusion Models are surprisingly less vulnerable than pixel-space models, due to the strong information bottleneck imposed by their latent auto-encoder. We further investigate this by differing the regularization hyperparameters ($\beta$ in $\beta$-VAE) in latent channel and suggest a strategy to make LDM training more robust to MIA. Our results solidify the theory of score-based MIAs, while highlighting that Latent Diffusion class of methods requires better understanding of inversion for VAE, and not simply inversion of the Diffusion process

Title: Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Authors: Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25035
Pdf URL: https://arxiv.org/pdf/2509.25035
Copy Paste: [[2509.25035]] Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct(https://arxiv.org/abs/2509.25035)
Keywords: diffusion
Abstract: Fast generation of language texts is the holy grail that people pursue in the AI era. In this work, we introduced Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that leads to fast language generation models by initializing from a pre-trained (masked) discrete diffusion language model (dLLM). The resulting DiDi-Instruct model outperforms the dLLM counterparts and the GPT-2 baseline with 64x acceleration. In the theoretical part of the paper, we build the foundation of DiDi-Instruct in a framework of integral KL-divergence minimization, with practical training algorithms. We also introduce techniques like grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler (RGAS) that significantly improve the training stability, the model coverage, and the inference performances. On OpenWebText, DiDi-Instruct outperforms all accelerated language generation models as well as the GPT-2 baseline and the standard dLLMs, achieving sample perplexities ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs). These performance gains are accomplished with a negligible entropy loss of about 1% and 20x less additional training wall-clock time. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, and the generation of discrete protein sequences. In conclusion, DiDi-Instruct is an efficient yet effective distillation method, enabling language generation in the blink of an eye. We will release both code and models at this http URL.

Title: Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.25050
Pdf URL: https://arxiv.org/pdf/2509.25050
Copy Paste: [[2509.25050]] Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models(https://arxiv.org/abs/2509.25050)
Keywords: diffusion
Abstract: Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives--score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at this https URL.

Title: UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

Authors: Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Taoran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, Qi Tian
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2509.25079
Pdf URL: https://arxiv.org/pdf/2509.25079
Copy Paste: [[2509.25079]] UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation(https://arxiv.org/abs/2509.25079)
Keywords: diffusion
Abstract: High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos \& code are available at this https URL

Title: Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Authors: Bogdan Raonić, Siddhartha Mishra, Samuel Lanthaler
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.25080
Pdf URL: https://arxiv.org/pdf/2509.25080
Copy Paste: [[2509.25080]] Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI(https://arxiv.org/abs/2509.25080)
Keywords: diffusion
Abstract: Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at this https URL

Title: MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification

Authors: Xiaoyi Huang, Junwei Wu, Kejia Zhang, Carl Yang, Zhiming Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25082
Pdf URL: https://arxiv.org/pdf/2509.25082
Copy Paste: [[2509.25082]] MANI-Pure: Magnitude-Adaptive Noise Injection for Adversarial Purification(https://arxiv.org/abs/2509.25082)
Keywords: diffusion
Abstract: Adversarial purification with diffusion models has emerged as a promising defense strategy, but existing methods typically rely on uniform noise injection, which indiscriminately perturbs all frequencies, corrupting semantic structures and undermining robustness. Our empirical study reveals that adversarial perturbations are not uniformly distributed: they are predominantly concentrated in high-frequency regions, with heterogeneous magnitude intensity patterns that vary across frequencies and attack types. Motivated by this observation, we introduce MANI-Pure, a magnitude-adaptive purification framework that leverages the magnitude spectrum of inputs to guide the purification process. Instead of injecting homogeneous noise, MANI-Pure adaptively applies heterogeneous, frequency-targeted noise, effectively suppressing adversarial perturbations in fragile high-frequency, low-magnitude bands while preserving semantically critical low-frequency content. Extensive experiments on CIFAR-10 and ImageNet-1K validate the effectiveness of MANI-Pure. It narrows the clean accuracy gap to within 0.59 of the original classifier, while boosting robust accuracy by 2.15, and achieves the top-1 robust accuracy on the RobustBench leaderboard, surpassing the previous state-of-the-art method.

Title: jina-reranker-v3: Last but Not Late Interaction for Document Reranking

Authors: Feng Wang, Yuqing Li, Han Xiao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.25085
Pdf URL: https://arxiv.org/pdf/2509.25085
Copy Paste: [[2509.25085]] jina-reranker-v3: Last but Not Late Interaction for Document Reranking(https://arxiv.org/abs/2509.25085)
Keywords: generative
Abstract: jina-reranker-v3 is a 0.6B parameter multilingual document reranker that introduces a novel last but not late interaction. Unlike late interaction models such as ColBERT that perform separate encoding followed by multi-vector matching, our approach conducts causal self-attention between query and documents within the same context window, enabling rich cross-document interactions before extracting contextual embeddings from the last token of each document. This compact architecture achieves state-of-the-art BEIR performance with 61.94 nDCG@10 while being ten times smaller than generative listwise rerankers.

Title: Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs

Authors: Akio Hayakawa, Stefan Bott, Horacio Saggion
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25086
Pdf URL: https://arxiv.org/pdf/2509.25086
Copy Paste: [[2509.25086]] Towards Trustworthy Lexical Simplification: Exploring Safety and Efficiency with Small LLMs(https://arxiv.org/abs/2509.25086)
Keywords: in-context
Abstract: Despite their strong performance, large language models (LLMs) face challenges in real-world application of lexical simplification (LS), particularly in privacy-sensitive and resource-constrained environments. Moreover, since vulnerable user groups (e.g., people with disabilities) are one of the key target groups of this technology, it is crucial to ensure the safety and correctness of the output of LS systems. To address these issues, we propose an efficient framework for LS systems that utilizes small LLMs deployable in local environments. Within this framework, we explore knowledge distillation with synthesized data and in-context learning as baselines. Our experiments in five languages evaluate model outputs both automatically and manually. Our manual analysis reveals that while knowledge distillation boosts automatic metric scores, it also introduces a safety trade-off by increasing harmful simplifications. Importantly, we find that the model's output probability is a useful signal for detecting harmful simplifications. Leveraging this, we propose a filtering strategy that suppresses harmful simplifications while largely preserving beneficial ones. This work establishes a benchmark for efficient and safe LS with small LLMs. It highlights the key trade-offs between performance, efficiency, and safety, and demonstrates a promising approach for safe real-world deployment.

Title: Score Distillation of Flow Matching Models

Authors: Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25127
Pdf URL: https://arxiv.org/pdf/2509.25127
Copy Paste: [[2509.25127]] Score Distillation of Flow Matching Models(https://arxiv.org/abs/2509.25127)
Keywords: diffusion
Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.

Title: Chance-constrained Flow Matching for High-Fidelity Constraint-aware Generation

Authors: Jinhao Liang, Yixuan Sun, Anirban Samaddar, Sandeep Madireddy, Ferdinando Fioretto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25157
Pdf URL: https://arxiv.org/pdf/2509.25157
Copy Paste: [[2509.25157]] Chance-constrained Flow Matching for High-Fidelity Constraint-aware Generation(https://arxiv.org/abs/2509.25157)
Keywords: generative
Abstract: Generative models excel at synthesizing high-fidelity samples from complex data distributions, but they often violate hard constraints arising from physical laws or task specifications. A common remedy is to project intermediate samples onto the feasible set; however, repeated projection can distort the learned distribution and induce a mismatch with the data manifold. Thus, recent multi-stage procedures attempt to defer projection to clean samples during sampling, but they increase algorithmic complexity and accumulate errors across steps. This paper addresses these challenges by proposing a novel training-free method, Chance-constrained Flow Matching (CCFM), that integrates stochastic optimization into the sampling process, enabling effective enforcement of hard constraints while maintaining high-fidelity sample generation. Importantly, CCFM guarantees feasibility in the same manner as conventional repeated projection, yet, despite operating directly on noisy intermediate samples, it is theoretically equivalent to projecting onto the feasible set defined by clean samples. This yields a sampler that mitigates distributional distortion. Empirical experiments show that CCFM outperforms current state-of-the-art constrained generative models in modeling complex physical systems governed by partial differential equations and molecular docking problems, delivering higher feasibility and fidelity.

Title: Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Authors: Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25161
Pdf URL: https://arxiv.org/pdf/2509.25161
Copy Paste: [[2509.25161]] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time(https://arxiv.org/abs/2509.25161)
Keywords: diffusion
Abstract: Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.

Title: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Authors: Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25162
Pdf URL: https://arxiv.org/pdf/2509.25162
Copy Paste: [[2509.25162]] Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models(https://arxiv.org/abs/2509.25162)
Keywords: diffusion
Abstract: In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.

Title: GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models

Authors: Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.25170
Pdf URL: https://arxiv.org/pdf/2509.25170
Copy Paste: [[2509.25170]] GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models(https://arxiv.org/abs/2509.25170)
Keywords: diffusion
Abstract: The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms were proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms require to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a "flow matching model within a flow matching model" to sample Markov transitions. As we show in this work, this "inner" flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making it a simple, drop-in solution for inference-time scaling of flow and diffusion models.

Title: TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion

Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2509.25171
Pdf URL: https://arxiv.org/pdf/2509.25171
Copy Paste: [[2509.25171]] TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion(https://arxiv.org/abs/2509.25171)
Keywords: diffusion
Abstract: Reinforcement learning with stochastic optimal control offers a promising framework for diffusion fine-tuning, where a pre-trained diffusion model is optimized to generate paths that lead to a reward-tilted distribution. While these approaches enable optimization without access to explicit samples from the optimal distribution, they require training on rollouts under the current fine-tuned model, making them susceptible to reinforcing sub-optimal trajectories that yield poor rewards. To overcome this challenge, we introduce TRee Search Guided TRajectory-Aware Fine-Tuning for Discrete Diffusion (TR2-D2), a novel framework that optimizes reward-guided discrete diffusion trajectories with tree search to construct replay buffers for trajectory-aware fine-tuning. These buffers are generated using Monte Carlo Tree Search (MCTS) and subsequently used to fine-tune a pre-trained discrete diffusion model under a stochastic optimal control objective. We validate our framework on single- and multi-objective fine-tuning of biological sequence diffusion models, highlighting the overall effectiveness of TR2-D2 for reliable reward-guided fine-tuning in discrete sequence generation.

Title: Personalized Vision via Visual In-Context Learning

Authors: Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25172
Pdf URL: https://arxiv.org/pdf/2509.25172
Copy Paste: [[2509.25172]] Personalized Vision via Visual In-Context Learning(https://arxiv.org/abs/2509.25172)
Keywords: diffusion, in-context
Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods confine to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

Title: GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Authors: Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25178
Pdf URL: https://arxiv.org/pdf/2509.25178
Copy Paste: [[2509.25178]] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs(https://arxiv.org/abs/2509.25178)
Keywords: diffusion
Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

Title: DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

Authors: Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25180
Pdf URL: https://arxiv.org/pdf/2509.25180
Copy Paste: [[2509.25180]] DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space(https://arxiv.org/abs/2509.25180)
Keywords: diffusion
Abstract: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: this https URL.

Title: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Authors: Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25182
Pdf URL: https://arxiv.org/pdf/2509.25182
Copy Paste: [[2509.25182]] DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder(https://arxiv.org/abs/2509.25182)
Keywords: diffusion
Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: this https URL.

Title: PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos

Authors: Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25183
Pdf URL: https://arxiv.org/pdf/2509.25183
Copy Paste: [[2509.25183]] PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos(https://arxiv.org/abs/2509.25183)
Keywords: generative
Abstract: We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.

Title: Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding

Authors: Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.25188
Pdf URL: https://arxiv.org/pdf/2509.25188
Copy Paste: [[2509.25188]] Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding(https://arxiv.org/abs/2509.25188)
Keywords: diffusion
Abstract: Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through iterative denoising. However, current parallel decoding strategies rely on fixed, input-agnostic heuristics (e.g., confidence thresholds), which fail to adapt to input-specific characteristics, resulting in suboptimal speed-quality trade-offs across diverse NLP tasks. In this work, we explore a more flexible and dynamic approach to parallel decoding. We propose Learning to Parallel Decode (Learn2PD), a framework that trains a lightweight and adaptive filter model to predict, for each token position, whether the current prediction matches the final output. This learned filter approximates an oracle parallel decoding strategy that unmasks tokens only when correctly predicted. Importantly, the filter model is learned in a post-training manner, requiring only a small amount of computation to optimize it (minute-level GPU time). Additionally, we introduce End-of-Text Prediction (EoTP) to detect decoding completion at the end of sequence, avoiding redundant decoding of padding tokens. Experiments on the LLaDA benchmark demonstrate that our method achieves up to 22.58$\times$ speedup without any performance drop, and up to 57.51$\times$ when combined with KV-Cache.

Title: Visual Jigsaw Post-Training Improves MLLMs

Authors: Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25190
Pdf URL: https://arxiv.org/pdf/2509.25190
Copy Paste: [[2509.25190]] Visual Jigsaw Post-Training Improves MLLMs(https://arxiv.org/abs/2509.25190)
Keywords: self-supervised, generative
Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction, however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: this https URL

Title: VGGT-X: When VGGT Meets Dense Novel View Synthesis

Authors: Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25191
Pdf URL: https://arxiv.org/pdf/2509.25191
Copy Paste: [[2509.25191]] VGGT-X: When VGGT Meets Dense Novel View Synthesis(https://arxiv.org/abs/2509.25191)
Keywords: foundation model
Abstract: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at this https URL