2025-09-30

Title: Pathological Truth Bias in Vision-Language Models

Authors: Yash Thube
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22674
Pdf URL: https://arxiv.org/pdf/2509.22674
Copy Paste: [[2509.22674]] Pathological Truth Bias in Vision-Language Models(https://arxiv.org/abs/2509.22674)
Keywords: generative
Abstract: Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real-world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics: the Spatial Consistency Score (SCS) and the Incorrect Agreement Rate (IAR). Instruction-tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid-to-late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.
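The abstract names SCS and IAR without defining them, so the following is only a minimal sketch under stated assumptions: IAR is taken to be the share of visually contradicted statements the model still agrees with, and SCS the share of statement pairs where the model accepts the supported statement and rejects its contradicted twin. The `AuditPair` record and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AuditPair:
    """One audit item: model verdicts on a supported statement and its contradicted twin."""
    agrees_with_true: bool           # model said "yes" to the statement the image supports
    agrees_with_contradicted: bool   # model said "yes" to the statement the image contradicts


def incorrect_agreement_rate(pairs):
    """Assumed IAR: share of contradicted statements the model still agrees with."""
    if not pairs:
        raise ValueError("need at least one audit pair")
    return sum(p.agrees_with_contradicted for p in pairs) / len(pairs)


def spatial_consistency_score(pairs):
    """Assumed SCS: share of pairs where the model accepts the true and rejects the contradicted statement."""
    if not pairs:
        raise ValueError("need at least one audit pair")
    consistent = sum(
        p.agrees_with_true and not p.agrees_with_contradicted for p in pairs
    )
    return consistent / len(pairs)
```

Under these assumed definitions, a model that answers "yes" to everything scores IAR = 1 and SCS = 0, which matches the "truth bias" failure pattern the abstract describes for generative VLMs.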

Title: Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects

Authors: Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22692
Pdf URL: https://arxiv.org/pdf/2509.22692
Copy Paste: [[2509.22692]] Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects(https://arxiv.org/abs/2509.22692)
Keywords: super-resolution
Abstract: Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains and lack a comprehensive overview of the field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we construct a taxonomy of methods based on backbone structure and intended purpose. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at this https URL.

Title: GZSL-MoE: Apprentissage Généralisé Zéro-Shot basé sur le Mélange d'Experts pour la Segmentation Sémantique de Nuages de Points 3D Appliqué à un Jeu de Données d'Environnement de Collaboration Humain-Robot

Authors: Ahed Alboody (LINEACT)
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22708
Pdf URL: https://arxiv.org/pdf/2509.22708
Copy Paste: [[2509.22708]] GZSL-MoE: Apprentissage Généralisé Zéro-Shot basé sur le Mélange d'Experts pour la Segmentation Sémantique de Nuages de Points 3D Appliqué à un Jeu de Données d'Environnement de Collaboration Humain-Robot(https://arxiv.org/abs/2509.22708)
Keywords: generative
Abstract: The Generative Zero-Shot Learning (GZSL) approach has demonstrated significant potential in 3D point cloud semantic segmentation tasks. GZSL leverages generative models like GANs or VAEs to synthesize realistic features (real features) of unseen classes. This allows the model to label unseen classes during testing, despite being trained only on seen classes. In this context, we introduce the Generalized Zero-Shot Learning based on Mixture-of-Experts (GZSL-MoE) model. This model incorporates Mixture-of-Experts (MoE) layers to generate fake features that closely resemble real features extracted using a pre-trained KPConv (Kernel Point Convolution) model on seen classes. The main contribution of this paper is the integration of Mixture-of-Experts into the Generator and Discriminator components of the Generative Zero-Shot Learning model for 3D point cloud semantic segmentation, applied to the COVERED dataset (CollabOratiVE Robot Environment Dataset) for Human-Robot Collaboration (HRC) environments. By combining the Generative Zero-Shot Learning model with Mixture-of-Experts, GZSL-MoE for 3D point cloud semantic segmentation provides a promising solution for understanding complex 3D environments, especially when comprehensive training data for all object classes is unavailable. The performance evaluation of the GZSL-MoE model highlights its ability to enhance performance on both seen and unseen classes. Keywords: Generalized Zero-Shot Learning (GZSL), 3D Point Cloud, 3D Semantic Segmentation, Human-Robot Collaboration, COVERED (CollabOratiVE Robot Environment Dataset), KPConv, Mixture-of-Experts

Title: LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Authors: Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22720
Pdf URL: https://arxiv.org/pdf/2509.22720
Copy Paste: [[2509.22720]] LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning(https://arxiv.org/abs/2509.22720)
Keywords: generation
Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images with target objects in them, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Then we leverage compositional diffusion, a method traditionally used in robotics, to synthesize bounding boxes that respect object relations encoded in the scene graph for spatial layouts. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism and aesthetic alignment.

Title: MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Authors: Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, Qing Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22761
Pdf URL: https://arxiv.org/pdf/2509.22761
Copy Paste: [[2509.22761]] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning(https://arxiv.org/abs/2509.22761)
Keywords: generation
Abstract: Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

Title: DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

Authors: Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22793
Pdf URL: https://arxiv.org/pdf/2509.22793
Copy Paste: [[2509.22793]] DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models(https://arxiv.org/abs/2509.22793)
Keywords: generation
Abstract: Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it is often hard to balance aligning with the target distribution (learning a novel concept from limited images for personalization) against retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. One trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available at DEFTBase (this https URL).
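The two-component decomposition described in the abstract can be illustrated with a small sketch. This is one possible reading and not the authors' implementation: it assumes the projection is applied to the frozen weight W, that the subspace matrix P has orthonormal columns (so P Pᵀ is the projector onto its span), and that the adapted weight is (I − P Pᵀ) W + P R. The names `deft_style_update`, `p`, and `r` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def deft_style_update(w, p, r):
    """Return (I - P P^T) W + P R, assuming P has orthonormal columns.

    w: n x m frozen pre-trained weight
    p: n x k matrix whose columns span the low-rank subspace (assumed orthonormal)
    r: k x m matrix holding the new content written into that subspace
    """
    projected_out = matmul(p, matmul(transpose(p), w))  # part of W inside span(P)
    low_rank = matmul(p, r)                             # low-rank update inside span(P)
    return [
        [w_ij - out_ij + new_ij for w_ij, out_ij, new_ij in zip(w_row, out_row, new_row)]
        for w_row, out_row, new_row in zip(w, projected_out, low_rank)
    ]
```

With this form, directions orthogonal to span(P) keep their pre-trained values, while the content along span(P) is replaced by R. In the sketch both P and R would be trainable; here they are plain inputs.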

Title: VideoScore2: Think before You Score in Generative Video Evaluation

Authors: Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.22799
Pdf URL: https://arxiv.org/pdf/2509.22799
Copy Paste: [[2509.22799]] VideoScore2: Think before You Score in Generative Video Evaluation(https://arxiv.org/abs/2509.22799)
Keywords: generation, generative, quality assessment
Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset, VideoFeedback2, containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: this https URL
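Best-of-N sampling with a multi-dimensional evaluator can be sketched as follows. This is a generic illustration and not the VideoScore2 API: `score_fn` is a hypothetical callable that returns one score per dimension, and averaging the dimensions into a single reward is an assumption made here for simplicity.

```python
def best_of_n(candidates, score_fn):
    """Pick the candidate with the highest mean score across dimensions.

    candidates: list of generated samples (any type)
    score_fn:   hypothetical evaluator returning a dict {dimension_name: score}
    Returns the winning candidate and the list of mean scores.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    means = []
    for candidate in candidates:
        scores = score_fn(candidate)
        means.append(sum(scores.values()) / len(scores))
    # On ties, max() keeps the earliest candidate.
    best_index = max(range(len(means)), key=means.__getitem__)
    return candidates[best_index], means
```

A weighted sum, or a minimum over dimensions, would be equally plausible aggregation rules; the abstract does not say which one is used.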

Title: Seeing Isn't Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN

Authors: Roie Kazoom, Alon Goldberg, Hodaya Cohen, Ofer Hadar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.22836
Pdf URL: https://arxiv.org/pdf/2509.22836
Copy Paste: [[2509.22836]] Seeing Isn't Believing: Context-Aware Adversarial Patch Synthesis via Conditional GAN(https://arxiv.org/abs/2509.22836)
Keywords: generation, generative
Abstract: Adversarial patch attacks pose a severe threat to deep neural networks, yet most existing approaches rely on unrealistic white-box assumptions, untargeted objectives, or produce visually conspicuous patches that limit real-world applicability. In this work, we introduce a novel framework for fully controllable adversarial patch generation, where the attacker can freely choose both the input image x and the target class y_target, thereby dictating the exact misclassification outcome. Our method combines a generative U-Net design with Grad-CAM-guided patch placement, enabling semantic-aware localization that maximizes attack effectiveness while preserving visual realism. Extensive experiments across convolutional networks (DenseNet-121, ResNet-50) and vision transformers (ViT-B/16, Swin-B/16, among others) demonstrate that our approach achieves state-of-the-art performance across all settings, with attack success rates (ASR) and target-class success (TCS) consistently exceeding 99%. Importantly, we show that our method not only outperforms prior white-box attacks and untargeted baselines, but also surpasses existing non-realistic approaches that produce detectable artifacts. By simultaneously ensuring realism, targeted control, and black-box applicability (the three most challenging dimensions of patch-based attacks), our framework establishes a new benchmark for adversarial robustness research, bridging the gap between theoretical attack strength and practical stealthiness.
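The abstract reports ASR and TCS without spelling out how they are computed. A minimal sketch, assuming the common definitions (ASR as the share of patched inputs whose prediction no longer matches the true label, TCS as the share predicted as the attacker-chosen target), could look like this. The function name and argument layout are hypothetical.

```python
def attack_metrics(true_labels, target_labels, predictions):
    """Return (ASR, TCS) for a batch of patched inputs.

    true_labels:   ground-truth class of each clean image
    target_labels: class the attacker wanted for each image
    predictions:   model output on the patched image
    """
    n = len(predictions)
    if n == 0 or not (len(true_labels) == len(target_labels) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumed ASR: prediction differs from the true label.
    asr = sum(pred != true for pred, true in zip(predictions, true_labels)) / n
    # Assumed TCS: prediction equals the attacker-chosen target.
    tcs = sum(pred == tgt for pred, tgt in zip(predictions, target_labels)) / n
    return asr, tcs
```

When the target always differs from the true label, TCS can never exceed ASR under these definitions, since every targeted hit is also a misclassification.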

Title: Adaptive Margin RLHF via Preference over Preferences

Authors: Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.22851
Pdf URL: https://arxiv.org/pdf/2509.22851
Copy Paste: [[2509.22851]] Adaptive Margin RLHF via Preference over Preferences(https://arxiv.org/abs/2509.22851)
Keywords: generative
Abstract: Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences are associated with larger margins between responses), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies to gather preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
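To make the role of a per-datapoint margin concrete, here is a minimal sketch of a DPO-style loss with an additive margin. It shows only the generic margin-augmented form, −log σ(β·Δ − m); how DPO-PoP infers the margin from preference-over-preference labels is not described in the abstract, so `margin` is simply an input here. The function name and default `beta` are assumptions.

```python
import math


def dpo_margin_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, margin=0.0):
    """Per-example loss -log(sigmoid(beta * delta - margin)).

    delta is the policy-vs-reference log-ratio of the chosen response
    minus that of the rejected response.
    """
    delta = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    logits = beta * delta - margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

Setting `margin=0.0` recovers the vanilla DPO loss; a larger margin asks the policy to separate the chosen and rejected responses more strongly before the loss becomes small.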

Title: ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

Authors: Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22864
Pdf URL: https://arxiv.org/pdf/2509.22864
Copy Paste: [[2509.22864]] ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models(https://arxiv.org/abs/2509.22864)
Keywords: generation, generative
Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

Title: Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.22925
Pdf URL: https://arxiv.org/pdf/2509.22925
Copy Paste: [[2509.22925]] Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings(https://arxiv.org/abs/2509.22925)
Keywords: generation
Abstract: One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer from two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, preventing post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces discrete tokens with the expected embeddings under the generator's output distribution. Soft embeddings preserve representation fidelity for one-step discrete generators while providing a fully differentiable continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement, along with higher GenEval and HPS scores on text-to-image with reward fine-tuning, and further gains from TTEO.
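The soft-embedding relaxation described in the abstract, replacing a discrete token with the expected embedding under the output distribution, can be sketched in a few lines. This is a plain-Python illustration for one token position, assuming the distribution is a softmax over logits; the function names and the toy embedding table are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, embedding_table):
    """Expected embedding sum_v p(v) * E[v], a continuous surrogate for a token."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]


def hard_embedding(logits, embedding_table):
    """Discrete lookup of the argmax token, the non-differentiable baseline."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return list(embedding_table[best])
```

When the distribution is sharply peaked, the soft embedding is close to the hard lookup, which is consistent with the abstract's claim that the relaxation preserves representation fidelity. In an autodiff framework the soft version would pass gradients back to the logits, whereas the argmax lookup would not.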

Title: FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning

Authors: Chenghan Yang, Peng Zhou, Dong-Sheng Zhang, Yueyun Wang, Hong-Bin Shen, Xiaoyong Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.22930
Pdf URL: https://arxiv.org/pdf/2509.22930
Copy Paste: [[2509.22930]] FishAI 2.0: Marine Fish Image Classification with Multi-modal Few-shot Learning(https://arxiv.org/abs/2509.22930)
Keywords: generation
Abstract: Traditional marine biological image recognition faces challenges of incomplete datasets and unsatisfactory model accuracy, particularly for few-shot conditions of rare species where data scarcity significantly hampers performance. To address these issues, this study proposes an intelligent marine fish recognition framework, FishAI 2.0, integrating multimodal few-shot deep learning techniques with image generation for data augmentation. First, a hierarchical marine fish benchmark dataset, which provides a comprehensive data foundation for subsequent model training, is utilized to train the FishAI 2.0 model. To address the data scarcity of rare classes, the large language model DeepSeek was employed to generate high-quality textual descriptions, which are input into Stable Diffusion 2 for image augmentation through a hierarchical diffusion strategy that extracts latent encoding to construct a multimodal feature space. The enhanced visual-textual datasets were then fed into a Contrastive Language-Image Pre-Training (CLIP) based model, enabling robust few-shot image recognition. Experimental results demonstrate that FishAI 2.0 achieves a Top-1 accuracy of 91.67 percent and a Top-5 accuracy of 97.97 percent at the family level, outperforming baseline CLIP and ViT models by a substantial margin for the minority classes with fewer than 10 training samples. To better apply FishAI 2.0 to real-world scenarios, we also evaluate at the genus and species levels, where FishAI 2.0 achieves Top-1 accuracies of 87.58 percent and 85.42 percent, respectively, demonstrating practical utility. In summary, FishAI 2.0 improves the efficiency and accuracy of marine fish identification and provides a scalable technical solution for marine ecological monitoring and conservation, highlighting its scientific value and practical applicability.
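For readers unfamiliar with the Top-1 and Top-5 figures quoted above, the following sketch shows how top-k accuracy is typically computed from per-class scores. It is a generic illustration and not code from the FishAI 2.0 project; the function name and input layout are assumptions.

```python
def top_k_accuracy(scores, labels, k):
    """Share of samples whose true class is among the k highest-scoring classes.

    scores: list of per-class score lists, one inner list per sample
    labels: list of true class indices
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)
```

Top-1 accuracy is the special case `k=1`, so it can never exceed Top-5 accuracy on the same predictions.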

Title: GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes

Authors: Valentyn Melnychuk, Stefan Feuerriegel
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.22953
Pdf URL: https://arxiv.org/pdf/2509.22953
Copy Paste: [[2509.22953]] GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes(https://arxiv.org/abs/2509.22953)
Keywords: generative
Abstract: Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed GDR-learners are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.

Title: Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

Authors: Luke Guerdan, Justin Whitehouse, Kimberly Truong, Kenneth Holstein, Zhiwei Steven Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22957
Pdf URL: https://arxiv.org/pdf/2509.22957
Copy Paste: [[2509.22957]] Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas(https://arxiv.org/abs/2509.22957)
Keywords: generative
Abstract: As Generative AI (GenAI) systems see growing adoption, a key concern involves the external validity of evaluations, or the extent to which they generalize from lab-based to real-world deployment conditions. Threats to the external validity of GenAI evaluations arise when the source sample of human raters and system outputs used to obtain a system quality estimate differs from the target distribution at deployment time. In this work, we propose a doubly-robust estimation framework designed to address this evaluation sampling bias. Key to our approach is the use of "persona" ratings produced by prompting an LLM evaluator (i.e., an LLM-as-a-judge) to behave as a human rater with specific sociodemographic characteristics. Our doubly-robust framework combines these informative yet imperfect persona ratings with human ratings obtained under evaluation sampling bias to produce statistically valid system quality estimates. In particular, we show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality. We validate our framework theoretically and via a novel Persona Simulation Framework (PSF) designed to systematically manipulate persona quality and the degree of evaluation sampling bias present in source data. Our work provides a principled foundation for combining imperfect persona ratings with human ratings observed under sampling bias to obtain valid system quality estimates.
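The "either model suffices" property in the abstract is the hallmark of doubly-robust estimation. A minimal sketch of the standard form (plug-in prediction on the target population plus a reweighted residual correction on the source sample) is given below. It is not the paper's estimator: `outcome_model` stands in for a rating predictor that might use persona ratings as features, `weight_model` for a density-ratio reweighting model, and both names are hypothetical.

```python
def doubly_robust_mean(source, target_covariates, outcome_model, weight_model):
    """Estimate the target-population mean rating.

    source:            list of (x, human_rating) pairs collected under sampling bias
    target_covariates: list of x drawn from the deployment (target) population
    outcome_model:     hypothetical callable x -> predicted rating
    weight_model:      hypothetical callable x -> p_target(x) / p_source(x)
    """
    if not source or not target_covariates:
        raise ValueError("need both source and target samples")
    # Plug-in term: average predicted rating over the target population.
    plug_in = sum(outcome_model(x) for x in target_covariates) / len(target_covariates)
    # Correction term: reweighted residuals of the predictor on the source sample.
    correction = sum(
        weight_model(x) * (y - outcome_model(x)) for x, y in source
    ) / len(source)
    return plug_in + correction
```

If the outcome model is exact, the residuals vanish and the weights do not matter; if the weights are exact, the correction term repairs any bias in the outcome model. That is the sense in which only one of the two models needs to be of sufficient quality.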
摘要：随着生成的AI（Genai）系统看到的采用不断增长，关键问题涉及评估的外部有效性，或者它们从基于实验室到现实世界的部署条件的推广程度。当用于获得系统质量估计的人类评估者和系统输出的源样本与部署时间的目标分布不同时，就会出现对Genai评估外部有效性的威胁。在这项工作中，我们提出了一个双重稳定估计框架，旨在解决此评估采样偏差。我们方法的关键是通过促使LLM评估者（即LLM-AS-A-a-gudge）促使LLM评估者（即具有特定社会人口统计学特征的人类评估者）产生的“角色”评级。我们的双重稳定框架将这些信息丰富但不完善的角色评级与评估采样偏见获得的人类评级相结合，以产生统计有效的系统质量估计。特别是，我们表明我们的方法会产生有效的系统质量估计，当（i）使用角色评级和在抽样偏见下观察到的源数据进行训练以预测人类评级的模型，或者（ii）纠正采样偏见的重新加权模型具有足够的质量。我们从理论上和通过新颖的角色模拟框架（PSF）来验证我们的框架，旨在系统地操纵角色质量和源数据中存在的评估抽样偏差的程度。我们的工作为将不完善的角色评级与在抽样偏见下观察到的人类评级相结合，以获得有效的系统质量估计，这为原则上的基础提供了基础。
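The doubly-robust combination described in this abstract can be illustrated with the textbook augmented inverse-probability-weighting form: a plug-in average of model predictions on the target, plus a reweighted residual correction from the biased source sample. This is a minimal sketch under assumptions, not the paper's estimator; the function name, argument names, and the use of precomputed predictions and weights are all hypothetical.

```python
def doubly_robust_estimate(target_preds, source_preds, source_ratings, source_weights):
    """Textbook doubly-robust (AIPW-style) mean estimate; illustrative only.

    target_preds   : outcome-model predictions on target-distribution items
                     (assumed here to come from a model fit on persona ratings)
    source_preds   : the same model's predictions on source items
    source_ratings : human ratings observed on source items
    source_weights : importance weights correcting source -> target shift
    """
    if not (len(source_preds) == len(source_ratings) == len(source_weights)):
        raise ValueError("source arrays must have equal length")
    if not target_preds or not source_ratings:
        raise ValueError("inputs must be non-empty")

    # Plug-in term: average predicted rating on the target distribution.
    plug_in = sum(target_preds) / len(target_preds)

    # Correction term: reweighted residuals of the outcome model on source data.
    correction = sum(
        w * (y - m)
        for w, y, m in zip(source_weights, source_ratings, source_preds)
    ) / len(source_ratings)

    return plug_in + correction
```

If the outcome model is exact, the residuals vanish and the plug-in term alone carries the estimate; if the weights are exact, the correction term offsets the outcome model's bias. That is the sense in which only one of the two models needs to be of sufficient quality.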

Title: Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Authors: Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Idan Szpektor, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.22963
Pdf URL: https://arxiv.org/pdf/2509.22963
Copy Paste: [[2509.22963]] Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces(https://arxiv.org/abs/2509.22963)
Keywords: generation
Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
摘要：强化学习（RL）努力地扩展到许多现实世界中常见的大型组合作用空间。本文介绍了一个新颖的框架，用于训练离散扩散模型作为这些复杂环境中的高效政策。我们的关键创新是一个有效的在线培训过程，可确保稳定有效的政策改进。通过利用策略镜下降（PMD）来定义理想的正规目标策略分布，我们将策略更新视为分配匹配问题，训练表达性扩散模型以复制此稳定的目标。这种解耦方法稳定学习并大大提高了训练表现。我们的方法可在各种具有挑战性的组合基准（包括DNA序列产生，带有宏观运动和多代理系统的RL）中实现最先进的结果和卓越的样品效率。实验表明，与其他基线相比，我们的扩散策略具有优异的性能。
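To make the "regularized target policy distribution" concrete, the sketch below computes the standard closed-form target of KL-regularized policy mirror descent over a small discrete action set: the old policy reweighted by exponentiated action values. The KL regularizer, the temperature parameter, and the function name are assumptions for illustration; the abstract does not specify the paper's exact regularizer, and its diffusion-model training step is not shown.

```python
import math


def pmd_target(old_probs, q_values, temperature):
    """Return pi_target(a) proportional to pi_old(a) * exp(Q(a) / temperature).

    Assumes KL-regularized mirror descent over an enumerable action set.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if len(old_probs) != len(q_values):
        raise ValueError("old_probs and q_values must have equal length")

    # Work in log space; actions with zero prior mass stay at zero.
    logits = [
        math.log(p) + q / temperature if p > 0 else -math.inf
        for p, q in zip(old_probs, q_values)
    ]
    peak = max(logits)
    if peak == -math.inf:
        raise ValueError("old_probs must contain at least one positive entry")

    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

A larger temperature keeps the target close to the old policy, while a smaller one concentrates mass on high-value actions. In a combinatorial action space the normalizer cannot be enumerated this way, which is presumably why the target is matched by a generative model.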

Title: Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery

Authors: Jiayin Liu, Yulong Yang, Vineet Bansal, Christine Allen-Blanchette
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2509.23003
Pdf URL: https://arxiv.org/pdf/2509.23003
Copy Paste: [[2509.23003]] Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery(https://arxiv.org/abs/2509.23003)
Keywords: generation
Abstract: From metronomes to celestial bodies, mechanics underpins how the world evolves in time and space. With this in mind, a number of recent neural network models leverage inductive biases from classical mechanics to encourage model interpretability and ensure forecasted states are physical. However, in general, these models are designed to capture the dynamics of a single system with fixed physical parameters, from state-space measurements of a known configuration space. In this paper we introduce Symplectic Phase Space GAN (SPS-GAN), which can capture the dynamics of multiple systems and generalize to unseen physical parameters. Moreover, SPS-GAN does not require prior knowledge of the system configuration space. In fact, SPS-GAN can discover the configuration space structure of the system from arbitrary measurement types (e.g., state-space measurements, video frames). To achieve physically plausible generation, we introduce a novel architecture which embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. To discover the structure of the configuration space, we optimize the conditional time-series GAN objective with an additional physically motivated term that encourages a sparse representation of the configuration space. We demonstrate the utility of SPS-GAN for trajectory prediction, video generation and symmetry discovery. Our approach captures multiple systems and achieves performance on par with supervised models designed for single systems.
摘要：从监测到天体，力学基于世界如何在时空发展。考虑到这一点，许多最近的神经网络模型利用经典力学的电感偏见来鼓励模型可解释性，并确保预测状态是物理的。但是，通常，这些模型旨在捕获具有固定物理参数的单个系统的动力学，从已知配置空间的状态空间测量值。在本文中，我们介绍了符号相位空间gan（sps-gan），该空间可以捕获多个系统的动力学，并推广到从中看不见的物理参数。此外，SPS-GAN不需要对系统配置空间的先验知识。实际上，SPS-GAN可以通过任意测量类型（例如，状态空间测量，视频帧）发现系统的配置空间结构。为了实现物理上合理的一代，我们引入了一种新型的结构，该结构将哈密顿神经网络复发模块嵌入条件gan骨架中。为了发现配置空间的结构，我们使用额外的出色术语来优化条件的时间序列GAN目标，以鼓励对配置空间的稀疏表示。我们演示了SPS-GAN用于轨迹预测，视频生成和对称性发现的实用性。我们的方法捕获了多个系统，并与专为单个系统设计的有监督模型达到了绩效。

Title: ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View

Authors: Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23008
Pdf URL: https://arxiv.org/pdf/2509.23008
Copy Paste: [[2509.23008]] ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View(https://arxiv.org/abs/2509.23008)
Keywords: generation, generative
Abstract: Despite their exceptional generative quality, diffusion models have limited applicability to world modeling tasks, such as novel view generation from sparse inputs. This limitation arises because diffusion models generate outputs in a non-causal manner, often leading to distortions or inconsistencies across views, and making it difficult to incrementally adapt accumulated knowledge to new queries. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ a video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Extensive qualitative and quantitative experiments on public datasets demonstrate that our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models. Our code will be released upon paper acceptance.
摘要：尽管它们具有出色的生成质量，但扩散模型对世界建模任务的适用性有限，例如从稀疏输入中产生的新型视图产生。之所以出现此限制，是因为扩散模型以非毒物方式生成输出，通常会导致视图之间的扭曲或不一致，因此很难将累积的知识逐渐适应新的查询。相比之下，自回归（AR）模型以因果方式运行，基于所有先前生成的令牌生成每个令牌。在这项工作中，我们介绍了\ textbf {arss}，这是一个新型框架，它利用GPT式解码器仅AR模型从单个图像中生成新的视图，并以预定义的相机轨迹为条件。我们采用视频令牌将连续图像序列映射到离散令牌中，并提出一个将相机轨迹转换为3D位置指导的相机编码器。然后，为了提高产生质量，同时保留自回归结构，我们提出了一个自回归变压器模块，该模块随机置入令牌的空间顺序，同时保持其时间顺序。公共数据集上的广泛的定性和定量实验表明，我们的方法的性能比基于扩散模型的最先进的视图合成方法相当或更好。我们的代码将在接受纸上发布。
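The idea of randomly permuting the spatial order of tokens while maintaining their temporal order can be sketched as a generation schedule: frames are visited in time order, and positions inside each frame are visited in a shuffled order. This is only an illustration under assumptions; whether the permutation is drawn independently per frame, and how positions are exposed to the transformer, are not stated in the abstract, and the function name is hypothetical.

```python
import random


def permute_spatial_order(num_frames, tokens_per_frame, seed=None):
    """Return a list of (frame_index, spatial_index) pairs.

    Frames keep their temporal order; spatial positions within each frame
    are shuffled independently (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    order = []
    for frame in range(num_frames):
        spatial = list(range(tokens_per_frame))
        rng.shuffle(spatial)  # random spatial order inside this frame
        order.extend((frame, position) for position in spatial)
    return order
```

Each pair carries its original spatial index, so a model consuming the schedule can still be told which position it is predicting even though positions arrive out of raster order.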

Title: Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation

Authors: Zetian Wu, Tianshuo Zhou, Stefan Lee, Liang Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23011
Pdf URL: https://arxiv.org/pdf/2509.23011
Copy Paste: [[2509.23011]] Geometry-Aware Losses for Structure-Preserving Text-to-Sign Language Generation(https://arxiv.org/abs/2509.23011)
Keywords: generation
Abstract: Sign language translation from text to video plays a crucial role in enabling effective communication for Deaf and hard-of-hearing individuals. A major challenge lies in generating accurate and natural body poses and movements that faithfully convey intended meanings. Prior methods often neglect the anatomical constraints and coordination patterns of human skeletal motion, resulting in rigid or biomechanically implausible outputs. To address this, we propose a novel approach that explicitly models the relationships among skeletal joints (including shoulders, arms, and hands) by incorporating geometric constraints on joint positions, bone lengths, and movement dynamics. During training, we introduce a parent-relative reweighting mechanism to enhance finger flexibility and reduce motion stiffness. Additionally, bone-pose losses and bone-length constraints enforce anatomically consistent structures. Our method narrows the performance gap between the previous best and the ground-truth oracle by 56.51%, and further reduces discrepancies in bone length and movement variance by 18.76% and 5.48%, respectively, demonstrating significant gains in anatomical realism and motion naturalness.
摘要：从文本到视频的手语翻译在为聋人和努力的人带来有效的沟通方面起着至关重要的作用。一个主要的挑战在于产生准确的自然身体姿势和运动，这些姿势和运动忠实地传达了预期的意义。先前的方法通常忽略了人类骨骼运动的解剖约束和协调模式，从而导致刚性或生物力学上令人难以置信的输出。为了解决这个问题，我们提出了一种新颖的方法，该方法通过在关节位置，骨长和运动动力学上的几何约束来明确地模拟骨骼关节之间的关系（包括肩膀，手臂和手）。在训练过程中，我们引入了一种父母相关的重新加权机制，以增强手指柔韧性并降低运动刚度。另外，骨置损失和骨长度约束会在解剖上执行一致的结构。我们的方法将上一位最佳和地面甲骨文之间的性能差距缩小了56.51％，并进一步将骨长度和运动差异的差异分别降低了18.76％和5.48％，在解剖学现实主义和运动自然性方面取得了显着增长。

Title: Planning with Unified Multimodal Models

Authors: Yihao Sun, Zhilong Zhang, Yang Yu, Pierre-Luc Bacon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23014
Pdf URL: https://arxiv.org/pdf/2509.23014
Copy Paste: [[2509.23014]] Planning with Unified Multimodal Models(https://arxiv.org/abs/2509.23014)
Keywords: generative
Abstract: With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.
摘要：凭借大型语言模型（LLM）和视觉语言模型（VLM）的强大推理能力，许多最近的作品都探索了它们进行决策。但是，这些方法中的大多数仅依赖于基于语言的推理，这限制了其推理和做出明智决定的能力。最近，有前途的新方向已经出现了统一的多峰模型（UMMS），该模型支持多模式输入和输出。我们认为，通过通过产生的视觉内容启用推理，这种模型具有更大的决策潜力。为此，我们提出了Uni-Plan，这是一个基于UMMS的计划框架。在此框架内，单个模型同时用作策略，动力学模型和价值功能。此外，为了避免动态预测中的幻觉，我们提出了一种新型方法自我歧视过滤，其中生成模型充当自我歧视器，以滤除无效的动态预测。长期计划计划任务的实验表明，与基于VLM的方法相比，UNI-计划大大提高了成功率，同时还显示出强大的数据可扩展性，不需要专家演示，并且在相同的培训数据尺寸下实现了更好的性能。这项工作为未来的推理和决策研究奠定了基础。

Title: Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy

Authors: Xiafeng Man, Zhipeng Wei, Jingjing Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23022
Pdf URL: https://arxiv.org/pdf/2509.23022
Copy Paste: [[2509.23022]] Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy(https://arxiv.org/abs/2509.23022)
Keywords: generative
Abstract: The widespread deployment of large vision models such as Stable Diffusion raises significant legal and ethical concerns, as these models can memorize and reproduce copyrighted content without authorization. Existing detection approaches often lack robustness and fail to provide rigorous theoretical underpinnings. To address these gaps, we formalize the concept of copyright infringement and its detection from the perspective of Differential Privacy (DP), and introduce the conditional sensitivity metric, a concept analogous to sensitivity in DP, that quantifies the deviation in a diffusion model's output caused by the inclusion or exclusion of a specific training data point. To operationalize this metric, we propose D-Plus-Minus (DPM), a novel post-hoc detection framework that identifies copyright infringement in text-to-image diffusion models. Specifically, DPM simulates inclusion and exclusion processes by fine-tuning models in two opposing directions: learning or unlearning. In addition, to disentangle concept-specific influence from the global parameter shifts induced by fine-tuning, DPM computes confidence scores over orthogonal prompt distributions using statistical metrics. Moreover, to facilitate standardized benchmarking, we also construct the Copyright Infringement Detection Dataset (CIDD), a comprehensive resource for evaluating detection across diverse categories. Our results demonstrate that DPM reliably detects infringing content without requiring access to the original training dataset or text prompts, offering an interpretable and practical solution for safeguarding intellectual property in the era of generative AI.
摘要：稳定扩散等大型视力模型的广泛部署引起了重大的法律和道德问题，因为这些模型可以在未经授权的情况下记住和重现受版权的内容。现有的检测方法通常缺乏鲁棒性，并且无法提供严格的理论基础。为了解决这些差距，我们从差异隐私（DP）的角度将版权侵权的概念及其检测形式化，并引入条件灵敏度度量，这是类似于DP中灵敏度的概念，可以量化由特定训练数据点的包含或排除引起的扩散模型的偏差。为了实现此指标，我们提出了D-Plus-Minus（DPM），这是一种新型的事后检测框架，可以识别文本到图像扩散模型中的版权侵权。具体而言，DPM通过在两个相反的方向上进行微调模型模拟包含和排除过程：学习或学习。此外，要通过微调引起的全局参数变化，DPM使用统计指标来计算对正交及时分布的置信度得分。此外，为了促进标准化的基准测试，我们还构建了版权侵权检测数据集（CIDD），这是一种评估不同类别检测的综合资源。我们的结果表明，DPM可靠地检测到侵权内容，而无需访问原始培训数据集或文本提示，为生成AI时代提供了可解释且实用的解决方案，以保护知识产权。

Title: Tracing the Representation Geometry of Language Models from Pretraining to Post-training

Authors: Melody Zixuan Li, Kumar Krishna Agrawal, Arna Ghosh, Komal Kumar Teru, Adam Santoro, Guillaume Lajoie, Blake A. Richards
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23024
Pdf URL: https://arxiv.org/pdf/2509.23024
Copy Paste: [[2509.23024]] Tracing the Representation Geometry of Language Models from Pretraining to Post-training(https://arxiv.org/abs/2509.23024)
Keywords: generation
Abstract: Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay ($\alpha$-ReQ). With OLMo (1B-7B) and Pythia (160M-12B) models, we uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. The initial "warmup" phase exhibits rapid representational collapse. This is followed by an "entropy-seeking" phase, where the manifold's dimensionality expands substantially, coinciding with peak n-gram memorization. Subsequently, a "compression-seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, a transition marked with significant improvement in downstream task performance. We show these phases can emerge from a fundamental interplay of cross-entropy optimization under skewed token frequencies and representational bottlenecks ($d \ll |V|$). Post-training further transforms geometry: SFT and DPO drive "entropy-seeking" dynamics to integrate specific instructional or preferential data, improving in-distribution performance while degrading out-of-distribution robustness. Conversely, RLVR induces "compression-seeking", enhancing reward alignment but reducing generation diversity.
摘要：诸如损失之类的标准培训指标无法解释大语言模型中复杂能力的出现。我们采用频谱方法来研究跨训练和训练后学习的表示形式的几何形状，测量有效等级（Rankme）和特征谱衰减（$ \ alpha $ -req）。使用Olmo（1b-7b）和Pythia（160M-12B）模型，我们在自回旋预审进期间发现了三个几何阶段的一致的非单调序列。最初的“热身”阶段表现出快速的代表性崩溃。接下来是一个“寻求熵”阶段，其中歧管的维度大大扩展，与峰值n-gram记忆一致。随后，“寻求压缩”阶段施加了各向异性的巩固，在收缩他人的同时选择性地保留了沿主要特征的方差，这一标志着下游任务绩效的过渡。我们表明，这些阶段可以从偏斜的令牌频率和代表性瓶颈（$ d \ ll | v | $）下的跨凝结优化的基本相互作用中出现。训练后进一步转换几何形状：SFT和DPO驱动器“寻求熵”动力学，以整合特定的教学或优先数据，改善分布性能，同时降低分布分布的鲁棒性。相反，RLVR引起了“寻求压缩的”，增强了奖励一致性，但减少了产生的多样性。
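As a concrete reading of the effective-rank measure named in this abstract, the sketch below computes the RankMe quantity from a list of singular values: the exponential of the Shannon entropy of the normalized spectrum. Taking precomputed singular values as input (rather than running an SVD on a representation matrix) and the exact epsilon value are simplifications assumed here; the function name is hypothetical.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """RankMe-style effective rank: exp of the entropy of the normalized spectrum.

    singular_values : non-negative singular values of a representation matrix
    eps             : small constant added to each normalized value for stability
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")

    # Normalize the spectrum so it behaves like a probability distribution.
    probs = [s / total + eps for s in singular_values]

    # Shannon entropy of the normalized spectrum.
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A flat spectrum over k directions gives a value close to k, and a spectrum dominated by one direction gives a value close to 1, so a rise or fall of this number over training is one way to track the expansion and compression phases the abstract describes.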

Title: DPFNAS: Differential Privacy-Enhanced Federated Neural Architecture Search for 6G Edge Intelligence

Authors: Yang Lv, Jin Cao, Ben Niu, Zhe Sun, Fengwei Wang, Fenghua Li, Hui Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23030
Pdf URL: https://arxiv.org/pdf/2509.23030
Copy Paste: [[2509.23030]] DPFNAS: Differential Privacy-Enhanced Federated Neural Architecture Search for 6G Edge Intelligence(https://arxiv.org/abs/2509.23030)
Keywords: generation
Abstract: The Sixth-Generation (6G) network envisions pervasive artificial intelligence (AI) as a core goal, enabled by edge intelligence through on-device data utilization. To realize this vision, federated learning (FL) has emerged as a key paradigm for collaborative training across edge devices. However, the sensitivity and heterogeneity of edge data pose key challenges to FL: parameter sharing risks data reconstruction, and a unified global model struggles to adapt to diverse local distributions. In this paper, we propose a novel federated learning framework that integrates personalized differential privacy (DP) and adaptive model design. To protect training data, we leverage sample-level representations for knowledge sharing and apply a personalized DP strategy to resist reconstruction attacks. To ensure distribution-aware adaptation under privacy constraints, we develop a privacy-aware neural architecture search (NAS) algorithm that generates locally customized architectures and hyperparameters. To the best of our knowledge, this is the first personalized DP solution tailored for representation-based FL with theoretical convergence guarantees. Our scheme achieves strong privacy guarantees for training data while significantly outperforming state-of-the-art methods in model performance. Experiments on benchmark datasets such as CIFAR-10 and CIFAR-100 demonstrate that our scheme improves accuracy by 6.82% over the federated NAS method PerFedRLNAS, while reducing model size to 1/10 and communication cost to 1/20.
摘要：第六代（6G）网络将普遍的人工智能（AI）设想为核心目标，这是通过在设备数据利用率上通过边缘智能实现的。为了实现这一愿景，联邦学习（FL）已成为跨越边缘设备协作培训的关键范式。但是，边缘数据的敏感性和异质性对FL构成了关键挑战：参数共享风险数据重建，统一的全球模型努力努力适应各种本地分布。在本文中，我们提出了一个新颖的联合学习框架，该框架整合了个性化的差异隐私（DP）和自适应模型设计。为了保护培训数据，我们利用样本级别表示知识共享，并应用个性化的DP策略来抵制重建攻击。为了确保在隐私限制下的分布感知适应，我们开发了一种隐私感知的神经体系结构搜索（NAS）算法，该算法生成本地自定义的体系结构和超参数。据我们所知，这是第一个针对基于代表的FL量身定制的个性化DP解决方案，并提供了理论融合的保证。我们的计划可实现强大的隐私保证，以确保培训数据，同时在模型性能中的最先进方法均明显优于最先进的方法。在基准数据集（例如CIFAR-10和CIFAR-100）上进行的实验表明，我们的方案比联合NAS方法的Perferllnas提高了准确性6.82 \％，而将模型大小降低到1/10，并且通信成本将其降低到1/20。

Title: Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Authors: Lin Long, Changdae Oh, Seongheon Park, Yixuan Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23050
Pdf URL: https://arxiv.org/pdf/2509.23050
Copy Paste: [[2509.23050]] Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding(https://arxiv.org/abs/2509.23050)
Keywords: generation
Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), i.e., textual patterns memorized during pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
摘要：大型视觉模型（LVLMS）在多模式任务上实现了强大的性能，但它们经常默认为语言（LP） - 记忆的文本模式来自预训练，同时不足以利用视觉证据。 LP的先前分析主要依赖于输入输出探测，这未能揭示有关视力何时以及如何影响模型行为的内部机制。为了解决这一差距，我们通过插入链的镜头进行了对语言的首次系统分析，该镜头检查了LVLM中的层表示动态。我们的分析揭示了一种普遍现象：每个模型都表现出视觉整合点（VIP），视觉信息开始有意义地重塑隐藏的表示并影响解码。在此观察结果的基础上，我们介绍了总视觉集成（TVI）估计器，该估计值汇总了VIP以外的表示距离，以量化视觉查询对响应产生的影响。在跨越9个现代LVLM和6个基准的54个模型数据组合中，我们证明了VIP始终出现，并且TVI可靠地预测了先前的语言力量。这提供了一个原则上的工具包，用于诊断和理解LVLMS中的语言。

Title: Activation Matching for Explanation Generation

Authors: Pirzada Suhail, Aditya Anand, Amit Sethi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23051
Pdf URL: https://arxiv.org/pdf/2509.23051
Copy Paste: [[2509.23051]] Activation Matching for Explanation Generation(https://arxiv.org/abs/2509.23051)
Keywords: generation
Abstract: In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations of $x$. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision making of the underlying model.
摘要：在本文中，我们介绍了一种基于激活匹配的方法，以在任何给定的图像上对预审计的分类器的决策产生最少，忠实的解释。给定输入图像\（x \）和一个冷冻模型\（f \），我们训练轻巧的自动编码器输出二进制掩码\（m \），以使解释\（e = m \ odot x \）保留模型的预测和\（x \）的中间激活。我们的目标组合：（i）多层激活与KL差异匹配以对齐分布和交叉凝集，以保留图像和解释的顶级1标签；（ii）面具先验 - L1面积最小的区域，酥脆0/1口罩的二进制罚款以及紧凑性的总变化；（iii）忠诚和必要性的绑架限制。这些目标共同产生了较小的，人性化的面具，可以保留分类者行为，同时丢弃无关的输入区域，从而为基础模型的决策提供实用和忠实的极简主义解释。
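The three mask priors listed in item (ii) of this abstract can be written down directly for a soft mask with values in [0, 1]. The sketch below is illustrative only: the specific binarization form m(1 - m), the anisotropic total-variation definition, and the per-pixel normalization are common choices assumed here rather than details taken from the paper, and the function name is hypothetical.

```python
def mask_prior_terms(mask):
    """Compute (area, binarization, total_variation) for a 2D soft mask.

    mask : list of rows, each a list of floats in [0, 1]
    All three terms are averaged over the number of pixels.
    """
    height, width = len(mask), len(mask[0])
    num_pixels = height * width

    # L1 area: smaller masks are preferred (minimality).
    area = sum(abs(v) for row in mask for v in row) / num_pixels

    # Binarization penalty: zero when every value is exactly 0 or 1.
    binarization = sum(v * (1.0 - v) for row in mask for v in row) / num_pixels

    # Anisotropic total variation: penalizes differences between neighbors,
    # which favors compact, contiguous regions.
    total_variation = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                total_variation += abs(mask[i + 1][j] - mask[i][j])
            if j + 1 < width:
                total_variation += abs(mask[i][j + 1] - mask[i][j])
    total_variation /= num_pixels

    return area, binarization, total_variation
```

In a training objective these terms would typically be summed with separate weights alongside the activation-matching losses; those weights are not given in the abstract.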

Title: Dynamics of Learning: Generative Schedules from Latent ODEs

Authors: Matt L. Sampson, Peter Melchior
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23052
Pdf URL: https://arxiv.org/pdf/2509.23052
Copy Paste: [[2509.23052]] Dynamics of Learning: Generative Schedules from Latent ODEs(https://arxiv.org/abs/2509.23052)
Keywords: generative
Abstract: The learning rate schedule is one of the most impactful aspects of neural network optimization, yet most schedules either follow simple parametric functions or react only to short-term training signals. None of them are supported by a comprehensive temporal view of how well neural networks actually train. We present a new learning rate scheduler that models the training performance of neural networks as a dynamical system. It leverages training runs from a hyperparameter search to learn a latent representation of the training process. Given current training metrics, it predicts the future learning rate schedule with the best long-term validation performance. Our scheduler generalizes beyond previously observed training dynamics and creates specialized schedules that deviate noticeably from common parametric functions. It achieves SOTA results for image classification with CNN and ResNet models as well as for next-token prediction with a transformer model. The trained models are located in flatter regions of the loss landscape and thus provide better generalization than those trained with other schedules. Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top of ML experiment-tracking platforms. An implementation of our scheduler will be made available after acceptance.
摘要：学习率计划是神经网络优化的最有影响力的方面之一，但是大多数时间表都遵循简单的参数功能，或者仅对短期培训信号做出反应。他们都没有得到神经网络实际训练的全面时间看法的支持。我们提出了一个新的学习率调度程序，该调度程序将神经网络作为动态系统的训练性能进行建模。它利用培训从超参数搜索进行，以了解培训过程的潜在表示。鉴于当前的培训指标，它可以预测未来的学习率计划，并具有最佳的长期验证绩效。我们的调度程序概括了先前观察到的训练动力学，并创建专门的时间表，这些计划明显偏离了共同的参数功能。它可以通过CNN和Resnet模型以及使用变压器模型的下一步预测来实现SOTA结果。训练有素的模型位于损失景观的平坦地区，因此比接受其他时间表的训练的概括更好。我们的方法是计算有效的，优化的方法，并且可以轻松地放在ML实验跟踪平台的顶部。接受调度程序的实施将在接受后提供。

Title: Follow-Your-Preference: Towards Preference-Aligned Image Inpainting

Authors: Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23082
Pdf URL: https://arxiv.org/pdf/2509.23082
Copy Paste: [[2509.23082]] Follow-Your-Preference: Towards Preference-Aligned Image Inpainting(https://arxiv.org/abs/2509.23082)
Keywords: generative
Abstract: This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, make them susceptible to reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Building on these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: this https URL.
摘要：本文研究了与偏好对齐的图像。我们没有引入一种新颖的方法，而是回到基础知识和重新审视达到这种一致性方面的基本问题。我们利用了对齐培训的突出直接偏好优化方法，并采用公共奖励模型来构建偏好培训数据集。实验是在九个奖励模型，两个基准和两个具有不同结构和生成算法的基线模型之间进行的。我们的主要发现如下：（1）大多数奖励模型为构建偏好数据提供有效的奖励分数，即使其中一些不是可靠的评估者。（2）偏好数据证明了跨模型和基准测试的候选缩放和样品缩放的强劲趋势。（3）奖励模型中的可观察偏见，尤其是在亮度，成分和配色方案中，使它们容易引起奖励黑客。（4）这些模型的简单合奏通过减轻此类偏见产生了可靠和可推广的结果。基于这些观察结果，我们的对齐模型在标准指标，GPT-4评估和人类评估之间大大优于先前模型，而没有对模型结构进行任何更改或使用新数据集的任何更改。我们希望我们的工作能够设定一个简单而坚实的基线，从而推动这一有前途的前沿。我们的代码是开源的：此HTTPS URL。

Title: Causally-Enhanced Reinforcement Policy Optimization

Authors: Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23095
Pdf URL: https://arxiv.org/pdf/2509.23095
Copy Paste: [[2509.23095]] Causally-Enhanced Reinforcement Policy Optimization(https://arxiv.org/abs/2509.23095)
Keywords: generation
Abstract: Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable trade-off between accuracy and coherence. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
摘要：接受加强目标训练的大型语言模型（LLM）通常通过快捷策略实现表面上正确的答案，将正确的输出与虚假或不忠的推理配对，并在小因果扰动下降级。我们介绍了一个因果增强政策优化（CE-PO），这是一个摘要奖励框架框架，以从提示（z）到理由（x）的一代途径的因果相干来增强策略优化，以替代因果关系（x）。 CE-PO估计具有基于Jacobian的敏感性的模型内部影响，反合对这些信号进行了强化，以抑制滋扰线索，并通过Minkowski（Power-Mean）组合仪与任务准确的反馈融合，从而将其连贯得分融合在一起，从而在准确性和连贯性和连贯之间揭示了单一的可调性和连贯性。统一的奖励与PPO/GRPO集成在一起，而没有建筑变化。在推理基准和因果压力测试中，CE-PO降低了奖励黑客攻击和不忠的思维链，同时提高了对相关性促成逆转和轻度反事实编辑的鲁棒性，所有这些都非常准确。四个数据集的实验结果表明，CE-PO平均提高了基准的准确性5.49％（最高9.58％），同时提高了对相关性促成的鲁棒性和轻度反事实编辑的鲁棒性。
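The Minkowski (power-mean) combiner mentioned in this abstract can be illustrated with the standard weighted power mean of two positive scores. This is a sketch under assumptions: the abstract does not say which parameter is the tunable one or how scores are scaled, so the exponent p, the mixing weight, the requirement that both scores be positive, and the function name are all hypothetical choices.

```python
import math


def power_mean_reward(accuracy, coherence, p=1.0, weight=0.5):
    """Weighted power mean of an accuracy score and a coherence score.

    p = 1 gives the arithmetic mean, p -> 0 the geometric mean, p = -1 the
    harmonic mean; lower p lets the weaker of the two scores dominate.
    """
    if accuracy <= 0 or coherence <= 0:
        raise ValueError("scores must be positive for this sketch")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    if p == 0:
        # Limit case: weighted geometric mean.
        return math.exp(
            weight * math.log(accuracy) + (1.0 - weight) * math.log(coherence)
        )
    return (weight * accuracy ** p + (1.0 - weight) * coherence ** p) ** (1.0 / p)
```

With accuracy 0.8 and coherence 0.4, moving p from 1 toward negative values pulls the combined reward down toward 0.4, which is one way a single exponent can penalize answers that are correct but poorly supported by their rationale.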

Title: Stochastic Interpolants via Conditional Dependent Coupling

Authors: Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23122
Pdf URL: https://arxiv.org/pdf/2509.23122
Copy Paste: [[2509.23122]] Stochastic Interpolants via Conditional Dependent Coupling(https://arxiv.org/abs/2509.23122)
Keywords: generation, generative
Abstract: Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.
摘要：现有的图像生成模型在计算和忠诚度之间的权衡面临着关键的挑战。具体而言，依赖于预计的变异自动编码器（VAE）的模型遭受信息丢失，有限的细节和无法支持端到端培训的损失。相比之下，直接在像素空间中运行的模型产生了过度的计算成本。尽管级联模型可以减轻计算成本，但阶段的分离可以防止有效的端到端优化，缩减知识共享，并且通常会导致每个阶段内的分布学习不准确。为了应对这些挑战，我们根据提出的条件依赖耦合策略引入了统一的多阶段生成框架。它在多个阶段将生成过程分解为插值轨迹，从而确保在实现端到端优化的同时确保精确的分布学习。重要的是，整个过程被建模为单个统一扩散变压器，消除了对脱节模块的需求，并启用了知识共享。广泛的实验表明，我们的方法在多种分辨率上实现了高忠诚和效率。

Title: Impute-MACFM: Imputation based on Mask-Aware Flow Matching

Authors: Dengyi Liu, Honggang Wang, Hua Fang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23126
Pdf URL: https://arxiv.org/pdf/2509.23126
Copy Paste: [[2509.23126]] Impute-MACFM: Imputation based on Mask-Aware Flow Matching(https://arxiv.org/abs/2509.23126)
Keywords: generative
Abstract: Tabular data are central to many applications, especially longitudinal data in healthcare, where missing values are common, undermining model fidelity and reliability. Prior imputation methods either impose restrictive assumptions or struggle with complex cross-feature structure, while recent generative approaches suffer from instability and costly inference. We propose Impute-MACFM, a mask-aware conditional flow matching framework for tabular imputation that addresses three missingness mechanisms: missing completely at random, missing at random, and missing not at random. Its mask-aware objective builds trajectories only on missing entries while constraining predicted velocity to remain near zero on observed entries, using flexible nonlinear schedules. Impute-MACFM combines: (i) stability penalties on observed positions, (ii) consistency regularization enforcing local invariance, and (iii) time-decayed noise injection for numeric features. Inference uses constraint-preserving ordinary differential equation integration with per-step projection to fix observed values, optionally aggregating multiple trajectories for robustness. Across diverse benchmarks, Impute-MACFM achieves state-of-the-art results while delivering more robust, efficient, and higher-quality imputation than competing approaches, establishing flow matching as a promising direction for tabular missing-data problems, including longitudinal data.
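Summary: Tabular data are central to many applications, especially longitudinal healthcare data, where missing values are common and undermine model fidelity and reliability. Earlier imputation methods either impose restrictive assumptions or struggle with complex cross-feature structure, while recent generative approaches suffer from instability and costly inference. We propose Impute-MACFM, a mask-aware conditional flow matching framework for tabular imputation that addresses three missingness mechanisms: missing completely at random, missing at random, and missing not at random. Its mask-aware objective builds trajectories only on missing entries while constraining the predicted velocity to stay near zero on observed entries, using flexible nonlinear schedules. Impute-MACFM combines (i) stability penalties on observed positions, (ii) consistency regularization that enforces local invariance, and (iii) time-decayed noise injection for numeric features. Inference uses constraint-preserving ordinary differential equation integration with per-step projection to fix observed values, optionally aggregating multiple trajectories for robustness. Across diverse benchmarks, Impute-MACFM achieves state-of-the-art results while delivering more robust, efficient, and higher-quality imputation than competing approaches, establishing flow matching as a promising direction for tabular missing-data problems, including longitudinal data.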
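
The mask-aware objective and the projected ODE inference described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the squared-error form of the loss, the plain Euler integrator, and the toy velocity field are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def mask_aware_loss(pred_v: Vector, target_v: Vector, observed: List[bool],
                    stability_weight: float = 1.0) -> float:
    """Assumed loss form: flow-matching error on missing entries plus a
    penalty keeping the predicted velocity near zero on observed entries."""
    missing_err = [(p - t) ** 2 for p, t, o in zip(pred_v, target_v, observed) if not o]
    observed_pen = [p ** 2 for p, o in zip(pred_v, observed) if o]
    fm = sum(missing_err) / len(missing_err) if missing_err else 0.0
    stab = sum(observed_pen) / len(observed_pen) if observed_pen else 0.0
    return fm + stability_weight * stab


def impute(x_obs: Vector, observed: List[bool],
           velocity: Callable[[Vector, float], Vector],
           x_init: Vector, num_steps: int = 50) -> Vector:
    """Euler integration from t=0 to t=1 with per-step projection:
    after every step, observed entries are reset to their known values."""
    x = [xo if o else xi for xo, xi, o in zip(x_obs, x_init, observed)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity(x, step * dt)  # hypothetical stand-in for a trained model
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        x = [xo if o else xi for xo, xi, o in zip(x_obs, x, observed)]
    return x


# Toy velocity field (hypothetical): pull every entry toward a fixed target.
target = [1.0, 2.0, 3.0]
toy_velocity = lambda x, t: [ti - xi for ti, xi in zip(target, x)]

# The middle entry is missing; its value in x_obs is only a placeholder.
result = impute([1.0, 0.0, 3.0], [True, False, True], toy_velocity, [0.0, 10.0, 0.0])
```

In this sketch the projection step is what guarantees that observed values come out unchanged, regardless of how well the velocity model behaves on those coordinates.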

Title: Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

Authors: Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23141
Pdf URL: https://arxiv.org/pdf/2509.23141
Copy Paste: [[2509.23141]] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents(https://arxiv.org/abs/2509.23141)
Keywords: generation
Abstract: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.
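Summary: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still cannot tackle complex tasks that require multi-step reasoning and domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy: they are confined to RGB perception and shallow reasoning, and they lack systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. By dynamically invoking expert tools and models across modalities, Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images spanning spectrum, products, and RGB modalities, equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments across different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and the potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.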

Title: WeatherCycle: Unpaired Multi-Weather Restoration via Color Space Decoupled Cycle Learning

Authors: Wenxuan Fang, Jiangwei Weng, Jianjun Qian, Jian Yang, Jun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23150
Pdf URL: https://arxiv.org/pdf/2509.23150
Copy Paste: [[2509.23150]] WeatherCycle: Unpaired Multi-Weather Restoration via Color Space Decoupled Cycle Learning(https://arxiv.org/abs/2509.23150)
Keywords: restoration
Abstract: Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. While existing methods often rely on task-specific physical priors, their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose WeatherCycle, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle employs a lumina-chroma decomposition strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a Lumina Degradation Guidance Module (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. Additionally, we incorporate a Difficulty-Aware Contrastive Regularization (DACR) module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across several multi-weather datasets demonstrate that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.
Summary: Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. Existing methods often rely on task-specific physical priors, and their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose WeatherCycle, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle uses a lumina-chroma decomposition strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a Lumina Degradation Guidance Module (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. In addition, we incorporate a Difficulty-Aware Contrastive Regularization (DACR) module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across several multi-weather datasets show that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.

Title: CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Authors: Prashant Govindarajan, Mathieu Reymond, Antoine Clavaud, Mariano Phielipp, Santiago Miret, Sarath Chandar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23156
Pdf URL: https://arxiv.org/pdf/2509.23156
Copy Paste: [[2509.23156]] CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning(https://arxiv.org/abs/2509.23156)
Keywords: generation, generative
Abstract: In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.
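Summary: In silico design and optimization of new materials relies primarily on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation, mainly because of DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value-based and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties such as the band gap, bulk modulus, and density, which are calculated directly from DFT in the environment. While none of the algorithms we benchmark solves all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary machine learning research motivated by real-world applications.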

Title: Dense associative memory on the Bures-Wasserstein space

Authors: Chandan Tankala, Krishnakumar Balasubramanian
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23162
Pdf URL: https://arxiv.org/pdf/2509.23162
Copy Paste: [[2509.23162]] Dense associative memory on the Bures-Wasserstein space(https://arxiv.org/abs/2509.23162)
Keywords: generative
Abstract: Dense associative memories (DAMs) store and retrieve patterns via energy-functional fixed points, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.
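Summary: Dense associative memories (DAMs) store and retrieve patterns via fixed points of an energy functional, but existing models are limited to vector representations. We extend DAMs to probability distributions equipped with the 2-Wasserstein distance, focusing mainly on the Bures-Wasserstein class of Gaussian densities. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics that aggregates optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity, provide quantitative retrieval guarantees under Wasserstein perturbations, and validate the model on synthetic and real-world distributional tasks. This work elevates associative memory from vectors to full distributions, bridging classical DAMs with modern generative modeling and enabling distributional storage and retrieval in memory-augmented learning.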
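
For intuition, the Gibbs-weighted retrieval dynamics can be sketched in the simplest Bures-Wasserstein setting, one-dimensional Gaussians. There the squared 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2) is (m1 - m2)^2 + (s1 - s2)^2, and a weighted average of optimal transport maps pushes the query to a Gaussian whose mean and standard deviation are the weighted averages of the stored ones. The inverse temperature `beta`, the softmax form of the weights, and all function names are assumptions for illustration, not the paper's exact definitions.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation), 1-D case only


def w2_squared(a: Gaussian, b: Gaussian) -> float:
    """Squared 2-Wasserstein distance between two 1-D Gaussians."""
    (m1, s1), (m2, s2) = a, b
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def gibbs_weights(query: Gaussian, stored: List[Gaussian], beta: float) -> List[float]:
    """Softmax of -beta * W2^2 (assumed weight form), computed stably."""
    scores = [-beta * w2_squared(query, p) for p in stored]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_step(query: Gaussian, stored: List[Gaussian], beta: float) -> Gaussian:
    """One update: push the query through the Gibbs-weighted average of the
    optimal transport maps to the stored Gaussians (closed form in 1-D)."""
    w = gibbs_weights(query, stored, beta)
    mean = sum(wi * m for wi, (m, _) in zip(w, stored))
    std = sum(wi * s for wi, (_, s) in zip(w, stored))
    return (mean, std)


def retrieve(query: Gaussian, stored: List[Gaussian],
             beta: float = 5.0, num_iters: int = 20) -> Gaussian:
    for _ in range(num_iters):
        query = retrieve_step(query, stored, beta)
    return query


memories = [(0.0, 1.0), (5.0, 2.0)]
recalled = retrieve((4.6, 1.8), memories)  # perturbed copy of the second memory
```

With a large enough `beta`, the weights concentrate on the nearest stored distribution, so a perturbed query settles onto that memory.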

Title: Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction

Authors: Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin, Xinrui Ju, Shiqi Wang, Yibo Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23169
Pdf URL: https://arxiv.org/pdf/2509.23169
Copy Paste: [[2509.23169]] Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction(https://arxiv.org/abs/2509.23169)
Keywords: generation, generative
Abstract: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, as it demands the harmonization of dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that leverages extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is the multi-task learning-based and keypoint-aware deep generative model, which could encode complex human motion via compact 3D keypoints and leverage these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. Additionally, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments demonstrate that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional/generative video codecs, whilst enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.
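Summary: For bandwidth-constrained multimedia applications, simultaneously achieving ultra-low bitrate human video compression and accurate vertex prediction remains a critical challenge, because it requires harmonizing dynamic motion modeling, detailed appearance synthesis, and geometric consistency. To address this challenge, we propose Sparse2Dense, a keypoint-driven generative framework that uses extremely sparse 3D keypoints as compact transmitted symbols to enable ultra-low bitrate human video compression and precise human vertex prediction. The key innovation is a multi-task learning-based, keypoint-aware deep generative model that encodes complex human motion via compact 3D keypoints and uses these sparse keypoints to estimate dense motion for video synthesis with temporal coherence and realistic textures. In addition, a vertex predictor is integrated to learn human vertex geometry through joint optimization with video generation, ensuring alignment between visual content and geometric structure. Extensive experiments show that the proposed Sparse2Dense framework achieves competitive compression performance for human video over traditional and generative video codecs, while enabling precise human vertex prediction for downstream geometry applications. As such, Sparse2Dense is expected to facilitate bandwidth-efficient human-centric media transmission, such as real-time motion analysis, virtual human animation, and immersive entertainment.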

Title: Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss

Authors: Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23194
Pdf URL: https://arxiv.org/pdf/2509.23194
Copy Paste: [[2509.23194]] Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss(https://arxiv.org/abs/2509.23194)
Keywords: generation
Abstract: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at this http URL.
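Summary: Unsupervised online 3D instance segmentation is a fundamental yet challenging task, because it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that uses both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at this http URL.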

Title: SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts

Authors: Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, Jinsong Su
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23232
Pdf URL: https://arxiv.org/pdf/2509.23232
Copy Paste: [[2509.23232]] SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts(https://arxiv.org/abs/2509.23232)
Keywords: generation
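Abstract: Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including GSM8K, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at this https URL.
Summary: Large language models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, training remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We observe that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, which wastes computation. To address this, we propose SPEC-RL, a novel framework that integrates speculative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including GSM8K, MATH-500, OlympiadBench, MMLU-STEM, and others, show that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at this https URL.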
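
The idea of reusing a previous rollout as a speculative prefix can be sketched as follows. This is a simplified illustration, not SPEC-RL's actual algorithm: the acceptance rule min(1, p_new / p_old) is borrowed from standard speculative decoding, the policies are hypothetical callables that return a next-token distribution, and after the first rejection the sketch simply samples from the current policy rather than from a corrected residual distribution.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical policy interface: context tokens -> next-token probabilities.
Policy = Callable[[Sequence[str]], Dict[str, float]]


def verified_prefix_length(prompt: Sequence[str], old_tokens: Sequence[str],
                           old_policy: Policy, new_policy: Policy,
                           rng: random.Random) -> int:
    """Verify the old rollout token by token under the current policy and
    return how many leading tokens are accepted."""
    context = list(prompt)
    for i, tok in enumerate(old_tokens):
        p_old = old_policy(context).get(tok, 0.0)
        p_new = new_policy(context).get(tok, 0.0)
        accept_prob = min(1.0, p_new / p_old) if p_old > 0.0 else 0.0
        if rng.random() >= accept_prob:
            return i  # first rejected position
        context.append(tok)
    return len(old_tokens)


def speculative_rollout(prompt: Sequence[str], old_tokens: Sequence[str],
                        old_policy: Policy, new_policy: Policy,
                        max_len: int, rng: random.Random,
                        eos: str = "<eos>") -> Tuple[List[str], int]:
    """Keep the verified prefix, then continue generating with the new policy."""
    keep = verified_prefix_length(prompt, old_tokens, old_policy, new_policy, rng)
    tokens = list(old_tokens[:keep])
    while len(tokens) < max_len and (not tokens or tokens[-1] != eos):
        dist = new_policy(list(prompt) + tokens)
        choices, probs = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=probs, k=1)[0])
    return tokens, keep


# Toy policies (hypothetical) to exercise the two extreme cases.
same_policy = lambda ctx: {"a": 0.5, "<eos>": 0.5}
shifted_policy = lambda ctx: {"<eos>": 1.0}

reused, kept_all = speculative_rollout(
    ["q"], ["a", "a", "<eos>"], same_policy, same_policy, 8, random.Random(0))
fresh, kept_none = speculative_rollout(
    ["q"], ["a", "<eos>"], same_policy, shifted_policy, 8, random.Random(0))
```

When the policy has not changed, the whole old rollout is reused and no new tokens are generated; when the policy has moved away from the old tokens, the prefix is rejected early and generation restarts from the current policy.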

Title: More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression

Authors: Shayan Alahyari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23240
Pdf URL: https://arxiv.org/pdf/2509.23240
Copy Paste: [[2509.23240]] More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression(https://arxiv.org/abs/2509.23240)
Keywords: generation
Abstract: In many real-world regression tasks, the data distribution is heavily skewed, and models learn predominantly from abundant majority samples while failing to predict minority labels accurately. While imbalanced classification has been extensively studied, imbalanced regression remains relatively unexplored. Deep imbalanced regression (DIR) represents cases where the input data are high-dimensional and unstructured. Although several data-level approaches for tabular imbalanced regression exist, deep imbalanced regression currently lacks dedicated data-level solutions suitable for high-dimensional data and relies primarily on algorithmic modifications. To fill this gap, we propose LatentDiff, a novel framework that uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space. LatentDiff is computationally efficient and applicable across diverse data modalities, including images, text, and other high-dimensional inputs. Experiments on three DIR benchmarks demonstrate substantial improvements in minority regions while maintaining overall accuracy.
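Summary: In many real-world regression tasks, the data distribution is heavily skewed, and models learn predominantly from abundant majority samples while failing to predict minority labels accurately. While imbalanced classification has been studied extensively, imbalanced regression remains relatively unexplored. Deep imbalanced regression (DIR) covers cases where the input data are high-dimensional and unstructured. Although several data-level approaches exist for tabular imbalanced regression, deep imbalanced regression currently lacks dedicated data-level solutions suitable for high-dimensional data and relies primarily on algorithmic modifications. To fill this gap, we propose LatentDiff, a novel framework that uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space. LatentDiff is computationally efficient and applicable across diverse data modalities, including images, text, and other high-dimensional inputs. Experiments on three DIR benchmarks show substantial improvements in minority regions while maintaining overall accuracy.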

Title: NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning

Authors: Raviteja Anantha, Soheil Hor, Teodor Nicola Antoniu, Layne C. Price
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23252
Pdf URL: https://arxiv.org/pdf/2509.23252
Copy Paste: [[2509.23252]] NanoFlux: Adversarial Dual-LLM Evaluation and Distillation For Multi-Domain Reasoning(https://arxiv.org/abs/2509.23252)
Keywords: generation
Abstract: We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations that target specific reasoning capabilities. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.
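Summary: We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, in which adversarially generated datasets containing fewer than 200 examples outperform conventional fine-tuning approaches. The framework uses a competitive dynamic between models that alternate as Attacker and Defender, supervised by a tool-augmented Judge, to synthesize multi-step questions with explanatory annotations that target specific reasoning capabilities. Compared with full-benchmark fine-tuning, fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains: +5.9% on mathematical reasoning (GSMHard), +3.6% on scientific reasoning (GenomeBench), and +16.6% on medical reasoning (MultiMedQA), while reducing computational requirements by 3-14x. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, suggesting that future model improvements may lie in the intelligent synthesis of small, precisely targeted training datasets.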
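
One plausible reading of "embedding-based novelty filtering" is a greedy filter that drops a candidate question when its embedding is too similar to one already kept. The sketch below assumes cosine similarity and a fixed threshold; both choices, and the function names, are illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novelty_filter(embeddings: List[Sequence[float]],
                   max_similarity: float = 0.95) -> List[int]:
    """Greedily keep the indices of items whose similarity to every
    previously kept item stays below the (assumed) threshold."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_similarity for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector nearly duplicates the first.
kept_indices = novelty_filter([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
```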

Title: OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting

Authors: Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23258
Pdf URL: https://arxiv.org/pdf/2509.23258
Copy Paste: [[2509.23258]] OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting(https://arxiv.org/abs/2509.23258)
Keywords: generative
Abstract: Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.
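Summary: Sparse-view novel view synthesis is fundamentally ill-posed because of severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse-view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first uses a pre-trained 3D-aware diffusion model to synthesize novel views that propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of the generated views, using its attention maps to reveal regions where the generated views are well supported by multi-view evidence versus regions of high uncertainty caused by occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, and outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.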
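
A minimal sketch of an uncertainty-weighted photometric loss follows, assuming per-pixel uncertainties in [0, 1] and a weight of (1 - uncertainty) on an L1 error. The abstract does not give the exact weighting or loss form, so both are assumptions here, and images are flattened to plain lists to keep the example self-contained.

```python
from typing import Sequence


def uncertainty_weighted_l1(rendered: Sequence[float],
                            generated: Sequence[float],
                            uncertainty: Sequence[float]) -> float:
    """Weighted mean absolute error between a rendered view and a generated
    (proposed) view. Pixels the oracle marks as uncertain contribute less;
    the (1 - u) weighting is an assumed form."""
    weights = [1.0 - u for u in uncertainty]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # nothing is trusted, so there is no supervision signal
    weighted_error = sum(w * abs(r - g)
                         for w, r, g in zip(weights, rendered, generated))
    return weighted_error / total_weight


# Toy example: the first pixel is fully uncertain and is ignored.
loss = uncertainty_weighted_l1([0.0, 0.0], [1.0, 0.5], [1.0, 0.0])
```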

Title: SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Authors: Yihao Ding, Soyeon Caren Han, Yanbei Jiang, Yan Li, Zechuan Li, Yifan Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23273
Pdf URL: https://arxiv.org/pdf/2509.23273
Copy Paste: [[2509.23273]] SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction(https://arxiv.org/abs/2509.23273)
Keywords: generation, generative
Abstract: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.
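Summary: Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges because of the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc uses a robust synthetic data generation workflow, with structural information extraction and domain-specific query generation, to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. The framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.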

Title: Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing

Authors: Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal, Siddharth Roheda
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23279
Pdf URL: https://arxiv.org/pdf/2509.23279
Copy Paste: [[2509.23279]] Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing(https://arxiv.org/abs/2509.23279)
Keywords: generation
Abstract: The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating deceptive or malicious content creation. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection to block motion remains underexplored. In this work, we introduce Vid-Freeze - a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving semantic fidelity of the input image. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting the importance of attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.
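Summary: The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating the creation of deceptive or malicious content. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection that blocks motion remains underexplored. In this work, we introduce Vid-Freeze, a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving the semantic fidelity of the input image. The resulting immunized images produce stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.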

Title: Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection

Authors: Minsun Jeon, Simon S. Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23289
Pdf URL: https://arxiv.org/pdf/2509.23289
Copy Paste: [[2509.23289]] Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection(https://arxiv.org/abs/2509.23289)
Keywords: generative
Abstract: The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: this https URL
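Summary: The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but it also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and show that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that occurs naturally in camera-captured images because of lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure, which makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: this https URL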

Title: Seeing the Unseen in Low-light Spike Streams

Authors: Liwen Hu, Yang Li, Mianzhi Liu, Yijia Guo, Shenghao Xie, Ziluo Ding, Tiejun Huang, Lei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23304
Pdf URL: https://arxiv.org/pdf/2509.23304
Copy Paste: [[2509.23304]] Seeing the Unseen in Low-light Spike Streams(https://arxiv.org/abs/2509.23304)
Keywords: generation, generative
Abstract: The spike camera, a type of neuromorphic sensor with high temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, a spike camera continuously accumulates photons and fires asynchronous spike streams. Because of this unique data modality, spike streams require reconstruction methods to become perceptible to the human eye. However, many methods struggle to handle spike streams in low-light high-speed scenarios due to severe noise and sparse information. In this work, we propose Diff-SPK, the first diffusion-based reconstruction method for spike cameras. Diff-SPK effectively leverages generative priors to supplement texture information in low-light conditions. Specifically, it first employs an Enhanced Texture from Inter-spike Interval (ETFI) to aggregate sparse information from low-light spike streams. Then, ETFI serves as a conditioning input for ControlNet to generate the high-speed scenes. To improve the quality of results, we introduce an ETFI-based feature fusion module during the generation process. Moreover, we establish the first bona fide benchmark for the low-light spike stream reconstruction task. It significantly surpasses existing reconstruction datasets in scale and provides quantitative illumination information. The performance on real low-light spike streams demonstrates the superiority of Diff-SPK.
Summary: The spike camera, a type of neuromorphic sensor with high temporal resolution, shows great promise for high-speed visual tasks. Unlike traditional cameras, a spike camera continuously accumulates photons and fires asynchronous spike streams. Because of this unique data modality, spike streams need reconstruction methods to become perceptible to the human eye. However, many methods struggle with spike streams in low-light high-speed scenarios because of severe noise and sparse information. In this work, we propose Diff-SPK, the first diffusion-based reconstruction method for spike cameras. Diff-SPK effectively uses generative priors to supplement texture information in low-light conditions. Specifically, it first employs an Enhanced Texture from Inter-spike Interval (ETFI) to aggregate sparse information from low-light spike streams. ETFI then serves as a conditioning input for ControlNet to generate the high-speed scenes. To improve the quality of results, we introduce an ETFI-based feature fusion module during the generation process. Moreover, we establish the first bona fide benchmark for the low-light spike stream reconstruction task; it significantly surpasses existing reconstruction datasets in scale and provides quantitative illumination information. The performance on real low-light spike streams demonstrates the superiority of Diff-SPK.
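
To make the "accumulate photons and fire" description concrete, here is a toy single-pixel model together with a plain inter-spike-interval intensity estimate. It illustrates only the basic idea that a shorter interval between spikes implies brighter light; it is not the paper's enhanced ETFI, and the firing threshold, time discretization, and function names are assumptions.

```python
from typing import List


def simulate_spikes(intensity: float, num_steps: int,
                    threshold: float = 1.0) -> List[int]:
    """Integrate-and-fire pixel: accumulate intensity each step and emit a
    spike (1) whenever the accumulator reaches the threshold."""
    accumulator = 0.0
    spikes = []
    for _ in range(num_steps):
        accumulator += intensity
        if accumulator >= threshold:
            spikes.append(1)
            accumulator -= threshold
        else:
            spikes.append(0)
    return spikes


def isi_intensity(spikes: List[int], t: int, threshold: float = 1.0) -> float:
    """Estimate intensity at step t as threshold / (interval between the
    nearest spike at or before t and the nearest spike after t)."""
    prev_spike = next((i for i in range(t, -1, -1) if spikes[i]), None)
    next_spike = next((i for i in range(t + 1, len(spikes)) if spikes[i]), None)
    if prev_spike is None or next_spike is None:
        return 0.0  # not enough spikes around t to form an interval
    return threshold / (next_spike - prev_spike)


dim_stream = simulate_spikes(0.25, 16)     # fires every 4th step
bright_stream = simulate_spikes(0.5, 16)   # fires every 2nd step
dim_estimate = isi_intensity(dim_stream, 5)
bright_estimate = isi_intensity(bright_stream, 2)
```

In low light the intervals grow long and few spikes fall inside any short window, which is the sparsity problem the abstract refers to.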

Title: A Neural ODE Approach to Aircraft Flight Dynamics Modelling

Authors: Gabriel Jarry, Ramon Dalmau, Xavier Olive, Philippe Very
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23307
Pdf URL: https://arxiv.org/pdf/2509.23307
Copy Paste: [[2509.23307]] A Neural ODE Approach to Aircraft Flight Dynamics Modelling(https://arxiv.org/abs/2509.23307)
Keywords: generation
Abstract: Accurate aircraft trajectory prediction is critical for air traffic management, airline operations, and environmental assessment. This paper introduces NODE-FDM, a Neural Ordinary Differential Equations-based Flight Dynamics Model trained on Quick Access Recorder (QAR) data. By combining analytical kinematic relations with data-driven components, NODE-FDM achieves a more accurate reproduction of recorded trajectories than state-of-the-art models such as a BADA-based trajectory generation methodology (BADA4 performance model combined with trajectory control routines), particularly in the descent phase of the flight. The analysis demonstrates marked improvements across altitude, speed, and mass dynamics. Despite current limitations, including limited physical constraints and the limited availability of QAR data, the results demonstrate the potential of physics-informed neural ordinary differential equations as a high-fidelity, data-driven approach to aircraft performance modelling. Future work will extend the framework to incorporate a full modelling of the lateral dynamics of the aircraft.
摘要：准确的飞机轨迹预测对于空中交通管理，航空公司运营和环境评估至关重要。本文介绍了Node-FDM，这是一种基于快速访问录音机（QAR）数据训练的基于神经微分方程的飞行动力学模型。通过将分析运动学关系与数据驱动的组件相结合，Node-FDM比最先进的模型（例如基于BADA的基于BADA的轨迹生成方法）（BADA4性能模型与轨迹控制例程）更准确地复制了记录的轨迹。该分析表明，跨高度，速度和质量动态的明显改善。尽管当前的局限性，包括有限的物理限制和QAR数据的有限可用性，但结果表明了物理知识神经的普通微分方程的潜力，作为对飞机性能建模的高保真性，数据驱动的方法。未来的工作将扩展框架，以结合飞机横向动态的完整建模。

Title: LRPO: Enhancing Blind Face Restoration through Online Reinforcement Learning

Authors: Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23339
Pdf URL: https://arxiv.org/pdf/2509.23339
Copy Paste: [[2509.23339]] LRPO: Enhancing Blind Face Restoration through Online Reinforcement Learning(https://arxiv.org/abs/2509.23339)
Keywords: restoration
Abstract: Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a Likelihood-Regularized Policy Optimization (LRPO) framework, the first to apply online reinforcement learning (RL) to the BFR task. LRPO leverages rewards from sampled candidates to refine the policy network, increasing the likelihood of high-quality outputs while improving restoration performance on low-quality inputs. However, directly applying RL to BFR creates incompatibility issues, producing restoration results that deviate significantly from the ground truth. To balance perceptual quality and fidelity, we propose three key strategies: 1) a composite reward function tailored for face restoration assessment, 2) ground-truth guided likelihood regularization, and 3) noise-level advantage assignment. Extensive experiments demonstrate that our proposed LRPO significantly improves the face restoration quality over baseline methods and achieves state-of-the-art performance.
摘要：盲人面部修复（BFR）在探索其较大的解决方案空间时遇到了固有的挑战，从而导致了诸如修复的图像中缺少细节和身份歧义之类的常见人工制品。为了应对这些挑战，我们提出了一个可能性调节的政策优化（LRPO）框架，这是第一个将在线增强学习（RL）应用于BFR任务的框架。 LRPO利用采样候选人的奖励来完善政策网络，增加了高质量产出的可能性，同时提高了低质量输入的恢复性能。但是，直接将RL应用于BFR会产生不兼容的问题，从而产生恢复结果，从而显着偏离了地面真相。为了平衡感知质量和忠诚度，我们提出了三个关键策略：1）为面部恢复评估量身定制的综合奖励功能，2）地面确实指导了可能的可能性正规化，以及3）噪声级优势分配。广泛的实验表明，我们提出的LRPO显着提高了面部恢复质量，而不是基线方法，并实现了最先进的性能。

Title: Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport

Authors: Xavier Aramayo Carrasco, Grigoriy Ksenofontov, Aleksei Leonov, Iaroslav Sergeevich Koshelev, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23348
Pdf URL: https://arxiv.org/pdf/2509.23348
Copy Paste: [[2509.23348]] Entering the Era of Discrete Diffusion Models: A Benchmark for Schrödinger Bridges and Entropic Optimal Transport(https://arxiv.org/abs/2509.23348)
Keywords: generative
Abstract: The Entropic Optimal Transport (EOT) problem and its dynamic counterpart, the Schrödinger bridge (SB) problem, play an important role in modern machine learning, linking generative modeling with optimal transport theory. While recent advances in discrete diffusion and flow models have sparked growing interest in applying SB methods to discrete domains, there is still no reliable way to evaluate how well these methods actually solve the underlying problem. We address this challenge by introducing a benchmark for SB on discrete spaces. Our construction yields pairs of probability distributions with analytically known SB solutions, enabling rigorous evaluation. As a byproduct of building this benchmark, we obtain two new SB algorithms, DLightSB and DLightSB-M, and additionally extend prior related work to construct the $\alpha$-CSBM algorithm. We demonstrate the utility of our benchmark by evaluating both existing and new solvers in high-dimensional discrete settings. This work provides the first step toward proper evaluation of SB methods on discrete spaces, paving the way for more reproducible future studies.
摘要：熵最佳运输（EOT）问题及其动态对应物，Schrödinger桥（SB）问题，在现代机器学习中起着重要作用，将生成性建模与最佳运输理论联系起来。尽管离散扩散和流模型的最新进展激发了人们对将SB方法应用于离散域的日益兴趣，但仍然没有可靠的方法来评估这些方法实际上如何解决基本问题。我们通过在离散空间上引入SB的基准来应对这一挑战。我们的构造产生了与分析已知的SB解决方案的概率分布对，从而实现了严格的评估。作为构建此基准测试的副产品，我们获得了两种新的SB算法，DlightSB和DlightSB-M，并将其扩展到先前的相关工作以构建$ \ alpha $ -CSBM算法。我们通过在高维离散设置中评估现有和新的求解器来证明我们的基准标准的实用性。这项工作为在离散空间上正确评估SB方法提供了第一步，为更可重复的未来研究铺平了道路。
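
For readers unfamiliar with the EOT problem named in the abstract above, the finite discrete case has a classical solver, Sinkhorn iterations. The sketch below is only that textbook procedure on a toy 2x2 problem; it is not the paper's benchmark construction, nor DLightSB, DLightSB-M, or $\alpha$-CSBM. The function name `sinkhorn`, the cost matrix, and the value of `eps` are illustrative assumptions.

```python
import math


def sinkhorn(p, q, cost, eps=0.5, n_iters=200):
    """Entropic OT between discrete distributions p and q (textbook Sinkhorn).

    Approximates the coupling pi that minimises
        sum_ij pi[i][j] * cost[i][j] - eps * H(pi)
    subject to row sums p and column sums q.
    """
    n, m = len(p), len(q)
    # Gibbs kernel K[i][j] = exp(-cost[i][j] / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The coupling has the form diag(u) * K * diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem (hypothetical values, not taken from the paper).
p = [0.5, 0.5]
q = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
pi = sinkhorn(p, q, cost)
print("row sums:", [sum(row) for row in pi])
print("col sums:", [sum(pi[i][j] for i in range(2)) for j in range(2)])
```

Smaller `eps` brings the coupling closer to unregularised optimal transport but slows convergence, which is one reason ground-truth solutions are useful when evaluating solvers.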

Title: Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

Authors: Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, Yu Shi, Zhen Chen, Junshi Huang, Jason Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23352
Pdf URL: https://arxiv.org/pdf/2509.23352
Copy Paste: [[2509.23352]] Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling(https://arxiv.org/abs/2509.23352)
Keywords: generation
Abstract: The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by 4.9%, 5.91%, and 8.66% on those benchmarks, respectively, while improving the training efficiency by nearly 50%.
摘要：将增强学习（RL）集成到文本对图像（T2i）生成的流量匹配模型中，已促进了生成质量的重大进步。但是，由于抽样组的略有变化，这些收益通常以详尽的探索和效率低下的抽样策略为代价。在此洞察力的基础上，我们提出了Dynamic-treerpo，它将滑动窗口采样策略实现为树结构的搜索，并沿深度沿深度进行动态噪声强度。我们在此树结构中执行GRPO引导的优化和约束随机微分方程（SDE）采样。通过共享树的前缀路径，我们的设计有效地摊销了轨迹搜索的计算开销。对于每个树层设计良好的噪声强度，动态 - 特里普可以增强探索的变化，而无需任何额外的计算成本。此外，我们在Dynamic-Treerpo中无缝地整合了监督的微调（SFT）和RL范式，以构建我们提出的LayerTuning-RL，从而重新将SFT作为动态加权进度奖励模型（PRM）而不是单独的预审计方法重新定义。通过将这种加权PRM与动态自适应剪接边界相关联，可以避免动态 - 特里普中的勘探过程中断。我们的模型受益于树结构化的采样和LayerTuning-RL范式，动态地探索了有效的搜索空间。与现有基线相比，我们的方法在既定基准（包括HPS-v2.1，pickscore and ImageReward）上对语义一致性，视觉保真度和人类偏好对齐方面表现出显着优势。特别是，我们的模型在这些基准测试基准上的表现分别优于SOTA $ 4.9 \％$，$ 5.91 \％\％$和$ 8.66 \％$，同时将培训效率提高了近50美元\％$。

Title: Landing with the Score: Riemannian Optimization through Denoising

Authors: Andrey Kharitenko, Zebang Shen, Riccardo de Santi, Niao He, Florian Doerfler
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23357
Pdf URL: https://arxiv.org/pdf/2509.23357
Copy Paste: [[2509.23357]] Landing with the Score: Riemannian Optimization through Denoising(https://arxiv.org/abs/2509.23357)
Keywords: generative
Abstract: Under the data manifold hypothesis, high-dimensional data are concentrated near a low-dimensional manifold. We study the problem of Riemannian optimization over such manifolds when they are given only implicitly through the data distribution, and the standard manifold operations required by classical algorithms are unavailable. This formulation captures a broad class of data-driven design problems that are central to modern generative AI. Our key idea is to introduce a link function that connects the data distribution to the geometric operations needed for optimization. We show that this function enables the recovery of essential manifold operations, such as retraction and Riemannian gradient computation. Moreover, we establish a direct connection between our construction and the score function in diffusion models of the data distribution. This connection allows us to leverage well-studied parameterizations, efficient training procedures, and even pretrained score networks from the diffusion model literature to perform optimization. Building on this foundation, we propose two efficient inference-time algorithms -- Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD) -- and provide theoretical guarantees for both feasibility (approximate manifold adherence) and optimality (small Riemannian gradient norm). Finally, we demonstrate the effectiveness of our approach on finite-horizon reference tracking tasks in data-driven control, highlighting its potential for practical generative and design applications.
摘要：在数据歧管假设下，高维数据集中在低维歧管附近。我们研究了仅通过数据分布隐式给出的Riemannian优化问题，而经典算法要求的标准歧管操作是不可用的。这种表述捕获了广泛的数据驱动的设计问题，这些问题对现代生成AI至关重要。我们的关键想法是引入一个链接功能，该链接功能将数据分布连接到优化所需的几何操作。我们表明，此功能可以恢复基本歧管操作，例如缩回和Riemannian梯度计算。此外，我们在数据分布的扩散模型中建立了构造和得分函数之间的直接连接。这种连接使我们能够利用研究良好的参数化，有效的训练程序，甚至从扩散模型文献中验证的得分网络来进行优化。在这个基础的基础上，我们提出了两种有效的推理时间算法 - 降低着陆流（DLF）和DeNoEmannian梯度下降（DRGD） - 并为可行性（大概歧管依从性）和优化性提供理论保证（小riemannian梯度规范）。最后，我们证明了方法对数据驱动控制中有限的参考跟踪任务的有效性，从而强调了其实用生成和设计应用的潜力。

Title: Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23365
Pdf URL: https://arxiv.org/pdf/2509.23365
Copy Paste: [[2509.23365]] Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought(https://arxiv.org/abs/2509.23365)
Keywords: generation
Abstract: Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages -- (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model's local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.
摘要：先前的工作表明，连续思考的链（连续COT）通过启用隐式平行思维来提高大语言模型（LLMS）的推理能力，随后的工作提供了理论上的洞察力，这表明配备了连续COT的两层变压器可以通过保持多重理性思想的超级构想来有效地实现，可以有效地实现多个跨性别的距离。但是，目前尚不清楚如何从基于梯度的训练方法中自然学到叠加机制。为了填补这一空白，我们从理论上分析了在有向的图形可及性问题上简化两层变压器的训练动力学，以揭示叠加机制在两个训练阶段的训练过程中如何出现的 - （i）自动进取的思想产生阶段，自动锻炼可以自动化地扩展连续的思想，以及（ii）将预测阶段转换为最终的回答。我们的分析表明，在使用持续思考的训练期间，索引匹配的logit是一个重要数量，反映了模型本地搜索能力的强度，将首先提高，然后在温和的假设下保持界限。有限的索引匹配logit有效地平衡了推理过程中的探索和剥削：该模型将利用本地问题结构来识别可行的搜索轨迹，并为多个这样的痕迹分配可比的权重，以探索何时不确定哪个解决方案是正确的，从而导致叠加。我们跟踪逻辑增长的实验结果进一步验证了我们的理论。

Title: Generative Modeling of Shape-Dependent Self-Contact Human Poses

Authors: Takehiko Ohkawa, Jihyun Lee, Shunsuke Saito, Jason Saragih, Fabian Prado, Yichen Xu, Shoou-I Yu, Ryosuke Furuta, Yoichi Sato, Takaaki Shiratori
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23393
Pdf URL: https://arxiv.org/pdf/2509.23393
Copy Paste: [[2509.23393]] Generative Modeling of Shape-Dependent Self-Contact Human Poses(https://arxiv.org/abs/2509.23393)
Keywords: generative
Abstract: One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.
摘要：一个人几乎不能在不考虑潜在的身体形状的情况下对人姿势的自我接触。例如，为低BMI的人摩擦腹部的姿势会导致腹部渗透到BMI高的人的腹部。尽管有相关性，但现有的自我接触数据集缺乏各种自我接触姿势和精确的身体形状，从而限制了自我接触姿势和形状之间的结论性分析。为了解决这个问题，我们首先引入了第一个具有精确身体形状注册的广泛的自我接触数据集，Goliath-SC，由130名受试者构成383K自我接触姿势。使用此数据集，我们提出了基于身体零件的潜在扩散和自我注意力的自我形状参数来调节的自我接触的生成建模。我们将其进一步纳入了单视姿势估计中，同时精炼估计的姿势是接触的。我们的实验表明，形状调节对于自我接触姿势分布的成功建模至关重要，从而改善了自我接触中的单视姿势估计。

Title: WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Authors: Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23402
Pdf URL: https://arxiv.org/pdf/2509.23402
Copy Paste: [[2509.23402]] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving(https://arxiv.org/abs/2509.23402)
Keywords: generation, generative
Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel view videos rendered from these Gaussians using an enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.
摘要：驾驶现场生成和重建方面的最新进展表明，通过产生可扩展和可控的训练数据来增强自动驾驶系统的重要潜力。现有的发电方法主要集中于综合多样化和高保真驾驶视频；但是，由于3D一致性和稀疏观点覆盖范围，它们努力支持方便和高质量的小说合成（NVS）。相反，最近的3D/4D重建方法已大大改善了现实世界驾驶场景的NV，但固有地缺乏生成能力。为了克服场景产生和重建之间的这一困境，我们提出了\ textbf {worldsplat}，这是一个新颖的4D驾驶场景式框架框架。我们的方法通过两个关键步骤有效地生成一致的多轨视频：（（i））我们引入了一个4D感知的潜在扩散模型，该模型集成了多模式信息，以以馈送方式产生与像素相称的4D高斯人。（（ii））随后，我们使用增强的视频扩散模型来完善从这些高斯人呈现的新型视频视频。在基准数据集上进行的大量实验表明，\ textbf {worldsplat}有效地产生了高效率，在时间和空间上是一致的多轨道新型视图驱动视频。

Title: Planner Aware Path Learning in Diffusion Language Models Training

Authors: Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Avishek Joey Bose, Alexander Tong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23405
Pdf URL: https://arxiv.org/pdf/2509.23405
Copy Paste: [[2509.23405]] Planner Aware Path Learning in Diffusion Language Models Training(https://arxiv.org/abs/2509.23405)
Keywords: generation
Abstract: Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation.
摘要：扩散语言模型已成为自回归模型的强大替代方法，从而可以通过灵活和平行的生成路径快速推断。这种灵活性是通过新的抽样策略或计划者来实现的，它可以迭代地选择沿序列来定位的位置，而不是随机地均匀地进行采样。但是，通过修改反向路径，计划人员在训练过程中使用的均匀随机降解路径与推理时使用的基于计划的路径之间引入了不匹配。在这项工作中，我们系统地研究了这一不匹配，理论上表明标准离散扩散训练证据下限（ELBO）在不均匀计划下并未准确描述Denoiser。为了弥合这一差距，我们得出了一个新的计划证据下限（P-ELBO），该证据将基于计划者的反向动态直接纳入训练目标。在此基础上，我们提出了计划者意识路径学习（PAPL），这是对标准掩盖的离散扩散损失的简单有效修改，该损失使计划中的Denoisiser下的培训和推论保持一致。从经验上讲，PAPL在范围内提供一致的改进，包括蛋白质序列建模的40％相对增益，文本生成的淡紫色的4倍提高，人类人体通行证的相对增益为23％@10 in 10。

Title: FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Authors: Tanawan Premsri, Parisa Kordjamshidi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23452
Pdf URL: https://arxiv.org/pdf/2509.23452
Copy Paste: [[2509.23452]] FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing(https://arxiv.org/abs/2509.23452)
Keywords: generation
Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans utilize to comprehend and describe space. With the rapid progress in multimodal language models, the moment has come to integrate this long-overlooked dimension into these models. In particular, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are provided from perspectives other than the camera. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. FoR-SALE evaluates the alignment between a given text and an initially generated image, and refines the image based on the Frame of Reference specified in the spatial expressions. It employs vision modules to extract the spatial configuration of the image, while simultaneously mapping the spatial expression to a corresponding camera perspective. This unified perspective enables direct evaluation of alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction.
摘要：参考框架（For）是人类用来理解和描述空间的空间推理中的一个基本概念。随着多模式语言模型的快速进步，这一刻已经将这个长期忽视的维度整合到这些模型中。特别是，在文本到图像（T2i）一代中，当从相机以外的其他角度提供空间描述时，即使是最先进的模型也会显示出显着的性能差距。为了解决这一限制，我们提出了基于LLM的扩散编辑（for-sale）中参考引导的空间调整框架，这是T2i的自我校正LLM控制的扩散（SLD）框架的扩展。待售评估给定文本和最初生成的图像之间的对齐方式，并根据空间表达式中指定的参考框架来完善图像。它采用视觉模块来提取图像的空间配置，同时将空间表达式映射到相应的相机透视图。这种统一的观点可以直接评估语言和视觉之间的一致性。检测到未对准时，会生成和应用所需的编辑操作。 For-Sale应用新颖的潜在空间操作来调整生成图像的面向方向和深度。我们在两个专门设计用于评估空间理解的基准上评估了销售。我们的框架将最先进的T2I模型的性能提高了高达5.3％，仅使用一轮更正。

Title: No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation

Authors: Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23457
Pdf URL: https://arxiv.org/pdf/2509.23457
Copy Paste: [[2509.23457]] No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation(https://arxiv.org/abs/2509.23457)
Keywords: generation
Abstract: Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most prior approaches, which rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: this https URL
摘要：尽管最近的文本图像（T2I）模型取得了进步，但它们通常无法忠实地呈现复杂提示的所有要素，经常省略或歪曲特定的对象和属性。测试时间优化已成为一种有前途的方法，可以通过精炼生成而无需再进行重新培训来解决这一限制。在本文中，我们提出了一个细粒度的测试时间优化框架，以增强T2i生成中的组成忠诚度。与大多数仅依赖全球图像/文本相似性得分的先前方法不同，我们的方法将输入提示分解为语义概念，并评估全局和概念级别的对齐。剪辑的细粒变体用于计算概念级的对应关系，从而产生有关缺失或不准确概念的详细反馈。该反馈被馈入迭代提示循环中，使大型语言模型能够提出改进的提示。 Drawbench和Compbench上的实验提示表明，我们的方法显着改善了标准测试时间优化和基本T2I模型的概念覆盖范围和人为判断的忠诚。代码可用：此HTTPS URL

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Learning

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23462
Pdf URL: https://arxiv.org/pdf/2509.23462
Copy Paste: [[2509.23462]] Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Learning(https://arxiv.org/abs/2509.23462)
Keywords: generative
Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, such as Policy-Space Response Oracles (PSRO), require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of two-player and multi-player games, such as the Deceptive Messages Game, Kuhn Poker, and the Multi-Particle environment. We find that GEMS is up to ~6x faster and uses 1.3x less memory than PSRO, while also reaping higher rewards. These results demonstrate that GEMS retains the game-theoretic guarantees of PSRO while overcoming its fundamental inefficiencies, hence enabling scalable multi-agent learning in multiple domains.
摘要：可伸缩的多代理增强学习（MARL）仍然是AI的核心挑战。现有的基于人群的方法，例如策略空间响应Oracles，PSRO，需要存储明确的政策人群并构建全部回报矩阵，产生二次计算和线性内存成本。我们提出了生成的进化元溶剂（GEMS），这是一个无替代的框架，用紧凑的潜在锚和单个摊销生成器代替显式种群。 Gems并没有详尽构建收益矩阵，而是依靠公正的蒙特卡洛推出，乘法元动力学和无模型的经验伯恩斯坦UCB Oracle来适应性地扩展了策略集。最佳响应是使用基于优势的信任区目标在发电机中培训的，从而消除了存储和训练单独的演员的需求。我们在各种两人和多玩家游戏中评估了宝石，例如欺骗性消息游戏，库恩扑克和多粒子环境。我们发现，GEMS的速度快〜6倍，比PSRO的内存使用率少1.3倍，而同时获得更高的奖励。这些结果表明，GEMS保留了PSRO的游戏理论保证，同时克服了其基本的低效率，因此可以在多个领域中可扩展的多机构学习。
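
As background for the "multiplicative-weights meta-dynamics" mentioned in the abstract above, the following is a minimal sketch of the generic multiplicative-weights (Hedge) update on a small, fully known zero-sum payoff matrix. It is not the GEMS algorithm: per the abstract, GEMS avoids building the full payoff matrix and works from Monte Carlo rollouts instead. The function names, the 2x2 payoff matrix, the learning rate, and the round count are all illustrative assumptions.

```python
import math


def normalise(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def multiplicative_weights(payoff, n_rounds=20000, lr=0.01):
    """Both players of a zero-sum matrix game run multiplicative weights.

    payoff[i][j] is the row player's payoff; the column player receives
    its negative. Returns the time-averaged mixed strategies.
    """
    n, m = len(payoff), len(payoff[0])
    x = [1.0 / n] * n  # row player's current mixed strategy
    y = [1.0 / m] * m  # column player's current mixed strategy
    x_avg = [0.0] * n
    y_avg = [0.0] * m
    for _ in range(n_rounds):
        # Accumulate the running average of the strategies actually played.
        x_avg = [a + xi / n_rounds for a, xi in zip(x_avg, x)]
        y_avg = [a + yj / n_rounds for a, yj in zip(y_avg, y)]
        # Expected gain of each pure action against the opponent's current mix.
        row_gain = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        col_gain = [-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        # Exponential reweighting followed by renormalisation.
        x = normalise([xi * math.exp(lr * g) for xi, g in zip(x, row_gain)])
        y = normalise([yj * math.exp(lr * g) for yj, g in zip(y, col_gain)])
    return x_avg, y_avg


def exploitability(payoff, x, y):
    """Sum of both players' gains from best-responding; 0 at equilibrium."""
    n, m = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n))
    best_col = min(sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m))
    return best_row - best_col


# Toy game (hypothetical); its unique equilibrium mixes (0.4, 0.6) for both players.
game = [[2.0, -1.0], [-1.0, 1.0]]
x_bar, y_bar = multiplicative_weights(game)
print("row strategy:", x_bar)
print("col strategy:", y_bar)
print("exploitability:", exploitability(game, x_bar, y_bar))
```

The averaged strategies, rather than the last iterates, are returned because the standard no-regret guarantee for this update bounds the exploitability of the averages.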

Title: RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation

Authors: Shourya Verma, Mengbo Wang, Nadia Atallah Lanman, Ananth Grama
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23480
Pdf URL: https://arxiv.org/pdf/2509.23480
Copy Paste: [[2509.23480]] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation(https://arxiv.org/abs/2509.23480)
Keywords: restoration, generative
Abstract: Current approaches for restoration of degraded images face a critical trade-off: high-performance models are too slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose 'RestoRect', a novel Latent Rectified Flow Feature Distillation method for restoring degraded images. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex theory for physics-based decomposition with learnable anisotropic diffusion constraints, and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability, and faster convergence and inference while preserving restoration quality. We demonstrate superior results across 15 image restoration datasets, covering 4 tasks, on 8 metrics.
摘要：当前恢复降级图像的方法面临着关键的权衡：高性能模型对于实际使用而言太慢，而快速模型会产生较差的结果。知识蒸馏将教师知识转移给学生，但是现有的静态功能匹配方法无法捕获现代变压器架构如何动态生成特征。我们提出了一种新型的潜在整流流量蒸馏方法“ Restorect”，用于恢复降解的图像。我们将整流的流程进行重新调整功能蒸馏作为生成过程，学生通过在潜在空间中的可学习轨迹学会合成教师品质的特征。我们的框架将基于物理的分解的视网膜理论与可学习的各向异性扩散约束和三角色空间极化相结合。我们引入了特征层提取损失，以通过基于百分位数的异常值检测来通过交叉标准化变压器特征对准不同的网络体系结构之间的鲁棒知识转移。还原可实现更好的训练稳定性，更快的融合和推理，同时保持恢复质量。我们在15个图像修复数据集中展示了涵盖8个指标的4个任务的卓越结果。

Title: Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis

Authors: Ibne Farabi Shihab, Weiheng Chai, Jiyang Wang, Sanjeda Akter, Senem Velipasalar Gursoy, Anuj Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23535
Pdf URL: https://arxiv.org/pdf/2509.23535
Copy Paste: [[2509.23535]] Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis(https://arxiv.org/abs/2509.23535)
Keywords: super-resolution
Abstract: Driver monitoring systems require not just high accuracy but reliable, well-calibrated confidence scores for safety-critical deployment. While direct low-resolution training yields high overall accuracy, it produces poorly calibrated predictions that can be dangerous in safety-critical scenarios. We propose a resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events. Our approach achieves state-of-the-art performance on safety-centric metrics: best calibration (ECE of 5.8% vs 6.2% for LR-trained baselines), highest AUPR for drowsiness detection (0.78 vs 0.74), and superior precision-recall for phone use detection (0.74 vs 0.71). A lightweight artifact detector (0.3M parameters, 5.2ms overhead) provides additional safety by filtering SR-induced hallucinations. While LR-trained video models serve as strong general-purpose baselines, our adaptive framework represents the state-of-the-art solution for safety-critical applications where reliability is paramount.
摘要：驾驶员监控系统不仅需要高精度，而且需要可靠，良好的置信度分数才能安全至关重要。虽然直接的低分辨率训练可产生高度准确性，但它会产生校准较差的预测，在安全至关重要的情况下可能是危险的。我们提出了一个资源感知的自适应超分辨率框架，该框架可针对模型校准优化，并在关键事件上进行高精度重新调整。我们的方法在以安全为中心的指标上实现了最先进的性能：最佳校准（ECE为5.8 \％\％vs 6.2 \％的LR训练式基准线），最高的AUPR用于嗜睡检测（0.78 vs 0.74 vs 0.74），以及用于手机使用检测的卓越精确度（0.74 vs 0.71）。轻巧的伪影检测器（0.30万参数，开销5.2m）通过过滤SR诱导的幻觉提供了额外的安全性。尽管LR训练的视频模型充当强大的通用基线，但我们的自适应框架代表了可靠性至关重要的安全至关重要应用的最新解决方案。
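
The ECE figures quoted in the abstract above refer to Expected Calibration Error. A minimal sketch of the common equal-width-bin definition follows, assuming ten bins and top-class confidences; the abstract does not state the binning scheme the authors used, and the function name and toy data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     1 if the prediction was right, else 0
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin running sums: [count, sum of confidence, sum of correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += hit

    ece = 0.0
    for count, conf_sum, hit_sum in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum / count
        accuracy = hit_sum / count
        ece += (count / n) * abs(accuracy - avg_conf)
    return ece


# Toy predictions (hypothetical, not from the paper).
confs = [0.95, 0.90, 0.85, 0.60, 0.55]
hits = [1, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(confs, hits):.3f}")
```

A value of 0 means confidence matches observed accuracy in every occupied bin; the result depends on the number of bins, so ECE values are only comparable under the same binning.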

Title: Disentanglement of Variations with Multimodal Generative Modeling

Authors: Yijie Zhang, Yiyang Shen, Weiran Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23548
Pdf URL: https://arxiv.org/pdf/2509.23548
Copy Paste: [[2509.23548]] Disentanglement of Variations with Multimodal Generative Modeling(https://arxiv.org/abs/2509.23548)
Keywords: generation, generative
Abstract: Multimodal data are prevalent across various domains, and learning robust representations of such data is paramount to enhancing generation quality and downstream task performance. To handle heterogeneity and interconnections among different modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle with challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue, with rigorous mutual information-based regularizations, including cross-view mutual information maximization for extracting shared variables, and a cycle-consistency style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.

Title: From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations

Authors: Javed Ahmad, Penggang Gao, Donatien Delehelle, Mennuti Canio, Nikhil Deshpande, Jesús Ortiz, Darwin G. Caldwell, Yonas Teodros Tefera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23555
Pdf URL: https://arxiv.org/pdf/2509.23555
Copy Paste: [[2509.23555]] From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations(https://arxiv.org/abs/2509.23555)
Keywords: generation
Abstract: Neural scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have transformed how 3D environments are modeled, rendered, and interpreted. NeRF introduced view-consistent photorealism via volumetric rendering; 3DGS has rapidly emerged as an explicit, efficient alternative that supports high-quality rendering, faster optimization, and integration into hybrid pipelines for enhanced photorealism and task-driven scene understanding. This survey examines how 3DGS is being adopted across SLAM, telepresence and teleoperation, robotic manipulation, and 3D content generation. Despite their differences, these domains share common goals: photorealistic rendering, meaningful 3D structure, and accurate downstream tasks. We organize the review around unified research questions that explain why 3DGS is increasingly displacing NeRF-based approaches: What technical advantages drive its adoption? How does it adapt to different input modalities and domain-specific constraints? What limitations remain? By systematically comparing domain-specific pipelines, we show that 3DGS balances photorealism, geometric fidelity, and computational efficiency. The survey offers a roadmap for leveraging neural rendering not only for image synthesis but also for perception, interaction, and content creation across real and virtual environments.

Title: Towards Interpretable Visual Decoding with Attention to Brain Representations

Authors: Pinyuan Feng, Hossein Adeli, Wenxuan Guo, Fan Cheng, Ethan Hwang, Nikolaus Kriegeskorte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23566
Pdf URL: https://arxiv.org/pdf/2509.23566
Copy Paste: [[2509.23566]] Towards Interpretable Visual Decoding with Attention to Brain Representations(https://arxiv.org/abs/2509.23566)
Keywords: generation, generative
Abstract: Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, helping brain science researchers interpret how the brain represents real-world scenes. However, most current approaches leverage mapping brain signals into intermediate image or text feature spaces before guiding the generative process, masking the effect of contributions from different brain areas on the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals shape the generation process. To this end, we contribute an Image-Brain BI-directional interpretability framework (IBBI) which investigates cross-attention mechanisms across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our results highlight the potential of end-to-end brain-to-image decoding and establish a path toward interpreting diffusion models through the lens of visual neuroscience.

Title: RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

Authors: Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23582
Pdf URL: https://arxiv.org/pdf/2509.23582
Copy Paste: [[2509.23582]] RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization(https://arxiv.org/abs/2509.23582)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configurations. To the best of our knowledge, RobuQ is the first to achieve stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to an average of 2 bits. The code and models will be available at this https URL.
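
Illustrative sketch of the Hadamard step mentioned in the abstract: an orthonormal Walsh-Hadamard transform spreads a single activation outlier evenly across a token's channels, which is the usual intuition for why such a rotation eases low-bit activation quantization. This is a generic transform, not RobustQuantizer itself; the function name `hadamard_transform` and the toy input are assumptions made for illustration.

```python
import numpy as np

def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two.
    """
    x = np.array(x, dtype=float)  # copy so the caller's array is untouched
    n = x.shape[-1]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: pair up neighbouring blocks of width h.
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*x.shape[:-3], n)
        h *= 2
    return x / np.sqrt(n)

# Toy token with one large outlier channel (hypothetical data).
token = np.zeros(8)
token[3] = 8.0
rotated = hadamard_transform(token)
print(np.abs(token).max(), np.abs(rotated).max())
```

Because the transform is orthonormal and symmetric, applying it twice recovers the input, so the rotation loses no information before quantization.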

Title: VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement

Authors: Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23584
Pdf URL: https://arxiv.org/pdf/2509.23584
Copy Paste: [[2509.23584]] VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement(https://arxiv.org/abs/2509.23584)
Keywords: restoration, super-resolution, generation, generative
Abstract: Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.

Title: Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models

Authors: Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23593
Pdf URL: https://arxiv.org/pdf/2509.23593
Copy Paste: [[2509.23593]] Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models(https://arxiv.org/abs/2509.23593)
Keywords: generation
Abstract: Catastrophic forgetting remains a central obstacle for continual learning in neural models. Popular approaches -- replay and elastic weight consolidation (EWC) -- have limitations: replay requires a strong generator and is prone to distributional drift, while EWC implicitly assumes a shared optimum across tasks and typically uses a diagonal Fisher approximation. In this work, we study the gradient geometry of diffusion models, which can already produce high-quality replay data. We provide theoretical and empirical evidence that, in the low signal-to-noise ratio (SNR) regime, per-sample gradients become strongly collinear, yielding an empirical Fisher that is effectively rank-1 and aligned with the mean gradient. Leveraging this structure, we propose a rank-1 variant of EWC that is as cheap as the diagonal approximation yet captures the dominant curvature direction. We pair this penalty with a replay-based approach to encourage parameter sharing across tasks while mitigating drift. On class-incremental image generation datasets (MNIST, FashionMNIST, CIFAR-10, ImageNet-1k), our method consistently improves average FID and reduces forgetting relative to replay-only and diagonal-EWC baselines. In particular, forgetting is nearly eliminated on MNIST and FashionMNIST and is roughly halved on ImageNet-1k. These results suggest that diffusion models admit an approximately rank-1 Fisher. With a better Fisher estimate, EWC becomes a strong complement to replay: replay encourages parameter sharing across tasks, while EWC effectively constrains replay-induced drift.
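
Illustrative sketch of the idea in the abstract, comparing the usual diagonal EWC penalty with a rank-1 penalty built from the mean gradient. It assumes the Fisher is approximated as the outer product of the mean per-sample gradient with itself, with any scale factor absorbed into `lam`; the paper's exact estimator may differ. The function names are hypothetical.

```python
import numpy as np

def diagonal_ewc_penalty(theta, theta_star, per_sample_grads, lam=1.0):
    """Standard diagonal EWC: 0.5 * lam * sum_i F_ii * (theta_i - theta*_i)^2."""
    fisher_diag = np.mean(per_sample_grads ** 2, axis=0)
    delta = theta - theta_star
    return 0.5 * lam * float(np.sum(fisher_diag * delta ** 2))

def rank1_ewc_penalty(theta, theta_star, per_sample_grads, lam=1.0):
    """Rank-1 EWC (assumed form): F ~ g g^T with g the mean gradient.

    delta^T (g g^T) delta = (g . delta)^2, so the cost stays O(d),
    the same order as the diagonal approximation.
    """
    g_mean = per_sample_grads.mean(axis=0)
    delta = theta - theta_star
    return 0.5 * lam * float(np.dot(g_mean, delta) ** 2)

# Toy case: perfectly collinear (identical) per-sample gradients.
grads = np.tile(np.array([1.0, 2.0, -1.0]), (4, 1))
theta_star = np.zeros(3)
orthogonal_move = np.array([1.0, 0.0, 1.0])  # orthogonal to the gradient
aligned_move = np.array([1.0, 1.0, 0.0])

print(rank1_ewc_penalty(orthogonal_move, theta_star, grads))
print(diagonal_ewc_penalty(orthogonal_move, theta_star, grads))
print(rank1_ewc_penalty(aligned_move, theta_star, grads))
```

In this toy case the rank-1 penalty leaves moves orthogonal to the dominant gradient direction free, while the diagonal penalty still charges for them.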

Title: VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration

Authors: Han Hu, Zhuoran Zheng, Liang Li, Chen Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23601
Pdf URL: https://arxiv.org/pdf/2509.23601
Copy Paste: [[2509.23601]] VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration(https://arxiv.org/abs/2509.23601)
Keywords: restoration
Abstract: Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM (Queue-based Cache Low-rank Adaptive Memory) enhances feature learning through a FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling this http URL, GPS-SS2D (Greedy Path Scan SS2D) introduces adaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy determines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at this https URL.

Title: MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising

Authors: Tangtangfang Fang, Jingxi Hu, Xiangjian He, Jiaqi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23603
Pdf URL: https://arxiv.org/pdf/2509.23603
Copy Paste: [[2509.23603]] MAN: Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising(https://arxiv.org/abs/2509.23603)
Keywords: generative
Abstract: While diffusion models have set a new benchmark for quality in Low-Dose Computed Tomography (LDCT) denoising, their clinical adoption is critically hindered by extreme computational costs, with inference times often exceeding thousands of seconds per scan. To overcome this barrier, we introduce MAN, a Latent Diffusion Enhanced Multistage Anti-Noise Network for Efficient and High-Quality Low-Dose CT Image Denoising task. Our method operates in a compressed latent space via a perceptually-optimized autoencoder, enabling an attention-based conditional U-Net to perform the fast, deterministic conditional denoising diffusion process with drastically reduced overhead. On the LDCT and Projection dataset, our model achieves superior perceptual quality, surpassing CNN/GAN-based methods while rivaling the reconstruction fidelity of computationally heavy diffusion models like DDPM and Dn-Dp. Most critically, in the inference stage, our model is over 60x faster than representative pixel space diffusion denoisers, while remaining competitive on PSNR/SSIM scores. By bridging the gap between high fidelity and clinical viability, our work demonstrates a practical path forward for advanced generative models in medical imaging.

Title: VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Authors: Zeren Xiong, Yue Yu, Zedong Zhang, Shuo Chen, Jian Yang, Jun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23605
Pdf URL: https://arxiv.org/pdf/2509.23605
Copy Paste: [[2509.23605]] VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis(https://arxiv.org/abs/2509.23605)
Keywords: generation
Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
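
Illustrative sketch of spherical interpolation, one of the ingredients the abstract lists for the hybrid sampling process. It shows a generic slerp between two noise or latent vectors. How VMDiff combines it with guided denoising and inversion is not described here, so the function name `slerp` and the fallback threshold `eps` are assumptions.

```python
import numpy as np

def slerp(z0, z1, t, eps=1e-7):
    """Spherical linear interpolation between z0 and z1 for t in [0, 1]."""
    z0 = np.asarray(z0, dtype=float)
    z1 = np.asarray(z1, dtype=float)
    u0 = z0.ravel() / np.linalg.norm(z0)
    u1 = z1.ravel() / np.linalg.norm(z1)
    # Angle between the two directions, clipped for numerical safety.
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or antiparallel) inputs: fall back to plain lerp.
        return (1.0 - t) * z0 + t * z1
    w0 = np.sin((1.0 - t) * omega) / sin_omega
    w1 = np.sin(t * omega) / sin_omega
    return w0 * z0 + w1 * z1

# Halfway between two orthogonal unit vectors stays on the unit circle.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(mid, np.linalg.norm(mid))
```

Unlike plain linear interpolation, slerp between two unit-norm vectors keeps the result at unit norm, which is why it is often preferred for mixing Gaussian noise samples.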

Title: InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects

Authors: Xinhao Cai, Minghang Zheng, Xin Jin, Yang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23612
Pdf URL: https://arxiv.org/pdf/2509.23612
Copy Paste: [[2509.23612]] InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects(https://arxiv.org/abs/2509.23612)
Keywords: generation
Abstract: We propose a novel task of text-controlled human object interaction generation in 3D scenes with movable objects. Existing human-scene interaction datasets suffer from insufficient interaction categories and typically only consider interactions with static objects (do not change object positions), and the collection of such datasets with movable objects is difficult and costly. To address this problem, we construct the InteractMove dataset for Movable Human-Object Interaction in 3D Scenes by aligning existing human object interaction data with scene contexts, featuring three key characteristics: 1) scenes containing multiple movable objects with text-controlled interaction specifications (including same-category distractors requiring spatial and 3D scene context understanding), 2) diverse object types and sizes with varied interaction patterns (one-hand, two-hand, etc.), and 3) physically plausible object manipulation trajectories. With the introduction of various movable objects, this task becomes more challenging, as the model needs to identify objects to be interacted with accurately, learn to interact with objects of different sizes and categories, and avoid collisions between movable objects and the scene. To tackle such challenges, we propose a novel pipeline solution. We first use 3D visual grounding models to identify the interaction object. Then, we propose a hand-object joint affordance learning to predict contact regions for different hand joints and object parts, enabling accurate grasping and manipulation of diverse objects. Finally, we optimize interactions with local-scene modeling and collision avoidance constraints, ensuring physically plausible motions and avoiding collisions between objects and the scene. Comprehensive experiments demonstrate our method's superiority in generating physically plausible, text-compliant interactions compared to existing approaches.

Title: BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images

Authors: Cheng Huang, Weizheng Xie, Fan Gao, Yutong Liu, Ruoling Wu, Zeyu Han, Jingxi Qiu, Xiangxiang Wang, Zhenglin Yang, Hao Wang, Yongbin Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23617
Pdf URL: https://arxiv.org/pdf/2509.23617
Copy Paste: [[2509.23617]] BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images(https://arxiv.org/abs/2509.23617)
Keywords: generative
Abstract: Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: this https URL.

Title: DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Authors: Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23624
Pdf URL: https://arxiv.org/pdf/2509.23624
Copy Paste: [[2509.23624]] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation(https://arxiv.org/abs/2509.23624)
Keywords: generation, generative
Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency. Code will be made publicly available.

Title: MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing

Authors: Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23635
Pdf URL: https://arxiv.org/pdf/2509.23635
Copy Paste: [[2509.23635]] MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing(https://arxiv.org/abs/2509.23635)
Keywords: generation
Abstract: This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a dual-tower architecture with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
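
Illustrative sketch of what "temporally staggering" residual token streams can look like. It assumes stream k is delayed by k steps and padded with a placeholder token, so that at each step a coarser stream is one step ahead of the next finer one. The abstract does not give the exact offsets, and the names `apply_delay`, `undo_delay`, and `PAD` are hypothetical.

```python
import numpy as np

PAD = -1  # placeholder token id (assumption)

def apply_delay(tokens, pad=PAD):
    """Shift stream k right by k steps: (K, T) -> (K, T + K - 1)."""
    tokens = np.asarray(tokens)
    k, t = tokens.shape
    out = np.full((k, t + k - 1), pad, dtype=tokens.dtype)
    for s in range(k):
        out[s, s:s + t] = tokens[s]
    return out

def undo_delay(delayed):
    """Inverse of apply_delay: realign the streams back to shape (K, T)."""
    delayed = np.asarray(delayed)
    k, total = delayed.shape
    t = total - k + 1
    return np.stack([delayed[s, s:s + t] for s in range(k)])

# Two residual streams over three time steps (toy token ids).
tokens = np.array([[1, 2, 3],
                   [4, 5, 6]])
delayed = apply_delay(tokens)
print(delayed)
```

With this layout a model can emit one column per step while each finer stream conditions on the coarser tokens already produced, at a sequence length of T + K - 1 rather than K * T.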

Title: LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders

Authors: Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.23639
Pdf URL: https://arxiv.org/pdf/2509.23639
Copy Paste: [[2509.23639]] LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders(https://arxiv.org/abs/2509.23639)
Keywords: generation
Abstract: This paper explores a novel lightweight approach, LightFair, to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To fine-tune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just 1/4 of the training burden, with virtually no increase in sampling burden. The code is available at this https URL.

Title: Griffin: Generative Reference and Layout Guided Image Composition

Authors: Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23643
Pdf URL: https://arxiv.org/pdf/2509.23643
Copy Paste: [[2509.23643]] Griffin: Generative Reference and Layout Guided Image Composition(https://arxiv.org/abs/2509.23643)
Keywords: generation, generative
Abstract: Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

Title: Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures

Authors: Lu Xiao, Jiale Zhang, Yang Liu, Taicheng Huang, Xin Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23646
Pdf URL: https://arxiv.org/pdf/2509.23646
Copy Paste: [[2509.23646]] Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures(https://arxiv.org/abs/2509.23646)
Keywords: generation
Abstract: The creation of high-fidelity 3D assets is often hindered by a 'pixel-level pain point': the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up, a memory-efficient, high-fidelity texture modeling framework that effectively preserves high-frequency details. We use sparse voxels to guide texture reconstruction and ensure multi-view consistency, while leveraging surface anchoring and view-domain partitioning to break through resolution constraints. Surface anchoring employs a learnable upsampling strategy to constrain voxels to the mesh surface, eliminating over 70% of redundant voxels present in traditional voxel upsampling. View-domain partitioning introduces an image patch-guided voxel partitioning scheme, supervising and back-propagating gradients only on visible local patches. Through these two strategies, we can significantly reduce memory consumption during high-resolution voxel training without sacrificing geometric consistency, while preserving high-frequency details in textures.

Title: HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

Authors: Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23663
Pdf URL: https://arxiv.org/pdf/2509.23663
Copy Paste: [[2509.23663]] HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score(https://arxiv.org/abs/2509.23663)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are not important and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLMs efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. Specifically, we utilize attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention, to estimate visual token importance. Based on this, we propose a hierarchical visual token pruning method to retain both globally and locally important visual tokens. Specifically, we reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain tokens with higher importance scores in each region; in the local retaining stage, we then divide the image into small windows and retain the most important token in each local window. Experimental results show that our proposed method, HIVTP, can reduce the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improve the token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy, and even achieving improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
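The two retaining stages described in the HIVTP abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `hierarchical_prune`, its parameters, and the choice to take the union of both stages are assumptions, and the importance scores are taken as given rather than extracted from middle-layer attention maps.

```python
def hierarchical_prune(scores, grid_h, grid_w, region, keep_per_region, window):
    """Return sorted flat indices of the visual tokens kept by a two-stage pass.

    scores: flat list of per-token importance, length grid_h * grid_w
            (the 1-D token sequence, viewed here as a 2-D grid).
    region / window: side lengths of the coarse regions and small windows
                     (hypothetical parameters, not from the paper).
    """
    assert len(scores) == grid_h * grid_w
    kept = set()

    def cells(r0, c0, size):
        # Flat indices of the tokens inside a size x size block of the grid.
        return [r * grid_w + c
                for r in range(r0, min(r0 + size, grid_h))
                for c in range(c0, min(c0 + size, grid_w))]

    # Global stage: keep the top-scoring tokens in every coarse region.
    for r0 in range(0, grid_h, region):
        for c0 in range(0, grid_w, region):
            block = sorted(cells(r0, c0, region),
                           key=lambda i: scores[i], reverse=True)
            kept.update(block[:keep_per_region])

    # Local stage: keep the single most important token in every small window.
    for r0 in range(0, grid_h, window):
        for c0 in range(0, grid_w, window):
            kept.add(max(cells(r0, c0, window), key=lambda i: scores[i]))

    return sorted(kept)


# Toy 4x4 grid whose scores simply increase with the flat index.
print(hierarchical_prune(list(range(16)), 4, 4,
                         region=4, keep_per_region=2, window=2))
```

On the toy grid, the global stage keeps the two highest-scoring tokens of the single region and the local stage adds the best token of each 2x2 window, so tokens that matter only locally survive even when they rank low globally.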

Title: Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability

Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23666
Pdf URL: https://arxiv.org/pdf/2509.23666
Copy Paste: [[2509.23666]] Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability(https://arxiv.org/abs/2509.23666)
Keywords: generation
Abstract: Early-Exit Deep Neural Networks enable adaptive inference by allowing prediction at intermediary layers, significantly reducing computational costs and latency. Most of the early exit strategies greedily exit a sample at an intermediary layer if the confidence in class prediction exceeds a predefined threshold that is set using a static validation set. This is problematic as the model might be overconfident in a wrong class. Also, they are not robust to distribution shifts encountered in deployment, which can undermine model trustworthiness and accuracy. To address these challenges, we propose UAT that adapts the threshold for exit decisions using a Multi-Armed Bandit framework, enabling online, unsupervised adjustment of exit decisions. UAT makes decisions based on a new reward function that assesses predictive certainty and its reliability to balance computational efficiency and prediction quality while penalizing unnecessary late exits. We provide guarantees on risk achieved by UAT and validate its performance on diverse tasks spanning vision-language understanding, text generation, and classification. Our framework demonstrates consistent improvements in speedup (1.70-2.10x) with a minimal performance drop (<2%) as compared to full model performance. Our source code is available at this https URL.
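For context, the greedy baseline that the abstract criticizes (exit at the first intermediate layer whose class confidence exceeds a fixed threshold) can be sketched as below. This is only the static-threshold rule, not UAT itself; the function names and the use of the maximum softmax probability as the confidence measure are assumptions made for illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_early_exit(per_layer_logits, threshold):
    """Return (exit_layer_index, predicted_class) for one sample.

    per_layer_logits: one list of class logits per exit point, ordered
                      from the earliest layer to the final layer.
    threshold: fixed confidence threshold (set offline in the baseline).
    """
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence > threshold:
            # Greedy rule: leave at the first sufficiently confident layer.
            return layer, probs.index(confidence)
    # No intermediate exit fired: use the final layer's prediction.
    last = softmax(per_layer_logits[-1])
    return len(per_layer_logits) - 1, last.index(max(last))


layers = [[0.0, 0.0], [0.0, 3.0], [5.0, 0.0]]
print(greedy_early_exit(layers, threshold=0.9))
print(greedy_early_exit(layers, threshold=0.99))
```

In the toy input the second layer is confident in class 1 while the final layer prefers class 0, so the lower threshold exits early with a different answer than the full model. That is the overconfidence failure mode the abstract describes, and it is why a threshold fixed on a static validation set can be fragile.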

Title: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23681
Pdf URL: https://arxiv.org/pdf/2509.23681
Copy Paste: [[2509.23681]] QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification(https://arxiv.org/abs/2509.23681)
Keywords: generation
Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a 3.68$\times$ reduction in storage and 1.88$\times$ acceleration in end-to-end inference. Our code will be released in this https URL.

Title: PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease

Authors: Shuai Shao, Shu Jiang, Shiyuan Zhao, Di Yang, Yan Wang, Yutong Bai, Jianguo Zhang, Jiangtao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23719
Pdf URL: https://arxiv.org/pdf/2509.23719
Copy Paste: [[2509.23719]] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson's Disease(https://arxiv.org/abs/2509.23719)
Keywords: generative
Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder that severely diminishes patients' quality of life. Its global prevalence has increased markedly in recent decades. Current diagnostic workflows are complex and heavily reliant on neurologists' expertise, often resulting in delays in early detection and missed opportunities for timely intervention. To address these issues, we propose an end-to-end automated diagnostic method for PD, termed PD-Diag-Net, which performs risk assessment and auxiliary diagnosis directly from raw MRI scans. This framework first introduces an MRI Pre-processing Module (MRI-Processor) to mitigate inter-subject and inter-scanner variability by flexibly integrating established medical imaging preprocessing tools. It then incorporates two forms of clinical prior knowledge: (1) Brain-Region-Relevance-Prior (Relevance-Prior), which specifies brain regions strongly associated with PD; and (2) Brain-Region-Aging-Prior (Aging-Prior), which reflects the accelerated aging typically observed in PD-associated regions. Building on these priors, we design two dedicated modules: the Relevance-Prior Guided Feature Aggregation Module (Aggregator), which guides the model to focus on PD-associated regions at the inter-subject level, and the Age-Prior Guided Diagnosis Module (Diagnoser), which leverages brain age gaps as auxiliary constraints at the intra-subject level to enhance diagnostic accuracy and clinical interpretability. Furthermore, we collected external test data from our collaborating hospital. Experimental results show that PD-Diag-Net achieves 86% accuracy on external tests and over 96% accuracy in early-stage diagnosis, outperforming existing advanced methods by more than 20%.

Title: DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion

Authors: Zijun Li, Hongyu Yan, Shijie Li, Kunming Luo, Li Lu, Xulei Yang, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23723
Pdf URL: https://arxiv.org/pdf/2509.23723
Copy Paste: [[2509.23723]] DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion(https://arxiv.org/abs/2509.23723)
Keywords: generation, generative
Abstract: Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDM's powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.

Title: M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation

Authors: Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li, Li Lin, Yuwang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23728
Pdf URL: https://arxiv.org/pdf/2509.23728
Copy Paste: [[2509.23728]] M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation(https://arxiv.org/abs/2509.23728)
Keywords: generation
Abstract: In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 15,080 layouts and over 258k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using a text-conditioned diffusion model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis.

Title: HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Authors: Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23736
Pdf URL: https://arxiv.org/pdf/2509.23736
Copy Paste: [[2509.23736]] HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation(https://arxiv.org/abs/2509.23736)
Keywords: generation
Abstract: In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
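The two percentage figures quoted for rFID and gFID appear to be relative reductions (lower is better for both metrics). A quick check, assuming that reading and using a hypothetical helper name:

```python
def relative_reduction(before, after):
    # Percentage by which `after` is lower than `before`.
    return 100.0 * (before - after) / before


rfid_gain = relative_reduction(1.47, 1.07)   # single-scale vs multi-scale rFID
gfid_gain = relative_reduction(16.4, 13.3)   # gFID before vs after integration

print(f"rFID: {rfid_gain:.1f}%  gFID: {gfid_gain:.1f}%")
```

Both values round to the percentages stated in the abstract, which supports reading them as relative rather than absolute improvements.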

Title: Time-Shifted Token Scheduling for Symbolic Music Generation

Authors: Ting-Kang Wang, Chih-Pin Tan, Yi-Hsuan Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23749
Pdf URL: https://arxiv.org/pdf/2509.23749
Copy Paste: [[2509.23749]] Time-Shifted Token Scheduling for Symbolic Music Generation(https://arxiv.org/abs/2509.23749)
Keywords: generation
Abstract: Symbolic music generation faces a fundamental trade-off between efficiency and quality. Fine-grained tokenizations achieve strong coherence but incur long sequences and high complexity, while compact tokenizations improve efficiency at the expense of intra-token dependencies. To address this, we adapt a delay-based scheduling mechanism (DP) that expands compound-like tokens across decoding steps, enabling autoregressive modeling of intra-token dependencies while preserving efficiency. Notably, DP is a lightweight strategy that introduces no additional parameters and can be seamlessly integrated into existing representations. Experiments on symbolic orchestral MIDI datasets show that our method improves all metrics over standard compound tokenizations and narrows the gap to fine-grained tokenizations.
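One common way to realize a delay-based schedule is to shift the k-th field of every compound token by k decoding steps, so that each field is predicted after the earlier fields of the same event are already in the context. The sketch below illustrates that idea only; the abstract does not spell out the exact schedule, so the one-step-per-field delay, the padding value, and the function names `apply_delay` and `undo_delay` are all assumptions.

```python
PAD = None  # hypothetical placeholder for positions with no sub-token yet


def apply_delay(compound_tokens, pad=PAD):
    """Spread each compound token over several decoding steps.

    compound_tokens: non-empty list of T tuples, each holding K sub-tokens
                     (for example pitch, duration, velocity).
    Returns T + K - 1 steps; at step t, field k carries the field-k
    sub-token of event t - k, or `pad` if that event does not exist.
    """
    T = len(compound_tokens)
    K = len(compound_tokens[0])
    steps = []
    for t in range(T + K - 1):
        row = []
        for k in range(K):
            src = t - k
            row.append(compound_tokens[src][k] if 0 <= src < T else pad)
        steps.append(tuple(row))
    return steps


def undo_delay(steps, num_fields):
    """Invert apply_delay and recover the original compound tokens."""
    T = len(steps) - num_fields + 1
    return [tuple(steps[t + k][k] for k in range(num_fields))
            for t in range(T)]


events = [("p1", "d1", "v1"), ("p2", "d2", "v2")]
for step in apply_delay(events):
    print(step)
```

The sequence grows by only K - 1 steps rather than by a factor of K, which is how a schedule of this kind can keep the efficiency of compact tokenization while still ordering the fields within each event.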

Title: Anchored Supervised Fine-Tuning

Authors: He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2509.23753
Pdf URL: https://arxiv.org/pdf/2509.23753
Copy Paste: [[2509.23753]] Anchored Supervised Fine-Tuning(https://arxiv.org/abs/2509.23753)
Keywords: generation
Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide an analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT's reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.
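The shape of the objective described here, a probability-weighted SFT term plus a KL anchor, can be sketched for a single token position as follows. This is an assumption-laden illustration rather than the paper's loss: the function name, the `kl_weight` value, the KL direction (policy relative to a frozen reference model), and the treatment of the weight as a constant during differentiation are not specified in the abstract.

```python
import math


def asft_token_loss(policy_probs, ref_probs, target, kl_weight=0.1):
    """Loss value for one token position (no gradients computed here).

    policy_probs: current model's distribution over the vocabulary.
    ref_probs:    frozen reference model's distribution (assumed anchor).
    target:       index of the demonstrated token.
    """
    p = policy_probs[target]
    # DFT-style term: the usual -log p, reweighted by the token probability.
    # In an autograd framework the weight p would be detached (assumption).
    weighted_nll = -p * math.log(p)
    # Anchor: KL(policy || reference), assumed direction.
    kl = sum(q * math.log(q / r)
             for q, r in zip(policy_probs, ref_probs) if q > 0)
    return weighted_nll + kl_weight * kl


same = asft_token_loss([0.5, 0.5], [0.5, 0.5], target=0)
drifted = asft_token_loss([0.5, 0.5], [0.9, 0.1], target=0)
print(same, drifted)
```

When the policy matches the reference the KL term vanishes and only the reweighted SFT term remains; as the policy drifts from the reference the penalty grows, which is the anchoring effect the abstract attributes to ASFT.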

Title: UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

Authors: Xinyang Song, Libin Wang, Weining Wang, Shaozhen Liu, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23760
Pdf URL: https://arxiv.org/pdf/2509.23760
Copy Paste: [[2509.23760]] UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception(https://arxiv.org/abs/2509.23760)
Keywords: generation
Abstract: The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

Title: GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning

Authors: Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23770
Pdf URL: https://arxiv.org/pdf/2509.23770
Copy Paste: [[2509.23770]] GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning(https://arxiv.org/abs/2509.23770)
Keywords: generation, generative, quality assessment
Abstract: The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at this https URL.

Title: Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution

Authors: Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xinyu Zhou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23774
Pdf URL: https://arxiv.org/pdf/2509.23774
Copy Paste: [[2509.23774]] Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution(https://arxiv.org/abs/2509.23774)
Keywords: super-resolution, generative
Abstract: Vector-quantization (VQ) based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Due to the richness of visual signals, VQ encoding often leads to large quantization errors. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction errors into consideration, resulting in sub-optimal prior modeling accuracy. In this paper, we address these two issues and propose a Texture Vector-Quantization strategy and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the task characteristics of super-resolution and only introduces a codebook to model the prior of missing textures, while the reconstruction aware prediction strategy makes use of the straight-through estimator to directly train the index predictor with image-level supervision. Our proposed generative SR model (TVQ&RAP) is able to deliver photo-realistic SR results at a small computational cost.
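The nearest-codebook encoding that the abstract takes as its starting point, along with the quantization error it mentions, can be illustrated with a small sketch. The function name and toy codebook are hypothetical, and the straight-through estimator is not implemented here because it only matters under automatic differentiation, where the forward pass uses the selected code and the backward pass copies the gradient past the non-differentiable lookup.

```python
def quantize(feature, codebook):
    """Return (index, code, squared_error) for the nearest codebook entry.

    feature:  list of floats (one visual feature vector).
    codebook: list of code vectors with the same length as `feature`.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Nearest-neighbour lookup: the index with the smallest squared distance.
    index = min(range(len(codebook)),
                key=lambda i: sq_dist(feature, codebook[i]))
    code = codebook[index]
    return index, code, sq_dist(feature, code)


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, code, error = quantize([0.9, 1.2], codebook)
print(index, code, error)
```

The returned squared error is the information lost by replacing the feature with its code. Restricting the codebook to the missing textures, as the abstract proposes, is meant to keep this error small.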

Title: From Unstable to Playable: Stabilizing Angry Birds Levels via Object Segmentation

Authors: Mahdi Farrokhimaleki, Parsa Rahmati, Richard Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23787
Pdf URL: https://arxiv.org/pdf/2509.23787
Copy Paste: [[2509.23787]] From Unstable to Playable: Stabilizing Angry Birds Levels via Object Segmentation(https://arxiv.org/abs/2509.23787)
Keywords: generation
Abstract: Procedural Content Generation (PCG) techniques enable automatic creation of diverse and complex environments. While PCG facilitates more efficient content creation, ensuring consistently high-quality, industry-standard content remains a significant challenge. In this research, we propose a method to identify and repair unstable levels generated by existing PCG models. We use Angry Birds as a case study, demonstrating our method on game levels produced by established PCG approaches. Our method leverages object segmentation and visual analysis of level images to detect structural gaps and perform targeted repairs. We evaluate multiple object segmentation models and select the most effective one as the basis for our repair pipeline. Experimental results show that our method improves the stability and playability of AI-generated levels. Although our evaluation is specific to Angry Birds, our image-based approach is designed to be applicable to a wide range of 2D games with similar level structures.
摘要：程序内容生成（PCG）技术可以自动创建各种和复杂的环境。尽管PCG促进了更有效的内容创建，但确保始终如一的高质量，行业标准的内容仍然是一个重大挑战。在这项研究中，我们提出了一种识别和修复现有PCG模型产生的不稳定水平的方法。我们使用愤怒的小鸟作为案例研究，证明了我们在既定的PCG方法产生的游戏水平上的方法。我们的方法利用对象分割和电平图像的视觉分析来检测结构间隙并执行有针对性的维修。我们评估了多个对象分割模型，并选择最有效的对象分割模型作为维修管道的基础。实验结果表明，我们的方法提高了AI生成水平的稳定性和可玩性。尽管我们的评估特定于愤怒的小鸟，但我们的基于图像的方法旨在适用于具有相似水平结构的各种2D游戏。

Title: Controllable Generation of Large-Scale 3D Urban Layouts with Semantic and Structural Guidance

Authors: Mengyuan Niu, Xinxin Zhuo, Ruizhe Wang, Yuyue Huang, Junyan Yang, Qiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23804
Pdf URL: https://arxiv.org/pdf/2509.23804
Copy Paste: [[2509.23804]] Controllable Generation of Large-Scale 3D Urban Layouts with Semantic and Structural Guidance(https://arxiv.org/abs/2509.23804)
Keywords: generation
Abstract: Urban modeling is essential for city planning, scene synthesis, and gaming. Existing image-based methods generate diverse layouts but often lack geometric continuity and scalability, while graph-based methods capture structural relations yet overlook parcel semantics. We present a controllable framework for large-scale 3D vector urban layout generation, conditioned on both geometry and semantics. By fusing geometric and semantic attributes, introducing edge weights, and embedding building height in the graph, our method extends 2D layouts to realistic 3D structures. It also enables users to directly control the output by modifying semantic attributes. Experiments show that it produces valid, large-scale urban models, offering an effective tool for data-driven planning and design.
摘要：城市建模对于城市规划，场景综合和游戏至关重要。现有的基于图像的方法会产生各种布局，但通常缺乏几何连续性和可扩展性，而基于图的方法捕获结构关系，但却忽略了包裹语义。我们为大规模3D矢量城市布局产生的可控框架提供了可控的框架，并以几何和语义为条件。通过融合几何和语义属性，引入边缘重量并嵌入图中的建筑高度，我们的方法将2D布局扩展到现实的3D结构。它还使用户可以通过修改语义属性直接控制输出。实验表明，它产生了有效的大规模城市模型，为数据驱动的计划和设计提供了有效的工具。

Title: Space Group Conditional Flow Matching

Authors: Omri Puny, Yaron Lipman, Benjamin Kurt Miller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23822
Pdf URL: https://arxiv.org/pdf/2509.23822
Copy Paste: [[2509.23822]] Space Group Conditional Flow Matching(https://arxiv.org/abs/2509.23822)
Keywords: generation, generative
Abstract: Inorganic crystals are periodic, highly-symmetric arrangements of atoms in three-dimensional space. Their structures are constrained by the symmetry operations of a crystallographic space group and restricted to lie in specific affine subspaces known as Wyckoff positions. The frequency with which an atom appears in the crystal and its rough positioning are determined by its Wyckoff position. Most generative models that predict atomic coordinates overlook these symmetry constraints, leading to unrealistically high populations of proposed crystals exhibiting limited symmetry. We introduce Space Group Conditional Flow Matching, a novel generative framework that samples significantly closer to the target population of highly-symmetric, stable crystals. We achieve this by conditioning the entire generation process on a given space group and set of Wyckoff positions; specifically, we define a conditionally symmetric noise base distribution and a group-conditioned, equivariant, parametric vector field that restricts the motion of atoms to their initial Wyckoff position. Our form of group-conditioned equivariance is achieved using an efficient reformulation of group averaging tailored for symmetric crystals. Importantly, it reduces the computational overhead of symmetrization to a negligible level. We achieve state-of-the-art results on crystal structure prediction and de novo generation benchmarks. We also perform relevant ablations.
摘要：无机晶体是三维空间中原子的周期性，高度对称的排列。它们的结构受到晶体学\ emph {space group}的对称操作的约束，并仅限于被称为\ emph {wyckoff位置}的特定仿射子空间。原子出现在晶体中的频率，其粗糙定位取决于其Wyckoff位置。预测原子坐标的大多数生成模型忽略了这些对称性约束，从而导致提出的晶体数量不切实际，表现出有限的对称性。我们引入了空间组条件流匹配，这是一个新型的生成框架，该框架可以显着接近高度对称，稳定晶体的目标种群。我们通过在给定的空间组和一组Wyckoff位置上调节整个生成过程来实现这一目标；具体而言，我们定义了有条件的对称噪声基碱分布和组条件的，等效的参数矢量场，该场将原子的运动限制在其初始的Wyckoff位置。我们的群体条件均衡性的形式是使用针对对称晶体量身定制的\ emph {grout均值}的有效重新制定的。重要的是，它将对称化的计算开销降低到可忽略的水平。我们在晶体结构预测和从头产生基准的基准上实现了最先进的结果。我们还执行相关的消融。
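
To make the idea of group averaging concrete, here is a minimal sketch under simplifying assumptions: the group is the four planar quarter-turn rotations rather than a crystallographic space group, and `raw_field` is an arbitrary hypothetical vector field. This is the naive average over all group elements, not the paper's efficient reformulation.

```python
from fractions import Fraction


def rotate(v, k):
    """Rotate a 2-D vector by k quarter-turns counter-clockwise."""
    x, y = v
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)


def raw_field(v):
    """Hypothetical vector field with no rotational symmetry."""
    x, y = v
    return (x * x + 2 * y, 3 * x - y + 1)


def averaged_field(v):
    """Group average: mean over g of g^-1 applied to raw_field(g v)."""
    total_x, total_y = 0, 0
    for k in range(4):
        out = rotate(raw_field(rotate(v, k)), -k)
        total_x += out[0]
        total_y += out[1]
    # Exact rational arithmetic keeps the equivariance check free of rounding.
    return (Fraction(total_x, 4), Fraction(total_y, 4))
```

With integer inputs, `averaged_field(rotate(v, k))` equals `rotate(averaged_field(v), k)` for every quarter-turn `k`, whereas `raw_field` does not satisfy this. The cost of the naive average grows with the group size, which is the overhead the abstract says its reformulation makes negligible.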

Title: Electric Currents for Discrete Data Generation

Authors: Alexander Kolesov, Stepan Manukhov, Vladimir V. Palyulin, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.23825
Pdf URL: https://arxiv.org/pdf/2509.23825
Copy Paste: [[2509.23825]] Electric Currents for Discrete Data Generation(https://arxiv.org/abs/2509.23825)
Keywords: generation
Abstract: We propose Electric Current Discrete Data Generation (ECD$^{2}$G), a pioneering method for data generation in discrete settings that is grounded in electrical engineering theory. Our approach draws an analogy between electric current flow in a circuit and the transfer of probability mass between data distributions. We interpret samples from the source distribution as current input nodes of a circuit and samples from the target distribution as current output nodes. A neural network is then used to learn the electric currents to represent the probability flow in the circuit. To map the source distribution to the target, we sample from the source and transport these samples along the circuit pathways according to the learned currents. This process provably guarantees transfer between data distributions. We present proof-of-concept experiments to illustrate our ECD$^{2}$G method.
摘要：We propose $\textbf{E}$lectric $\textbf{C}$urrent $\textbf{D}$iscrete $\textbf{D}$ata $\textbf{G}$eneration (ECD$^{2}$G), a pioneering method for data generation in discrete settings that is grounded in electrical engineering theory.我们的方法在电路中的电流流与数据分布之间的概率质量转移之间进行了类比。我们将源分布中的样本解释为电路的当前输入节点，而来自目标分布的样品则将样本作为电流输出节点。然后使用神经网络来学习电流以表示电路中的概率流。为了将源分布映射到目标，我们从源来采样并根据学习电流将这些样品沿电路路径传输。事实证明，此过程可以保证数据分布之间的转移。我们提供概念证明实验，以说明我们的ECD $^{2} $ G方法。

Title: Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation

Authors: Hanyu Zhou, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23828
Pdf URL: https://arxiv.org/pdf/2509.23828
Copy Paste: [[2509.23828]] Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation(https://arxiv.org/abs/2509.23828)
Keywords: generation
Abstract: Vision-language models (VLMs) have demonstrated strong performance in 2D scene understanding and generation, but extending this unification to the physical world remains an open challenge. Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. This paradigm gap prevents a single model from jointly handling both tasks, especially in dynamic 4D settings where spatiotemporal modeling is critical. We propose Uni4D-LLM, the first unified VLM framework with spatiotemporal awareness for 4D scene understanding and generation. Our design is guided by two key insights: 1) Unification requires a shared representation. We extract semantic features for understanding and noise-injected appearance features for generation, incorporate 4D geometric cues, and fuse them into a spatiotemporal-aware visual representation through adaptive cross-attention. 2) Unification requires a shared architecture. Both autoregression and diffusion are built on Transformer backbones, and this enables integration into a single LLM with task-specific heads. By aligning visual and linguistic representations, our Uni4D-LLM produces predictions for both understanding and generation within one Transformer-based framework. We further apply instruction fine-tuning on diverse 4D vision-language datasets to improve generalization across tasks. Extensive experiments on multiple benchmarks demonstrate that Uni4D-LLM achieves competitive or superior results compared to state-of-the-art models and offers the first true unification of 4D scene understanding and generation.
摘要：视觉语言模型（VLM）在2D场景的理解和产生中表现出了很强的表现，但是将这种统一扩展到物理世界仍然是一个开放的挑战。现有的3D和4D方法通常将场景几何形状嵌入自回归模型中，以进行语义理解和内容生成的扩散模型。该范式间隙阻止了单个模型共同处理这两个任务，尤其是在时空建模至关重要的动态4D设置中。我们提出了Uni4D-LLM，这是第一个具有时空意识的统一VLM框架，以实现4D场景的理解和产生。我们的设计以两个关键见解为指导：1）统一需要共享表示。我们提取语义特征，用于理解和嘈杂的注入生成，结合4D几何提示，并通过自适应交叉注意将其融合到时空感知的视觉表示中。 2）统一需要共享的体系结构。自动进度和扩散均建立在变压器骨架上，这使得可以集成到具有特定于任务的头部的单个LLM中。通过使视觉和语言表示形式保持一致，我们的Uni4D-LLM可以在一个基于变压器的框架内对理解和生成产生预测。我们进一步对不同的4D Vision语言数据集进行了微调，以改善跨任务的概括。与最先进的模型相比，在多个基准测试的广泛实验表明，UNI4D-LLM取得了竞争性或优越的结果，并提供了4D场景理解和发电的第一个真实统一。

Title: Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric

Authors: Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23841
Pdf URL: https://arxiv.org/pdf/2509.23841
Copy Paste: [[2509.23841]] Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric(https://arxiv.org/abs/2509.23841)
Keywords: generation, generative, quality assessment
Abstract: Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at this https URL.
摘要：文本到3D（T23D）生成模型的最新进展已使文本提示中的多种高保真3D资产合成。但是，现有的挑战限制了可靠的T23D质量评估（T23DQA）的发展。首先，现有的基准测试过时，分散且粗粒，使细粒度的度量训练变得不可行。此外，当前的客观指标表现出固有的设计局限性，导致非代表性特征提取和度量鲁棒性降低。为了解决这些局限性，我们引入了T23D-Compbench，这是组成T23D代的综合基准。我们为组成提示定义了具有十二个子组件的五个组件，这些组成提示可以从十种最先进的生成模型中生成3,600个纹理网格。进行了大规模的主观实验，以收集不同观点的129,600个可靠的人类评级。基于T23D-COMPBENCH，我们进一步提出了Rank2Score，这是一种有效的评估者，对T23DQA进行了两阶段培训。 rank2score在第一阶段通过监督的对比回归和课程学习来增强成对训练，随后使用平均意见得分来完善预测，以在第二阶段与人类判断更紧密地保持一致性。广泛的实验和下游应用程序表明，rank2score始终超过多个维度的现有指标，并且还可以作为优化生成模型的奖励功能。该项目可在此HTTPS URL上找到。

Title: Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models

Authors: Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai, Xiuying Wang, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23876
Pdf URL: https://arxiv.org/pdf/2509.23876
Copy Paste: [[2509.23876]] Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models(https://arxiv.org/abs/2509.23876)
Keywords: generation
Abstract: Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.
摘要：基于隔壁预测的自动回归（AR）模型正在迅速成为图像生成的强大工具，但它们面临着关键的弱点：通过渐进分辨率缩放引入的跨时间段的补丁之间的信息不一致。这些不一致之处分散了指导信号，导致它们摆脱了条件信息，并留下了模棱两可，不忠实的功能。我们使用信息扎根指导（IgG）来应对这一挑战，这是一种新型机制，可通过注意力将指导锚定为语义上重要的区域。通过在抽样过程中自适应加强信息贴片，IgG确保指导和内容保持紧密状态。在班级条件和文本对图像生成任务中，IgG可提供更清晰，更连贯和语义上的图像，为基于AR的方法设定了新的基准测试。

Title: Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering

Authors: Rakesh Thakur, Yusra Tariq, Rakesh Chandra Joshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23899
Pdf URL: https://arxiv.org/pdf/2509.23899
Copy Paste: [[2509.23899]] Q-FSRU: Quantum-Augmented Frequency-Spectral For Medical Visual Question Answering(https://arxiv.org/abs/2509.23899)
Keywords: generation
Abstract: Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using the Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum-inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image-text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.
摘要：解决需要图像和文本理解的艰难临床问题仍然是医疗保健AI的主要挑战。在这项工作中，我们提出了Q-FSRU，这是一种新模型，将频谱表示和融合（FSRU）与一种称为量子检索效果生成（量子rag）的方法相结合，以用于医学视觉问题答案（VQA）。该模型从医学图像和相关文本中汲取功能，然后使用快速傅立叶变换（FFT）将其转移到频域。这有助于它专注于更有意义的数据，并过滤噪声或更少的有用信息。为了提高准确性并确保答案基于真实知识，我们添加了一个受量子启发的检索系统。它使用基于量子的相似性技术从外部来源获取有用的医学事实。然后将这些细节与基于频率的功能合并，以实现更强的推理。我们使用VQA-RAD数据集评估了我们的模型，其中包括实际的放射学图像和问题。结果表明，Q-FSRU优于早期模型，尤其是在需要图像文本推理的复杂案例上。频率和量子信息的组合可以提高性能和解释性。总体而言，这种方法提供了一种为医生构建智能，清晰和有用的AI工具的有前途的方法。
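
The step of moving features into the frequency domain and suppressing noise can be sketched as follows. This is only an illustration under stated assumptions: it uses a naive O(n^2) discrete Fourier transform rather than a true FFT, the 1-D signal and the `keep` cutoff are hypothetical, and nothing here reproduces the Q-FSRU architecture.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def low_pass(signal, keep):
    """Zero every bin whose frequency index exceeds `keep`, then invert."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n - k carry the same frequency, so compare min(k, n - k).
    filtered = [c if min(k, n - k) <= keep else 0
                for k, c in enumerate(spectrum)]
    return inverse_dft(filtered)


# Hypothetical feature row: a constant signal plus alternating noise.
clean = [1.0] * 8
noisy = [c + (0.5 if t % 2 == 0 else -0.5) for t, c in enumerate(clean)]
smoothed = low_pass(noisy, keep=1)
```

The alternating noise sits entirely in the highest-frequency bin, so removing that bin recovers the constant signal up to floating-point rounding.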

Title: EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Authors: Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23909
Pdf URL: https://arxiv.org/pdf/2509.23909
Copy Paste: [[2509.23909]] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling(https://arxiv.org/abs/2509.23909)
Keywords: generative
Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
摘要：指导指导的图像编辑取得了显着的进步，但是当前的模型仍然面临着复杂的指令面临挑战，并且通常需要多个样品来产生预期的结果。强化学习（RL）提供了一个有希望的解决方案，但是由于缺乏高保真，有效的奖励信号，其在图像编辑中的采用受到了严重阻碍。在这项工作中，我们提出了一种综合方法，以克服这一障碍，以最先进的专业奖励模型的发展为中心。我们首先介绍了Editreward Bench，这是一个全面的基准，用于系统地评估编辑质量的奖励模型。在此基准测试的基础上，我们开发了EditScore，这是一系列奖励模型（7b-72b），用于评估指导引导的图像编辑的质量。通过细致的数据策展和过滤，EditsCore有效地匹配了学习专有VLM的性能。此外，再加上针对Editscore的生成性质量身定制的有效自我安装策略，我们最大的变体甚至超过了基准中的GPT-5。然后，我们证明高保真奖励模型是解锁在线RL进行图像编辑的关键。我们的实验表明，尽管即使是最大的开源VLM，也无法提供有效的学习信号，但EditScore可以实现有效且健壮的策略优化。将我们的框架应用于强大的基本模型Omnigen2，导致最终模型显示出实质性和一致的性能提升。总体而言，这项工作提供了从基准测试到奖励建模再到图像编辑中的RL培训的第一个系统路径，表明高保真性，领域专有的奖励模型是释放该域中RL的全部潜力的关键。

Title: MoReact: Generating Reactive Motion from Textual Descriptions

Authors: Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23911
Pdf URL: https://arxiv.org/pdf/2509.23911
Copy Paste: [[2509.23911]] MoReact: Generating Reactive Motion from Textual Descriptions(https://arxiv.org/abs/2509.23911)
Keywords: generation
Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. As a result, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model generates realistic motion sequences for individuals responding to the other's actions, based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with the given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at this https URL.
摘要：建模和产生人类反应对计算机视觉和人类计算机相互作用的广泛应用构成了重大挑战。现有方法要么将多个个体视为一个单一实体，直接产生相互作用，要么仅依靠一个人的运动来产生对方的反应，而无法整合人类互动的丰富语义信息。然而，这些方法通常缺乏自适应响应能力，即能够准确响应多样化和动态的相互作用方案的能力。认识到这一差距，我们的工作介绍了一种量身定制的方法，该方法通过着重于文本驱动的人类反应产生来解决现有模型的局限性。我们的模型专门为个人生成了现实的运动序列，这些运动序列是基于相互作用方案的描述性文本响应对方的动作的。目的是产生运动序列，不仅与对手的运动相辅相成，还可以在语义上符合所描述的相互作用。为了实现这一目标，我们提出了一种基于扩散的方法，旨在依次逐步消除全局轨迹和局部运动的产生。这种方法源于这样的观察，即首先产生全球轨迹对于指导当地运动至关重要，从而确保更好地与给定的动作和文本保持一致。此外，我们引入了一种新颖的相互作用损失，以增强产生的紧密相互作用的现实主义。我们的实验利用了根据两人运动数据集进行的数据，证明了我们方法对这项新任务的疗效，该任务能够产生现实，多样化和可控的反应，不仅与对应物的运动非常匹配，而且还遵守了文本指南。请在此HTTPS URL上找到我们的网页。

Title: Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Authors: Longtao Jiang, Mingfei Han, Lei Chen, Yongqiang Yu, Feng Zhao, Xiaojun Chang, Zhihui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23919
Pdf URL: https://arxiv.org/pdf/2509.23919
Copy Paste: [[2509.23919]] Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models(https://arxiv.org/abs/2509.23919)
Keywords: generation
Abstract: Text-guided image inpainting aims to inpaint masked image regions based on a textual prompt while preserving the background. Although diffusion-based methods have become dominant, their property of modeling the entire image in latent space makes it challenging for the results to align well with prompt details and maintain a consistent background. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating latent tokens corresponding to mask regions, enabling better local controllability without altering the background. However, directly applying MAR to this task makes the inpainting content either ignore the prompts or be disharmonious with the background context. Through analysis of the attention maps from the inpainting images, we identify the impact of background tokens on text tokens during MAR generation, and leverage this to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses the semantic and context information from text and background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content while remaining harmonious with the background context. (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further enhance the alignment of prompt details and the visual quality of the content. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods across almost all metrics and delivers superior visual results. Code will be released.
摘要：文本指导的图像介绍旨在在保留背景的同时，基于文本提示，旨在根据文本提示进行遮盖图像区域。尽管基于扩散的方法已经占主导地位，但它们在潜在空间中对整个图像进行建模的属性使得结果使得与及时的详细信息保持良好状态并保持一致的背景变得具有挑战性。为了解决这些问题，我们探索了此任务的面具自动回归（MAR）模型。 MAR自然支持与面罩区域相对应的潜在标记来支撑图像，从而在不改变背景的情况下可以更好地局部可控性。但是，直接将MAR应用于此任务使介入内容可以忽略提示或对背景上下文不和谐。通过分析来自介入图像的注意力图，我们确定了背景令牌对MAR生成期间文本令牌的影响，并利用它来设计Token Painter，这是一种基于MAR的无训练文本引导的图像介入方法。我们的方法介绍了两个关键组成部分：（1）双流编码器信息融合（DEIF），该信息融合了频域中文本和背景的语义和上下文信息，以产生新颖的指导令牌，从而使MAR能够生成忠实于文本的修复内容，同时与背景上下文保持协调。（2）自适应解码器的注意力评分增强（ADAE），可适应性地提高引导令牌和内置令牌的注意力评分，以进一步增强及时的细节和内容视觉质量的对齐。广泛的实验表明，我们的无训练方法在几乎所有指标上都优于先前的最新方法，并带来卓越的视觉结果。代码将发布。

Title: HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models

Authors: Zhinan Xie, Peisong Wang, Jian Cheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.23928
Pdf URL: https://arxiv.org/pdf/2509.23928
Copy Paste: [[2509.23928]] HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models(https://arxiv.org/abs/2509.23928)
Keywords: generation
Abstract: Speculative decoding is an effective approach for accelerating inference in Large Language Models (LLMs), but its adaptation to Vision-Language Models (VLMs) remains challenging because of the additional visual tokens in multimodal inputs. First, because the drafter and the target VLM may be derived from different families, the semantic representations of visual tokens in the target VLM are misaligned with those in the drafter, introducing bias into the KV-cache during the prefill stage. Second, the large number of visual tokens substantially slows down the drafter's self-attention during the decoding stage. We propose Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models (HiViS), an explicit-implicit input decomposition framework that alleviates the above inefficiency. All visual tokens are removed from the drafter's input, retaining only textual tokens as explicit inputs, while directly reusing the target VLM's corresponding last-layer hidden states as implicit visual information without additional processing. To train the drafter efficiently, we introduce a multi-step self-feedback training strategy with dynamic data selection and sequential embedding supervision to simulate reasoning during training. Our approach compresses the prefill sequence length of the drafter to only 0.7%-1.3% of the target VLM's input, while maintaining lossless generation quality. Extensive experiments across diverse models and tasks demonstrate up to 2.65x speedup, confirming the effectiveness of HiViS in accelerating VLM inference.
摘要：投机解码是加速大型语言模型（LLMS）推断的有效方法，但是它对视觉模型（VLM）的适应性对于多模式输入中的其他视觉令牌仍然具有挑战性。首先，由于起草者和目标VLM可能来自不同家族的事实，因此目标VLM中的视觉令牌的语义表示与起草者中的那些人对齐，在预填充阶段将偏见引入KV-CACHE。其次，在解码阶段，大量的视觉令牌大大减慢了起草者的自我注意力。我们提出了从起草者中隐藏视觉令牌，以在视觉模型（HIVIS）中进行投机解码，这是一种显着的输入分解框架，可减轻上述低效效率。所有视觉令牌都从起草者的输入中删除，仅保留文本令牌作为显式输入，同时直接使用目标VLM的相应的最后层隐藏状态作为隐式视觉信息，而无需其他处理。为了有效地训练起草人，我们通过动态数据选择和顺序嵌入监督介绍了多步自我反馈训练策略，以模拟培训期间的推理。我们的方法将起草者的预填充序列长度压缩至目标VLM输入的0.7％-1.3％，同时保持无损发电质量。跨不同模型和任务的广泛实验表明高达2.65倍的速度，证实了Hivis在加速VLM推理中的有效性。
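
For readers unfamiliar with the draft-then-verify loop behind speculative decoding, a minimal greedy sketch follows. Both "models" are hypothetical toy functions, the drafter is made to disagree with the target on purpose, and the visual-token handling that HiViS contributes is not modeled at all.

```python
def target_next(tokens):
    """Hypothetical target model: a deterministic next-token rule."""
    return (sum(tokens) * 3 + len(tokens)) % 7


def draft_next(tokens):
    """Hypothetical drafter: matches the target unless the target emits 0."""
    guess = target_next(tokens)
    return guess if guess != 0 else 1


def greedy_decode(prompt, steps):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, steps, draft_len=4):
    """Return (tokens, rounds); each round drafts a block and verifies it."""
    tokens = list(prompt)
    goal = len(prompt) + steps
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # Draft phase: the cheap model proposes draft_len tokens in a row.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # Verify phase: keep the target's token at every position and stop
        # at the first disagreement. A real system would score all drafted
        # positions in a single target forward pass.
        for proposed in draft:
            correct = target_next(tokens)
            tokens.append(correct)
            if proposed != correct or len(tokens) == goal:
                break
    return tokens, rounds
```

Because every accepted token is the one the target would have produced, the output matches plain greedy decoding, which is the sense in which generation stays lossless; any speedup comes from needing fewer verification rounds than tokens.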

Title: Brain-language fusion enables interactive neural readout and in-silico experimentation

Authors: Victoria Bosch, Daniel Anthes, Adrien Doerig, Sushrut Thorat, Peter König, Tim Christian Kietzmann
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2509.23941
Pdf URL: https://arxiv.org/pdf/2509.23941
Copy Paste: [[2509.23941]] Brain-language fusion enables interactive neural readout and in-silico experimentation(https://arxiv.org/abs/2509.23941)
Keywords: generative
Abstract: Large language models (LLMs) have revolutionized human-machine interaction, and have been extended by embedding diverse modalities such as images into a shared language space. Yet, neural decoding has remained constrained by static, non-interactive methods. We introduce CorText, a framework that integrates neural activity directly into the latent space of an LLM, enabling open-ended, natural language interaction with brain data. Trained on fMRI data recorded during viewing of natural scenes, CorText generates accurate image captions and can answer more detailed questions better than controls, while having access to neural data only. We showcase that CorText achieves zero-shot generalization beyond semantic categories seen during training. Furthermore, we present a counterfactual analysis that emulates in-silico cortical microstimulation. These advances mark a shift from passive decoding toward generative, flexible interfaces between brain activity and language.
摘要：大型语言模型（LLM）彻底改变了人机的互动，并通过将图像等多种方式（例如图像）嵌入共享语言空间来扩展。然而，神经解码仍然受到静态的非相互作用方法的限制。我们介绍了Cortext，这是一个将神经活动直接集成到LLM的潜在空间的框架，从而使开放式的自然语言与大脑数据相互作用。在观看自然场景期间记录的fMRI数据培训后，Cortext生成了准确的图像标题，并且可以比控件更好地回答更详细的问题，同时仅访问神经数据。我们展示了Cortext在训练过程中所看到的语义类别以外的零拍概括。此外，我们提出了反事实分析，该分析模仿了丝状皮质微刺激。这些进步标志着从被动解码转向大脑活动和语言之间的生成，灵活的界面。

Title: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

Authors: Kaisen Yang, Lixuan He, Rushi Shah, Kaicheng Yang, Qinwei Ma, Dianbo Liu, Alex Lamb
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2509.23946
Pdf URL: https://arxiv.org/pdf/2509.23946
Copy Paste: [[2509.23946]] Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm(https://arxiv.org/abs/2509.23946)
Keywords: generation
Abstract: Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT), augmented by a novel data generation algorithm enforcing strict plan adherence, with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: this https URL
摘要：经过思考链（COT）及其变体显着提高了大语言模型（LLMS）的推理能力，但是它们的整体和自动回归体系结构本质上将高级战略计划与逐步执行的低水平执行相结合，导致计算不足，导致计算效率低下，有限地探索了推理路径的探索，并降低了解释能力。为了克服这些问题，我们提出了探索执行链（$ e^2c $），这是一个结构化的推理框架，将推理分为两个不同的阶段：探索性阶段，该阶段随机地生成简洁的高级计划，然后是执行阶段，然后确定执行选定的计划。 Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT) - augmented by a novel data generation algorithm enforcing strict plan adherence - with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of this http URL decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test时间缩放使用可比方法（例如，思想森林）所需的解码令牌的<10％，达到了58.1％的精度，并急剧切断了自持不一致的开销。对于跨域的适应，我们以探索为中心的SFT（EF-SFT）微型调整只有标准SFT使用的令牌的3.5％，但与医疗基准的标准SFT相比，准确性高达14.5％，通过医疗基准，提供最新的效果，强大的概括，强大的普遍性以及通过与执行分离计划。该项目的代码和预培训模型可在以下网址提供：此HTTPS URL

Title: HunyuanImage 3.0 Technical Report

Authors: Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23951
Pdf URL: https://arxiv.org/pdf/2509.23951
Copy Paste: [[2509.23951]] HunyuanImage 3.0 Technical Report(https://arxiv.org/abs/2509.23951)
Keywords: generation, generative
Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at this https URL
摘要：我们提出了Hunyuanimage 3.0，这是一种本机的多模式模型，在自动回归框架内统一了多模式的理解和生成，其图像生成模块可公开可用。 Hunyuanimage 3.0的实现取决于几个关键组成部分，包括细致的数据策展，高级体系结构设计，一项本地经营链模式，渐进式模型预训练，积极进取的模型后培训以及有效的基础架构，可以实现大型培训和选择。随着这些进步，我们成功训练了一个超过800亿个参数的专家型（MOE）模型，在推断过程中，每个令牌都激活了130亿个参数，使其成为迄今为止最大，最强大的开放源图像生成模型。我们进行了广泛的实验以及对文本图像对齐和视觉质量的自动和人类评估的结果表明，Hunyuanimage 3.0竞争对手先前的最新模型。通过释放Hunyuanimage 3.0的代码和权重，我们旨在使社区能够通过最先进的基础模型探索新想法，从而促进动态和充满活力的多模式生态系统。所有开源资产均可在此HTTPS URL上公开获得

Title: ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

Authors: Shilan Zhang, Jirui Huang, Ruilin Yao, Cong Wang, Yaxiong Chen, Peng Xu, Shengwu Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23955
Pdf URL: https://arxiv.org/pdf/2509.23955
Copy Paste: [[2509.23955]] ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation(https://arxiv.org/abs/2509.23955)
Keywords: generation
Abstract: Referring Expression Comprehension (REC) and Referring Expression Generation (REG) are fundamental tasks in multimodal understanding, supporting precise object localization through natural language. However, existing REC and REG datasets rely heavily on manual annotation, which is labor-intensive and difficult to scale. In this paper, we propose ColLab, a collaborative spatial progressive data engine that enables fully automated REC and REG data generation without human supervision. Specifically, our method introduces a Collaborative Multimodal Model Interaction (CMMI) strategy, which leverages the semantic understanding of multimodal large language models (MLLMs) and large language models (LLMs) to generate descriptions. Furthermore, we design a module termed Spatial Progressive Augmentation (SPA) to enhance spatial expressiveness among duplicate instances. Experiments demonstrate that ColLab significantly accelerates the annotation process of REC and REG while improving the quality and discriminability of the generated expressions. In addition to the core methodological contribution, our framework was partially adopted in the data generation pipeline of the ICCV 2025 MARS2 Challenge on Multimodal Reasoning, enriching the dataset with diverse and challenging samples that better reflect real-world reasoning demands.
摘要：引用表达理解（REC）和转介表达生成（REG）是多模式理解中的基本任务，通过自然语言支持精确的对象定位。但是，现有的REC和REG数据集在很大程度上依赖手动注释，该注释是劳动密集型且难以扩展的。在本文中，我们提出了合作，这是一种协作空间渐进数据引擎，可以在没有人类监督的情况下实现完全自动化的REC和REG数据。具体而言，我们的方法引入了协作多模型互动（CMMI）策略，该策略利用了对多模式大语言模型（MLLM）和大语言模型（LLMS）的语义理解来生成描述。此外，我们设计了一个称为“空间渐进式增强”（SPA）的模块，以增强重复实例之间的空间表现力。实验表明，合作可以显着加速REC和REG的注释过程，同时提高生成表达式的质量和可区分性。除了核心方法论贡献外，我们的框架在ICCV 2025 MARS2挑战的数据生成管道中部分采用了多模式推理，以多种多样的挑战性样本丰富了数据集，这些样本更好地反映了现实世界中的推理需求。

Title: Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution

Authors: Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Yulun Zhang, Jian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.23980
Pdf URL: https://arxiv.org/pdf/2509.23980
Copy Paste: [[2509.23980]] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution(https://arxiv.org/abs/2509.23980)
Keywords: super-resolution, generative
Abstract: Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient one-step diffusion model with attention specialization for real-world video super-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a 6.2$\times$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at this https URL.
摘要：扩散模型最近显示了视频超分辨率（VSR）的有希望的结果。但是，将生成扩散模型直接调整为VSR可能会导致冗余，因为低质量的视频已经保留了大量内容信息。由于模型执行多余的操作，并且必须学会过滤无关的信息，因此这种冗余会导致计算开销和学习负担的增加。为了解决这个问题，我们提出了Oasis，这是一种有效的$ \ textbf {o} $ ne-step扩散模型，使用$ \ textbf {a} $ ttention $ \ textbf {s} $ pecialization for Realld v $ \ textbf {I} OASIS结合了一个注意专业路由，该路由将注意力根据其内在行为分配给不同的模式。这种路线可以减轻冗余，同时有效地保留了经过预定的知识，从而使扩散模型可以更好地适应VSR并实现更强的性能。此外，我们提出了一种简单而有效的渐进式培训策略，该策略始于时间一致的降解，然后转变为不一致的设置。这种策略促进了复杂降解下的学习。广泛的实验表明，OASIS在合成和现实数据集上都能达到最先进的性能。 OASIS还提供了卓越的推理速度，并提供了$ \ textbf {6.2 $ \ times $} $ speedup在一步扩散基线（例如Seedvr2）上。该代码将在\ href {this HTTPS url} {this HTTPS url}上可用。

Title: SIE3D: Single-image Expressive 3D Avatar generation via Semantic Embedding and Perceptual Expression Loss

Authors: Zhiqi Huang, Dulongkai Cui, Jinglu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24004
Pdf URL: https://arxiv.org/pdf/2509.24004
Copy Paste: [[2509.24004]] SIE3D: Single-image Expressive 3D Avatar generation via Semantic Embedding and Perceptual Expression Loss(https://arxiv.org/abs/2509.24004)
Keywords: generation
Abstract: Generating high-fidelity 3D head avatars from a single image is challenging, as current methods lack fine-grained, intuitive control over expressions via text. This paper proposes SIE3D, a framework that generates expressive 3D avatars from a single image and descriptive text. SIE3D fuses identity features from the image with semantic embedding from text through a novel conditioning scheme, enabling detailed control. To ensure generated expressions accurately match the text, it introduces an innovative perceptual expression loss function. This loss uses a pre-trained expression classifier to regularize the generation process, guaranteeing expression accuracy. Extensive experiments show SIE3D significantly improves controllability and realism, outperforming competitive methods in identity preservation and expression fidelity on a single consumer-grade GPU. Project page: this https URL
摘要：从单个图像中生成高保真3D头像很具有挑战性，因为当前方法缺乏通过文本对表达式的细粒度，直观的控制。本文提出了SIE3D，该框架从单个图像和描述性文本中生成表达的3D化身。 SIE3D融合了图像中的身份特征，并从文本到新颖的调理方案进行语义嵌入，从而实现了详细的控制。为了确保生成的表达式准确匹配文本，它引入了创新的感知表达损失函数。该损失使用预先训练的表达分类器来正规化生成过程，从而确保表达精度。广泛的实验表明，SIE3D显着提高了可控性和现实主义，在单个消费级GPU上的身份保存和表达忠诚度优于竞争方法。项目页面：此HTTPS URL

Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2509.24006
Pdf URL: https://arxiv.org/pdf/2509.24006
Copy Paste: [[2509.24006]] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention(https://arxiv.org/abs/2509.24006)
Keywords: generation
Abstract: In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
摘要：在扩散变压器（DIT）模型中，尤其是对于视频产生的模型，由于长度长度和二次复杂性，注意潜伏期是一个主要的瓶颈。我们发现，注意力的重量可以分为两个部分：一小部分具有较高等级的重量，其余的重量非常低。这自然表明将稀疏的加速度应用于第一部分，并将其施加到第二部分。基于这一发现，我们提出了SLA（稀疏线性注意），这是一种可训练的注意方法，融合了稀疏和线性的关注加速扩散模型。 SLA将注意力的重量分为关键，边际和可以忽略不计的类别，将注意力（n^2）应用于关键权重，O（n）注意边缘权重以及跳过可忽略的重量。 SLA将这些计算结合到单个GPU内核中，并支持向前和向后通过。 DIT模型仅使用SLA进行了几个微调步骤，因此注意力计算减少了20倍，从而导致显着加速，而不会损失发电质量。实验表明，SLA将注意力计算降低了95％，而不会降低端到端生成质量，优于基线方法。此外，我们为SLA实施了有效的GPU内核，该核心在注意力计算中产生13.7倍的速度，在WAN2.1-1.3B上的视频生成中的2.2倍端到端加速。
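
Illustrative sketch (not from the paper): the abstract describes sorting attention weights into critical, marginal, and negligible categories. The snippet below shows one simple way such a three-way split could look for a single row of weights. The function name `classify_weights` and the rank-based rule with fixed counts are assumptions made for illustration; the abstract does not specify how the authors decide the categories.

```python
from typing import List


def classify_weights(weights: List[float], n_critical: int, n_negligible: int) -> List[str]:
    """Label each weight as 'critical', 'marginal', or 'negligible' by rank.

    Hypothetical rule: the n_critical largest weights are critical, the
    n_negligible smallest are negligible, and everything else is marginal.
    """
    n = len(weights)
    if n_critical < 0 or n_negligible < 0 or n_critical + n_negligible > n:
        raise ValueError("category counts must be non-negative and fit in the row")
    # Indices sorted from largest to smallest weight.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    labels = ["marginal"] * n
    for i in order[:n_critical]:
        labels[i] = "critical"      # would receive exact O(N^2) attention
    for i in order[n - n_negligible:]:
        labels[i] = "negligible"    # would be skipped entirely
    return labels


row = [0.5, 0.01, 0.2, 0.02, 0.27]
print(classify_weights(row, n_critical=1, n_negligible=2))
```

On the reported numbers: a 95% reduction leaves 5% of the attention computation, and 1 / 0.05 = 20, which matches the 20x figure quoted in the same abstract.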

Title: Pretraining Scaling Laws for Generative Evaluations of Language Models

Authors: Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24012
Pdf URL: https://arxiv.org/pdf/2509.24012
Copy Paste: [[2509.24012]] Pretraining Scaling Laws for Generative Evaluations of Language Models(https://arxiv.org/abs/2509.24012)
Keywords: generative
Abstract: Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling law parameters and the predictability of performance. (2) In terms of scaling law parameters, we find that the compute scaling law and the parameters+tokens scaling law stabilize for the last ~$1.5$-$2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the gold reference likelihood scaling law predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
摘要：神经缩放定律在现代机器学习中发挥了核心作用，推动了该领域不断扩展的参数，数据和计算的缩放。尽管许多研究已经进行了拟合缩放定律，并预测了预处理损失和歧视性评估（例如多项选择的提问）的绩效，但对拟合缩放定律的研究相对较少的研究和对诸如数学问题解决问题或软件工程等生成评估的绩效进行了相对较少的研究。我们建议并评估三种不同的预刻度法定定律，以适合生成评估，并使用廉价型号的性能来预测最昂贵的模型的通行证。我们的三个缩放定律在所使用的协变量上有所不同：（1）计算，（2）模型参数和令牌，（3）黄金参考溶液的对数似然。我们做出四个主要贡献：（1）我们展示了生成评估如何提供新的超参数（在我们的环境中，$ k $），研究人员可以用来控制缩放法律参数和性能的可预测性。（2）在缩放法律参数方面，我们发现计算缩放法律和参数\，+\，代币缩放法稳定了最后一个〜$ 1.5 { - } 2.5 $的数量级，而黄金参考的可能性缩放法律稳定了最后一个〜$ 5 $ 5 $。（3）在预测性能方面，我们发现所有三个缩放定律的性能都相当，尽管计算缩放定律预测小$ k $的情况稍差，而黄金参考解决方案的日志可能性预测，对于大$ k $而言，金额的可能性较差。（4）我们建立了一种理论上的联系，即计算缩放定律随着参数和tokens缩放定律的计算最佳包络的出现。我们的框架为研究人员和从业人员提供了洞察力和方法，以预测生成性能。
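
Illustrative sketch (not from the paper): the scaling laws above are fit to pass-at-$k$. A common way to estimate pass-at-$k$ from n sampled attempts, of which c are correct, is the unbiased estimator 1 - C(n-c, k) / C(n, k). Whether the authors use this exact estimator is an assumption; the abstract does not say. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts is correct.

    n: total attempts sampled for a problem
    c: number of those attempts that were correct
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # C(n - c, k) counts the size-k subsets that contain no correct attempt;
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 5 correct, a single attempt succeeds half the time.
print(pass_at_k(n=10, c=5, k=1))
```

Because $k$ enters only through this estimator, sweeping it is cheap once the n samples exist, which is what makes it usable as the kind of hyperparameter the abstract describes.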

Title: A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks

Authors: Bo Hu, José C. Príncipe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24076
Pdf URL: https://arxiv.org/pdf/2509.24076
Copy Paste: [[2509.24076]] A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks(https://arxiv.org/abs/2509.24076)
Keywords: generative
Abstract: Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.
摘要：基于成对距离的成本对于自我监督和对比的特征学习至关重要。混合密度网络（MDN）是一种使用神经网络来产生定义高斯混合物的多个中心，是一种广泛使用的生成模型和密度近似方法。通过将MDN与对比成本相结合，本文提出了使用四种类型的二型矩阵成本的数据密度近似：标量成本，矢量 - 矩阵成本，矩阵矩阵成本，矩阵矩阵成本（Schur补充的痕迹）和SVD成本（核标准）（核标准），以学习多个中心来定义混合物量度。

Title: Autoregressive Video Generation beyond Next Frames Prediction

Authors: Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Yinfei Yang, Jiasen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24081
Pdf URL: https://arxiv.org/pdf/2509.24081
Copy Paste: [[2509.24081]] Autoregressive Video Generation beyond Next Frames Prediction(https://arxiv.org/abs/2509.24081)
Keywords: generation
Abstract: Autoregressive models for video generation typically operate frame-by-frame, extending next-token prediction from language to video's temporal dimension. We question whether the frame is an appropriate prediction unit, given that video, unlike language with its universally agreed word-as-token unit, has no such consensus. To address this, we present VideoAR, a unified framework that supports a spectrum of prediction units including full frames, key-detail frames, multiscale refinements, and spatiotemporal cubes. Among these designs, we find that modeling video generation with spatiotemporal cubes as prediction units allows autoregressive models to operate across both spatial and temporal dimensions simultaneously. This approach eliminates the assumption that frames are the natural atomic units for video autoregression. We evaluate VideoAR across diverse prediction strategies, finding that cube-based prediction consistently delivers superior quality, speed, and temporal coherence. By removing the frame-by-frame constraint, our video generator surpasses state-of-the-art baselines on VBench while achieving faster inference and enabling seamless scaling to minute-long sequences. We hope this work will motivate rethinking sequence decomposition in video and other spatiotemporal domains.
摘要：视频生成的自回归模型通常会逐帧运行，将下一步的预测从语言扩展到视频的时间维度。我们质疑与单词不同，如果框架是适当的预测单元，则在语言上普遍同意。为了解决这个问题，我们提出了视频，这是一个统一的框架，该框架支持一系列预测单元，包括完整框架，键尾框架，多尺度修补和时空立方体。在这些设计中，我们发现使用\ textIt {spatiotemporal}立方体作为预测单元的模型视频生成，该单元允许自回旋模型同时在空间和时间维度上运行。这种方法消除了以下假设：框架是自然的视频自动摄影单元。我们评估跨不同预测策略的视频量，发现基于立方体的预测始终提供卓越的质量，速度和时间连贯性。通过删除逐帧约束，我们的视频生成器在VBench上超过了最新的基线，同时实现了更快的推理并使无缝缩放缩放到长时间的序列。我们希望这项工作能够激发视频和其他时空域中的重新思考序列分解。

Title: Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

Authors: Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24099
Pdf URL: https://arxiv.org/pdf/2509.24099
Copy Paste: [[2509.24099]] Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow(https://arxiv.org/abs/2509.24099)
Keywords: generation
Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use a contrastive objective that further strengthens alignment with conditioning signals and introduce a synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
摘要：以各种方式来生成现实的，上下文感知的两人运动仍然是计算机图形，动画和人类计算机交互的核心挑战。我们介绍了Dualflow，这是一个多式联运两人运动的统一且有效的框架。双流条件条件3D运动合成在不同的输入上，包括文本，音乐和先前的运动序列。利用整流的流程，它可以在噪声和数据之间达到确定性的直线采样路径，减少推理时间并减轻基于扩散模型中常见的误差积累。为了增强语义接地，Dualflow采用了检索功能的生成（RAG）模块，该模块使用音乐功能和基于LLM的空间关系，身体运动和节奏模式来检索运动示例。我们使用对比目标，以进一步加强与条件信号的一致性并引入同步损失，从而改善了人际关系的协调。跨文本到动作，音乐到动作和多模式互动基准的广泛评估显示出运动质量，响应能力和效率的一致性。双流在时间上产生连贯和节奏的同步运动，在多模式人类运动产生中设定最新的动作。
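
Illustrative sketch (not from the paper): the abstract credits rectified flow with "deterministic straight-line sampling paths between noise and data". The toy below shows the generic idea under stated assumptions: noise sits at t = 0, data at t = 1, and a hypothetical oracle velocity stands in for the trained network. The names `interpolate` and `euler_sample` are invented for this example and say nothing about DualFlow's actual implementation.

```python
from typing import Callable, List

Vec = List[float]


def interpolate(x0: Vec, x1: Vec, t: float) -> Vec:
    """Point on the straight line from x0 (noise, t=0) to x1 (data, t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def euler_sample(x0: Vec, velocity: Callable[[Vec, float], Vec], steps: int) -> Vec:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 2.0]
data = [1.0, -1.0]


def oracle(x: Vec, t: float) -> Vec:
    """Stand-in for a trained network: the constant velocity data - noise."""
    return [b - a for a, b in zip(noise, data)]


# On a perfectly straight path, one Euler step lands where many steps do.
print(euler_sample(noise, oracle, steps=1))
print(euler_sample(noise, oracle, steps=4))
```

When the learned velocity field is close to constant along each path, few integration steps are needed, which is the usual argument for the reduced inference time mentioned in the abstract.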

Title: GANji: A Framework for Introductory AI Image Generation

Authors: Chandon Hamel, Mike Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24128
Pdf URL: https://arxiv.org/pdf/2509.24128
Copy Paste: [[2509.24128]] GANji: A Framework for Introductory AI Image Generation(https://arxiv.org/abs/2509.24128)
Keywords: generation, generative
Abstract: The comparative study of generative models often requires significant computational resources, creating a barrier for researchers and practitioners. This paper introduces GANji, a lightweight framework for benchmarking foundational AI image generation techniques using a dataset of 10,314 Japanese Kanji characters. It systematically compares the performance of a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN), and a Denoising Diffusion Probabilistic Model (DDPM). The results demonstrate that while the DDPM achieves the highest image fidelity, with a Fréchet Inception Distance (FID) score of 26.2, its sampling time is over 2,000 times slower than the other models. The GANji framework is an effective and accessible tool for revealing the fundamental trade-offs between model architecture, computational cost, and visual quality, making it ideal for both educational and research purposes.
摘要：生成模型的比较研究通常需要大量的计算资源，从而为研究人员和从业人员造成障碍。本文介绍了Ganji，这是一个轻巧的框架，用于使用10,314个日本汉字字符的数据集对基础AI图像生成技术进行基础测试。它系统地比较了变量自动编码器（VAE），生成对抗网络（GAN）的性能以及DeNo的扩散概率模型（DDPM）。结果表明，尽管DDPM达到了最高的图像保真度，而Fréchet成立距离（FID）得分为26.2，但其采样时间比其他型号慢2,000倍。 Ganji框架是一种有效且可访问的工具，用于揭示模型架构，计算成本和视觉质量之间的基本权衡，使其非常适合教育和研究目的。

Title: Asymmetric VAE for One-Step Video Super-Resolution Acceleration

Authors: Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24142
Pdf URL: https://arxiv.org/pdf/2509.24142
Copy Paste: [[2509.24142]] Asymmetric VAE for One-Step Video Super-Resolution Acceleration(https://arxiv.org/abs/2509.24142)
Keywords: super-resolution
Abstract: Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE's performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at this https URL.
摘要：扩散模型在现实世界视频超级分辨率领域具有显着优势，并且在过去的研究中表现出了强劲的表现。在最近的基于扩散的视频超分辨率（VSR）模型中，采样步骤的数量已减少到仅一个，但在推理效率方面仍有很大的余地。在本文中，我们提出了FASTVSR，该FASTVSR通过实施高压缩VAE（空间压缩率为16，表示为F16），从而实现了大量的计算成本降低。我们设计了F16 VAE的结构，并引入了稳定的训练框架。我们采用像素式的混音和通道复制来实现额外的上采样。此外，我们提出了一种较低的指导训练策略，该策略引入了更简单的训练目标，作为VAE性能的下限。它使培训过程更稳定，更容易收敛。实验结果表明，与现有的一步型相比，FASTVSR与多步型相比达到111.9倍，而3.92倍的速度达到111.9倍。我们将在此HTTPS URL上发布代码和模型。
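
Illustrative sketch (not from the paper): the abstract mentions pixel shuffle as one of the operations used for additional upsampling. The pure-Python version below follows the common sub-pixel layout, where an input of C·r² channels at H×W becomes C channels at (H·r)×(W·r). That FastVSR uses exactly this channel ordering is an assumption, and the function name `pixel_shuffle` is chosen for this example.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][column]


def pixel_shuffle(x: Image, r: int) -> Image:
    """Rearrange C*r*r channels of size HxW into C channels of size (H*r)x(W*r)."""
    in_c, h, w = len(x), len(x[0]), len(x[0][0])
    if r < 1 or in_c % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_c = in_c // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become a single 2x2 channel.
print(pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], r=2))
```

The operation only moves values between the channel and spatial axes, so it adds resolution without adding learned parameters of its own.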

Title: LatXGen: Towards Radiation-Free and Accurate Quantitative Analysis of Sagittal Spinal Alignment Via Cross-Modal Radiographic View Synthesis

Authors: Moxin Zhao, Nan Meng, Jason Pui Yin Cheung, Chris Yuk Kwan Tang, Chenxi Yu, Wenting Zhong, Pengyu Lu, Chang Shi, Yipeng Zhuang, Teng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24165
Pdf URL: https://arxiv.org/pdf/2509.24165
Copy Paste: [[2509.24165]] LatXGen: Towards Radiation-Free and Accurate Quantitative Analysis of Sagittal Spinal Alignment Via Cross-Modal Radiographic View Synthesis(https://arxiv.org/abs/2509.24165)
Keywords: generative
Abstract: Adolescent Idiopathic Scoliosis (AIS) is a complex three-dimensional spinal deformity, and accurate morphological assessment requires evaluating both coronal and sagittal alignment. While previous research has made significant progress in developing radiation-free methods for coronal plane assessment, reliable and accurate evaluation of sagittal alignment without ionizing radiation remains largely underexplored. To address this gap, we propose LatXGen, a novel generative framework that synthesizes realistic lateral spinal radiographs from posterior Red-Green-Blue and Depth (RGBD) images of unclothed backs. This enables accurate, radiation-free estimation of sagittal spinal alignment. LatXGen tackles two core challenges: (1) inferring sagittal spinal morphology changes from a lateral perspective based on posteroanterior surface geometry, and (2) performing cross-modality translation from RGBD input to the radiographic domain. The framework adopts a dual-stage architecture that progressively estimates lateral spinal structure and synthesizes corresponding radiographs. To enhance anatomical consistency, we introduce an attention-based Fast Fourier Convolution (FFC) module for integrating anatomical features from RGBD images and 3D landmarks, and a Spatial Deformation Network (SDN) to model morphological variations in the lateral view. Additionally, we construct the first large-scale paired dataset for this task, comprising 3,264 RGBD and lateral radiograph pairs. Experimental results demonstrate that LatXGen produces anatomically accurate radiographs and outperforms existing GAN-based methods in both visual fidelity and quantitative metrics. This study offers a promising, radiation-free solution for sagittal spine assessment and advances comprehensive AIS evaluation.
摘要：青春期特发性脊柱侧弯（AIS）是复杂的三维脊柱畸形，准确的形态评估需要评估冠状和矢状比对。尽管先前的研究在开发无辐射方法进行冠状平面评估方面取得了重大进展，但在没有电离辐射的情况下，可靠，准确地评估了矢状比对，仍然在很大程度上没有被忽略。为了解决这一差距，我们提出了LATXGEN，这是一种新颖的生成框架，从后部红绿色蓝色和深度（RGBD）图像中综合了逼真的侧面脊柱X光片。这使得对矢状脊柱对齐的准确，无辐射估计。 LATXGEN应对两个核心挑战：（1）推断矢状脊柱形态从基于后侧面表面几何形状的横向角度变化，并且（2）（2）执行从RGBD输入到X射线照相域的交叉模式翻译。该框架采用了双阶段体系结构，该体系结构逐渐估算侧面脊柱结构并合成相应的X光片。为了提高解剖学的一致性，我们引入了一个基于注意力的快速傅立叶卷积（FFC）模块，用于整合RGBD图像和3D地标的解剖特征，以及一个空间变形网络（SDN），以模拟横向视图中的形态变化。此外，我们为此任务构建了第一个大型配对数据集，其中包括3,264个RGBD和侧面X光片对。实验结果表明，LATXGEN在视觉保真度和定量指标上产生解剖上准确的X光片，并且胜过现有的基于GAN的方法。这项研究为矢状脊柱评估提供了一种有希望的无辐射解决方案，并进步了全面的AIS评估。

Title: Tumor Synthesis conditioned on Radiomics

Authors: Jonghun Kim, Inye Na, Eun Sook Ko, Hyunjin Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24182
Pdf URL: https://arxiv.org/pdf/2509.24182
Copy Paste: [[2509.24182]] Tumor Synthesis conditioned on Radiomics(https://arxiv.org/abs/2509.24182)
Keywords: generation, generative
Abstract: Due to privacy concerns, obtaining large datasets is challenging in medical image analysis, especially with 3D modalities like Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing generative models, developed to address this issue, often face limitations in output diversity and thus cannot accurately represent 3D medical images. We propose a tumor-generation model that utilizes radiomics features as generative conditions. Radiomics features are high-dimensional handcrafted semantic features that are biologically well-grounded and thus are good candidates for conditioning. Our model employs a GAN-based model to generate tumor masks and a diffusion-based approach to generate tumor texture conditioned on radiomics features. Our method allows the user to generate tumor images according to user-specified radiomics features such as size, shape, and texture at an arbitrary location. This enables the physicians to easily visualize tumor images to better understand tumors according to changing radiomics features. Our approach allows for the removal, manipulation, and repositioning of tumors, generating various tumor types in different scenarios. The model has been tested on tumors in four different organs (kidney, lung, breast, and brain) across CT and MRI. The synthesized images are shown to effectively aid in training for downstream tasks and their authenticity was also evaluated through expert evaluations. Our method has potential usage in treatment planning with diverse synthesized tumors.
摘要：由于隐私问题，在医学图像分析中获得大型数据集具有挑战性，尤其是在计算机断层扫描（CT）和磁共振成像（MRI）等3D模式中。为解决此问题而开发的现有生成模型通常面临输出多样性的限制，因此无法准确代表3D医疗图像。我们提出了一种利用放射线特征作为生成条件的肿瘤生成模型。放射线特征是高维手工的语义特征，具有生物学上良好的基础，因此是适应条件的良好候选者。我们的模型采用基于GAN的模型来生成肿瘤面膜，并采用基于扩散的方法来生成以放射线特征为条件的肿瘤纹理。我们的方法允许用户根据用户指定的放射线特征（例如，在任意位置的大小，形状和纹理）生成肿瘤图像。这使医生可以根据不断变化的放射线特征轻松地可视化肿瘤图像，从而更好地理解肿瘤。我们的方法可以清除，操纵和重新定位肿瘤，在不同情况下产生各种肿瘤类型。该模型已在CT和MRI的四个不同器官（肾脏，肺，乳房和脑）的肿瘤上进行了测试。综合图像显示可有效地帮助培训下游任务，并且还通过专家评估来评估其真实性。我们的方法具有不同合成肿瘤的治疗计划中的潜在用途。

Title: Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning

Authors: Jonghun Kim, Hyunjin Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24185
Pdf URL: https://arxiv.org/pdf/2509.24185
Copy Paste: [[2509.24185]] Simulating Post-Neoadjuvant Chemotherapy Breast Cancer MRI via Diffusion Model with Prompt Tuning(https://arxiv.org/abs/2509.24185)
Keywords: generative
Abstract: Neoadjuvant chemotherapy (NAC) is a common therapy option before the main surgery for breast cancer. Response to NAC is monitored using follow-up dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Accurate prediction of NAC response helps with treatment planning. Here, we adopt maximum intensity projection images from DCE-MRI to generate post-treatment images (i.e., 3 or 12 weeks after NAC) from pre-treatment images leveraging the emerging diffusion model. We introduce prompt tuning to account for the known clinical factors affecting response to NAC. Our model performed better than other generative models in image quality metrics. Compared to other models, our model was also better at generating images that reflected changes in tumor size according to pathologic complete response (pCR). An ablation study confirmed the design choices of our method. Our study has the potential to help with precision medicine.
摘要：新辅助化学疗法（NAC）是乳腺癌主要手术前的常见治疗选择。使用后续动态对比增强磁共振成像（DCE-MRI）对NAC的响应进行监测。 NAC反应的准确预测有助于治疗计划。在这里，我们采用来自DCE-MRI的最大强度投影图像，从利用新兴扩散模型的前处理图像中生成处理后图像（即NAC后3或12周）。我们介绍及时调整以说明影响对NAC反应的已知临床因素。我们的模型在图像质量指标中的表现比其他生成模型更好。与其他模型相比，我们的模型更擅长生成反映肿瘤大小变化的图像。消融研究证实了我们方法的设计选择。我们的研究有可能帮助精确医学。

Title: An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation

Authors: Zach Eidex, Mojtaba Safari, Jie Ding, Richard Qiu, Justin Roper, David Yu, Hui-Kuo Shu, Zhen Tian, Hui Mao, Xiaofeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24194
Pdf URL: https://arxiv.org/pdf/2509.24194
Copy Paste: [[2509.24194]] An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation(https://arxiv.org/abs/2509.24194)
Keywords: generation
Abstract: Objective: Gadolinium-based contrast agents (GBCAs) are commonly employed with T1w MRI to enhance lesion visualization but are restricted in patients at risk of nephrogenic systemic fibrosis, and variations in GBCA administration can introduce imaging inconsistencies. This study develops an efficient 3D deep-learning framework to generate T1-contrast enhanced images (T1C) from pre-contrast multiparametric MRI. Approach: We propose the 3D latent rectified flow (T1C-RFlow) model for generating high-quality T1C images. First, T1w and T2-FLAIR images are input into a pretrained autoencoder to acquire an efficient latent space representation. A rectified flow diffusion model is then trained in this latent space representation. The T1C-RFlow model was trained on a curated dataset comprising the BraTS 2024 glioma (GLI; 1480 patients), meningioma (MEN; 1141 patients), and metastases (MET; 1475 patients) datasets. Selected patients were split into train (N=2860), validation (N=612), and test (N=614) sets. Results: Both qualitative and quantitative results demonstrate that the T1C-RFlow model outperforms benchmark 3D models (pix2pix, DDPM, Diffusion Transformers (DiT-3D)) trained in the same latent space. T1C-RFlow achieved the following metrics - GLI: NMSE 0.044 +/- 0.047, SSIM 0.935 +/- 0.025; MEN: NMSE 0.046 +/- 0.029, SSIM 0.937 +/- 0.021; MET: NMSE 0.098 +/- 0.088, SSIM 0.905 +/- 0.082. T1C-RFlow had the best tumor reconstruction performance and significantly faster denoising times (6.9 s/volume, 200 steps) than conventional DDPM models, both in latent space (37.7 s, 1000 steps) and patch-based in image space (4.3 hr/volume). Significance: Our proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models. Further development may permit a practical method for contrast-agent-free MRI for brain tumors.
摘要：目的：基于Gadolinium的对比剂（GBCA）通常使用T1W MRI使用来增强病变可视化，但受到肾病风险的患者受到限制，GBCA给药的变化可能会引入成像不一致。这项研究开发了一个有效的3D深度学习框架，以从前对比度多参数MRI产生T1对比度增强图像（T1C）。方法：我们提出了用于生成高质量T1C图像的3D潜在整流流（T1C-RFLOW）模型。首先，将T1W和T2-FLAIR图像输入到验证的自动编码器中，以获取有效的潜在空间表示。然后，在此潜在空间表示中训练了整流的流扩散模型。 T1C-RFLOW模型在由Brats 2024 Glioma（GLI； 1480名患者），脑膜瘤（男性； 1141例患者）和转移酶（MET; MET; 1475例患者）数据集成的策划数据集上进行了训练。将选定的患者分为训练（n = 2860），验证（n = 612）和测试（n = 614）集。结果：定性和定量结果都表明，T1C-RFLOW模型的表现优于基准3D模型（PIX2PIX，DDPM，扩散变压器（DIT-3D））在同一潜在空间中训练有素。 T1C-RFlow达到了以下指标-GLI：NMSE 0.044 +/- 0.047，SSIM 0.935 +/- 0.025;男性：NMSE 0.046 +/- 0.029，SSIM 0.937 +/- 0.021; MET：NMSE 0.098 +/- 0.088，SSIM 0.905 +/- 0.082。 T1C-RFlow具有最佳的肿瘤重建性能，并且比潜在空间（37.7 s，1000步）和基于图像空间的贴片（4.3小时/卷）中的常规DDPM模型（4.3小时/卷）中的常规DDPM模型明显更快（6.9 s/soumel，200步）。意义：我们提出的方法生成的合成T1C图像在比以前的扩散模型少的时间内与地面真相T1C非常相似。进一步的开发可能允许用于用于脑肿瘤的无对比度MRI的实用方法。

Title: UniVid: The Open-Source Unified Video Model

Authors: Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24200
Pdf URL: https://arxiv.org/pdf/2509.24200
Copy Paste: [[2509.24200]] UniVid: The Open-Source Unified Video Model(https://arxiv.org/abs/2509.24200)
Keywords: generation
Abstract: Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.

Title: Semantic Editing with Coupled Stochastic Differential Equations

Authors: Jianxin Zhang, Clayton Scott
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2509.24223
Pdf URL: https://arxiv.org/pdf/2509.24223
Copy Paste: [[2509.24223]] Semantic Editing with Coupled Stochastic Differential Equations(https://arxiv.org/abs/2509.24223)
Keywords: generative
Abstract: Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out of the box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI.
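Illustrative sketch (not the authors' code): the idea of driving two sampling trajectories with the same noise can be seen on a toy one-dimensional SDE. The linear drift toward mu_src / mu_edit below stands in for "source prompt" and "edit prompt" and is purely an assumption for illustration, as are the function names, the fully shared (rather than merely correlated) noise, and all parameter values. With shared noise the gap between the two trajectories depends only on the drifts; with independent noise the gap is itself random.

```python
import random


def simulate_pair(mu_src, mu_edit, shared_noise, seed, steps=200, dt=0.01, sigma=1.0):
    """Euler-Maruyama integration of two toy SDEs dx = -(x - mu) dt + sigma dW.

    Hypothetical stand-in for source/edited sampling; mu_src and mu_edit play
    the role of the two conditioning signals.
    """
    rng = random.Random(seed)
    x = y = rng.gauss(0.0, 1.0)  # both trajectories start from the same point
    for _ in range(steps):
        dw_x = rng.gauss(0.0, dt ** 0.5)
        # Coupling: reuse the same Brownian increment for the second trajectory.
        dw_y = dw_x if shared_noise else rng.gauss(0.0, dt ** 0.5)
        x += -(x - mu_src) * dt + sigma * dw_x
        y += -(y - mu_edit) * dt + sigma * dw_y
    return x, y


def gap_spread(shared_noise, runs=200):
    """Standard deviation of the final gap (y - x) across independent runs."""
    gaps = []
    for seed in range(runs):
        x, y = simulate_pair(0.0, 1.0, shared_noise, seed)
        gaps.append(y - x)
    mean = sum(gaps) / len(gaps)
    return (sum((g - mean) ** 2 for g in gaps) / len(gaps)) ** 0.5


if __name__ == "__main__":
    print("gap spread, shared noise:     ", gap_spread(True))
    print("gap spread, independent noise:", gap_spread(False))
```

In this toy setting the shared-noise gap is essentially deterministic, which is the intuition behind preserving similarity to the source while the drift moves the sample toward the new target.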

Title: FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation

Authors: Seungwook Kim, Seunghyeon Lee, Minsu Cho
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.24241
Pdf URL: https://arxiv.org/pdf/2509.24241
Copy Paste: [[2509.24241]] FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation(https://arxiv.org/abs/2509.24241)
Keywords: generation
Abstract: Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
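Illustrative sketch (not the authors' code): the abstract states only that guidance strength is modulated in proportion to action magnitude. The clipped linear schedule below, the names action_scaled_weight and guided_prediction, and the values w_min, w_max and ref_norm are all assumptions chosen for illustration. The combination step itself is the standard classifier-free guidance formula.

```python
import math


def action_scaled_weight(action, w_min=1.0, w_max=7.5, ref_norm=1.0):
    """Map the L2 norm of an action vector to a guidance weight.

    Hypothetical schedule: linear in the action norm, clipped at ref_norm.
    """
    norm = math.sqrt(sum(a * a for a in action))
    frac = min(norm / ref_norm, 1.0)
    return w_min + (w_max - w_min) * frac


def guided_prediction(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    small_action = [0.05, 0.0, 0.02]
    large_action = [0.6, -0.5, 0.4]
    for action in (small_action, large_action):
        w = action_scaled_weight(action)
        print(action, "->", round(w, 3), guided_prediction([0.0, 0.1], [1.0, -0.3], w))
```

Larger actions receive a larger weight, so the conditional prediction is pushed further from the unconditional one, which is one simple way to tie motion intensity to guidance strength.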

Title: Latent Visual Reasoning

Authors: Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.24251
Pdf URL: https://arxiv.org/pdf/2509.24251
Copy Paste: [[2509.24251]] Latent Visual Reasoning(https://arxiv.org/abs/2509.24251)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing, thereby enhancing the visual signal along the reasoning trajectories. Nevertheless, these approaches remain fundamentally constrained: reasoning is still confined to the language space, with visual information treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a new paradigm that enables autoregressive reasoning directly in the visual embedding space. A visual encoder first projects images into visual tokens within a joint semantic space shared with the language model. The language model is then trained to generate latent states that reconstruct key visual tokens critical for answering the query, constituting the process of latent visual reasoning. By interleaving LVR with standard text generation, our model achieves substantial gains on perception-intensive visual question answering tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement learning on latent reasoning, further balancing LVR and textual generation. We show that LVR substantially improves fine-grained visual understanding and perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code base and model weights will be released later.

Title: Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization

Authors: Yunhao Liang, Pujun Zhang, Yuan Qu, Shaochong Lin, Zuo-jun Max Shen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24256
Pdf URL: https://arxiv.org/pdf/2509.24256
Copy Paste: [[2509.24256]] Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization(https://arxiv.org/abs/2509.24256)
Keywords: generative
Abstract: The pretrain-transfer paradigm, which underpins the success of large language models (LLMs), has demonstrated the immense power of creating foundation models that learn generalizable representations from vast datasets. However, extending this paradigm to Operations Research (OR) problems on graph structures remains challenging due to the fundamental conflict between the statistical flexibility of language and the strict combinatorial constraints of graphs. To bridge this gap, we introduce the Graph Foundation Model (GFM), the first framework capable of solving all distance-based optimization problems on graph structures. By introducing the LLM-like self-supervised pre-training paradigm on the paths generated from random walks in the graph, GFM is compelled to internalize the graph's complex topological and combinatorial rules, where the connectivity of the structure itself can be treated as the supervisory signal. Unlike existing neural methods that learn complex and task-specific solving policies, our approach leverages the pre-trained GFM as a foundational model of the graph's intrinsic structure, which in turn enables a simple generative heuristic to tackle a diverse range of optimization challenges effectively. Comprehensive experiments on networks ranging from 20 to 893 nodes demonstrate that GFM achieves competitive performance against specialized solvers across a variety of distinct optimization task classes, while maintaining significantly faster inference times. Our work establishes a new paradigm of adapting the pretrain-transfer framework to graph optimization, opening the door for applying foundation model innovations to OR.

Title: Cycle Diffusion Model for Counterfactual Image Generation

Authors: Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24267
Pdf URL: https://arxiv.org/pdf/2509.24267
Copy Paste: [[2509.24267]] Cycle Diffusion Model for Counterfactual Image Generation(https://arxiv.org/abs/2509.24267)
Keywords: generation, generative
Abstract: Deep generative models have demonstrated remarkable success in medical image synthesis. However, ensuring conditioning faithfulness and high-quality synthetic images for direct or counterfactual generation remains a challenge. In this work, we introduce a cycle training framework to fine-tune diffusion models for improved conditioning adherence and enhanced synthetic image realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency between generated and original images by incorporating cycle constraints, enabling more reliable direct and counterfactual generation. Experiments on a combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and PPMI) show that our method improves conditioning accuracy and enhances image quality as measured by FID and SSIM. The results suggest that the cycle strategy used in CDM can be an effective method for refining diffusion-based medical image generation, with applications in data augmentation, counterfactual, and disease progression modeling.

Title: SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation

Authors: Hanqi Chen, Zhongyin Zhao, Ye Chen, Zhujin Liang, Bingbing Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24299
Pdf URL: https://arxiv.org/pdf/2509.24299
Copy Paste: [[2509.24299]] SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation(https://arxiv.org/abs/2509.24299)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) is a code-based representation for 2D visuals. Leveraging recent advances in large language models (LLMs), we study text-to-SVG generation and address two persistent gaps: weak generalization and poor adherence to input instructions. We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process and supports the full set of SVG primitives. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code; we then build stepwise updates that mirror the incremental addition of primitives. On this data, we train an LLM with supervised fine-tuning that exposes its chain-of-thought as intermediate reasoning, improving robustness and reducing errors and hallucinations. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs while preserving the structural advantages of vector graphics. Unlike image-based methods, our outputs enable precise and hierarchical editing, opening new directions for design, content creation, and automated graphics generation.

Title: Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Authors: Guolin Ke, Hui Xue
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24335
Pdf URL: https://arxiv.org/pdf/2509.24335
Copy Paste: [[2509.24335]] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation(https://arxiv.org/abs/2509.24335)
Keywords: generation
Abstract: Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
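Illustrative sketch (not the authors' code): the constraint described in the abstract, keeping every token on a fixed-radius hypersphere including after CFG, amounts to rescaling a vector to a constant L2 norm. The function names, the radius value and the eps guard are assumptions for illustration; the guidance mix is the standard classifier-free guidance formula.

```python
import math


def project_to_sphere(vec, radius=1.0, eps=1e-12):
    """Rescale vec so that its L2 norm equals radius (direction is kept)."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = radius / max(norm, eps)  # eps guards against division by zero
    return [v * scale for v in vec]


def cfg_then_project(uncond, cond, weight, radius=1.0):
    """Apply classifier-free guidance, then re-project onto the hypersphere.

    Guidance extrapolates away from the sphere, so the norm of the mixed
    vector generally changes; the projection removes that scale component.
    """
    mixed = [u + weight * (c - u) for u, c in zip(uncond, cond)]
    return project_to_sphere(mixed, radius)


if __name__ == "__main__":
    token = cfg_then_project([0.1, 0.2, 0.3], [0.5, -0.4, 0.9], weight=3.0, radius=2.0)
    print(token, math.sqrt(sum(v * v for v in token)))
```

Only the direction of the guided vector survives the projection, which matches the abstract's point that the scale component is removed before the next decoding step.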

Title: Expanding Horizons of Level Diversity via Multi-objective Evolutionary Learning

Authors: Qingquan Zhang, Ziqi Wang, Yuchen Li, Keyuan Zhang, Bo Yuan, Jialin Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24341
Pdf URL: https://arxiv.org/pdf/2509.24341
Copy Paste: [[2509.24341]] Expanding Horizons of Level Diversity via Multi-objective Evolutionary Learning(https://arxiv.org/abs/2509.24341)
Keywords: generation, generative
Abstract: In recent years, the generation of diverse game levels has gained increasing interest, contributing to a richer and more engaging gaming experience. A number of level diversity metrics have been proposed in the literature; these are naturally multi-dimensional, so the relationships among the dimensions may be conflicting, complementary, or both. However, existing level generation approaches often fail to comprehensively assess diversity across those dimensions. This paper aims to expand horizons of level diversity by considering multi-dimensional diversity when training generative models. We formulate the model training as a multi-objective learning problem, where each diversity metric is treated as a distinct objective. Furthermore, a multi-objective evolutionary learning framework that optimises multiple diversity metrics simultaneously throughout the model training process is proposed. Our case study on the commonly used benchmark Super Mario Bros. demonstrates that our proposed framework can enhance multi-dimensional diversity and identify a Pareto front of generative models, which provides a range of tradeoffs among playability and two representative diversity metrics, including a content-based one and a player-centered one. Such capability enables decision-makers to make informed choices when selecting generators accommodating a variety of scenarios and the diverse needs of players and designers.
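Illustrative sketch (not the authors' code): a Pareto front over candidate generators can be extracted with a plain non-dominated filter. The score tuples below (playability, content diversity, player-centered diversity) are made-up numbers for illustration, all objectives are assumed to be maximized, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (playability, content diversity, player diversity) scores.
    candidates = [
        (0.9, 0.1, 0.2),
        (0.5, 0.5, 0.5),
        (0.4, 0.4, 0.4),  # dominated by (0.5, 0.5, 0.5)
        (0.1, 0.9, 0.3),
    ]
    for point in pareto_front(candidates):
        print(point)
```

Every point that survives the filter represents a different tradeoff, which is what lets a designer pick a generator according to the scenario at hand.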

Title: NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

Authors: Yixuan Ren, Hanyu Wang, Hao Chen, Bo He, Abhinav Shrivastava
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24353
Pdf URL: https://arxiv.org/pdf/2509.24353
Copy Paste: [[2509.24353]] NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis(https://arxiv.org/abs/2509.24353)
Keywords: generation
Abstract: We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos via generating neural network weights. The generated weights can be rearranged as the parameters of a convolutional neural network, which forms an implicit neural representation (INR), and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) A hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as INR weights to decode. 2) An implicit diffusion transformer that denoises on the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis via obviating temporal cross-frame attentions in the denoiser and decoding video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers, as well as reform its weight assignment, upsampling connection and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to the most recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also brings a smooth INR weight space that facilitates seamless interpolations between frames or videos.

Title: UI-UG: A Unified MLLM for UI Understanding and Generation

Authors: Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.24361
Pdf URL: https://arxiv.org/pdf/2509.24361
Copy Paste: [[2509.24361]] UI-UG: A Unified MLLM for UI Understanding and Generation(https://arxiv.org/abs/2509.24361)
Keywords: generation
Abstract: Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks.

Title: Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models

Authors: Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24365
Pdf URL: https://arxiv.org/pdf/2509.24365
Copy Paste: [[2509.24365]] Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models(https://arxiv.org/abs/2509.24365)
Keywords: generation
Abstract: Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at this https URL

Title: Watermarking Diffusion Language Models

Authors: Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2509.24368
Pdf URL: https://arxiv.org/pdf/2509.24368
Copy Paste: [[2509.24368]] Watermarking Diffusion Language Models(https://arxiv.org/abs/2509.24368)
Keywords: generation
Abstract: We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.

Title: From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis

Authors: Khawlah Bajbaa, Abbas Anwar, Muhammad Saqib, Hafeez Anwar, Nabin Sharma, Muhammad Usman
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2509.24369
Pdf URL: https://arxiv.org/pdf/2509.24369
Copy Paste: [[2509.24369]] From Satellite to Street: A Hybrid Framework Integrating Stable Diffusion and PanoGAN for Consistent Cross-View Synthesis(https://arxiv.org/abs/2509.24369)
Keywords: generation, generative
Abstract: Street view imagery has become an essential source for geospatial data collection and urban analytics, enabling the extraction of valuable insights that support informed decision-making. However, synthesizing street-view images from corresponding satellite imagery presents significant challenges due to substantial differences in appearance and viewing perspective between these two domains. This paper presents a hybrid framework that integrates diffusion-based models and conditional generative adversarial networks to generate geographically consistent street-view images from satellite imagery. Our approach uses a multi-stage training strategy that incorporates Stable Diffusion as the core component within a dual-branch architecture. To enhance the framework's capabilities, we integrate a conditional Generative Adversarial Network (GAN) that enables the generation of geographically consistent panoramic street views. Furthermore, we implement a fusion strategy that leverages the strengths of both models to create robust representations, thereby improving the geometric consistency and visual quality of the generated street-view images. The proposed framework is evaluated on the challenging Cross-View USA (CVUSA) dataset, a standard benchmark for cross-view image synthesis. Experimental results demonstrate that our hybrid approach outperforms diffusion-only methods across multiple evaluation metrics and achieves competitive performance compared to state-of-the-art GAN-based methods. The framework successfully generates realistic and geometrically consistent street-view images while preserving fine-grained local details, including street markings, secondary roads, and atmospheric elements such as clouds.

Title: Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping

Authors: Hao Chen, Fang Xu, Tamer Saleh, Weifeng Hao, Gui-Song Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24374
Pdf URL: https://arxiv.org/pdf/2509.24374
Copy Paste: [[2509.24374]] Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping(https://arxiv.org/abs/2509.24374)
Keywords: generation
Abstract: Recent advances in remote sensing technology have made submeter resolution imagery increasingly accessible, offering remarkable detail for fine-grained land cover analysis. However, its full potential remains underutilized, particularly for large-scale land cover mapping, due to the lack of sufficient, high-quality annotated datasets. Existing labels are typically derived from pre-existing products or manual annotation, which are often unreliable or prohibitively expensive, particularly given the rich visual detail and massive data volumes of submeter imagery. Inspired by the spatial autocorrelation principle, which suggests that objects of the same class tend to co-occur with similar visual features in local neighborhoods, we propose the Mask Clustering-based Annotation Engine (MCAE), which treats semantically consistent mask groups as the minimal annotating units to enable efficient, simultaneous annotation of multiple instances. It significantly improves annotation efficiency by one to two orders of magnitude, while preserving label quality, semantic diversity, and spatial representativeness. With MCAE, we build a high-quality annotated dataset of about 14 billion labeled pixels, referred to as HiCity-LC, which supports the generation of city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. It is the first publicly available submeter resolution city-level land cover benchmark, highlighting the scalability and practical utility of MCAE for large-scale, submeter resolution mapping. The dataset is available at this https URL

Title: RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis

Authors: Seungwook Kim, Yichun Shi, Kejie Li, Minsu Cho, Peng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24410
Pdf URL: https://arxiv.org/pdf/2509.24410
Copy Paste: [[2509.24410]] RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis(https://arxiv.org/abs/2509.24410)
Keywords: generative
Abstract: Generating synthetic multi-view images from a text prompt is an essential bridge to generating synthetic 3D assets. In this work, we introduce RapidMV, a novel text-to-multi-view generative model that can produce 32 multi-view synthetic images in just around 5 seconds. In essence, we propose a novel spatio-angular latent space, encoding both the spatial appearance and angular viewpoint deviations into a single latent for improved efficiency and multi-view consistency. We achieve effective training of RapidMV by strategically decomposing our training process into multiple steps. We demonstrate that RapidMV outperforms existing methods in terms of consistency and latency, with competitive quality and text-image alignment.

Title: CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Authors: Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24416
Pdf URL: https://arxiv.org/pdf/2509.24416
Copy Paste: [[2509.24416]] CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers(https://arxiv.org/abs/2509.24416)
Keywords: generation
Abstract: Visual generation quality has improved greatly with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these same factors also hinder the practical deployment of DiTs on edge devices, limiting their development and application. As an efficient model compression technique, post-training quantization (PTQ) can reduce memory consumption and speed up inference, at the cost of inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. Specifically, CLQ consists of three key designs. First, we observe that the calibration data used by most PTQ methods cannot faithfully represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages a block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model to W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at this https URL.
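Note: the idea behind Hadamard-based outlier smoothing can be illustrated with a small sketch. It assumes a single full Hadamard matrix built with the Sylvester construction rather than the block variant the abstract mentions, and the activation and weight values are made up for illustration. Rotating both the activations and the weights by the same orthonormal matrix leaves their dot product unchanged while spreading a single outlier channel across all channels.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def rotate(vec, h):
    """Multiply a row vector by the orthonormal matrix H / sqrt(n)."""
    n = len(vec)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(vec[i] * h[i][j] for i in range(n)) for j in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

h8 = hadamard(8)
activation = [100.0, 1.0, -2.0, 0.5, 1.5, -1.0, 0.0, 2.0]  # channel 0 is an outlier (toy data)
weight = [0.1, -0.3, 0.2, 0.05, -0.1, 0.4, 0.0, -0.2]      # toy weights

act_rot = rotate(activation, h8)
w_rot = rotate(weight, h8)

peak_before = max(abs(v) for v in activation)  # dominated by the outlier channel
peak_after = max(abs(v) for v in act_rot)      # outlier energy is spread over 8 channels
```

A smaller peak magnitude means a low-bit quantizer wastes less of its range on one channel, which is the motivation for smoothing before quantization.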

Title: A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models

Authors: Pei-Han Chen, Szu-Chi Chung
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2509.24420
Pdf URL: https://arxiv.org/pdf/2509.24420
Copy Paste: [[2509.24420]] A Data-Centric Perspective on the Influence of Image Data Quality in Machine Learning Models(https://arxiv.org/abs/2509.24420)
Keywords: quality assessment
Abstract: In machine learning, research has traditionally focused on model development, with relatively less attention paid to training data. As model architectures have matured and marginal gains from further refinements diminish, data quality has emerged as a critical factor. However, systematic studies on evaluating and ensuring dataset quality in the image domain remain limited. This study investigates methods for systematically assessing image dataset quality and examines how various image quality factors influence model performance. Using the publicly available and relatively clean CIFAKE dataset, we identify common quality issues and quantify their impact on training. Building on these findings, we develop a pipeline that integrates two community-developed tools, CleanVision and Fastdup. We analyze their underlying mechanisms and introduce several enhancements, including automatic threshold selection to detect problematic images without manual tuning. Experimental results demonstrate that not all quality issues exert the same level of impact. While convolutional neural networks show resilience to certain distortions, they are particularly vulnerable to degradations that obscure critical visual features, such as blurring and severe downscaling. To assess the performance of existing tools and the effectiveness of our proposed enhancements, we formulate the detection of low-quality images as a binary classification task and use the F1 score as the evaluation metric. Our automatic thresholding method improves the F1 score from 0.6794 to 0.9468 under single perturbations and from 0.7447 to 0.8557 under dual perturbations. For near-duplicate detection, our deduplication strategy increases the F1 score from 0.4576 to 0.7928. These results underscore the effectiveness of our workflow and provide a foundation for advancing data quality assessment in image-based machine learning.
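Note: the abstract scores low-quality image detection as a binary classification task using F1. A minimal sketch of that metric follows, assuming the label 1 marks a low-quality image; the function name and the toy labels are illustrative and do not come from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks a low-quality image (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels, made up for illustration
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)
```

Because F1 is the harmonic mean of precision and recall, a detection threshold that flags too many or too few images is penalized either way, which makes it a natural target for automatic threshold selection.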

Title: UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24427
Pdf URL: https://arxiv.org/pdf/2509.24427
Copy Paste: [[2509.24427]] UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark(https://arxiv.org/abs/2509.24427)
Keywords: generation, generative
Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

Title: NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding

Authors: Yanpeng Zhao, Shanyan Guan, Yunbo Wang, Yanhao Ge, Wei Li, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24441
Pdf URL: https://arxiv.org/pdf/2509.24441
Copy Paste: [[2509.24441]] NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding(https://arxiv.org/abs/2509.24441)
Keywords: generation
Abstract: We introduce NeoWorld, a deep learning framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments where only the regions actively explored by the user are rendered with high visual realism through object-centric 3D representations. Unlike previous approaches that rely on global world generation or 2D hallucination, NeoWorld models key foreground objects in full 3D, while synthesizing backgrounds and non-interacted regions in 2D to ensure efficiency. This hybrid scene structure, implemented with cutting-edge representation learning and object-to-3D techniques, enables flexible viewpoint manipulation and physically plausible scene animation, allowing users to control object appearance and dynamics using natural language commands. As users interact with the environment, the virtual world progressively unfolds with increasing 3D detail, delivering a dynamic, immersive, and visually coherent exploration experience. NeoWorld significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark.

Title: Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA

Authors: Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.24445
Pdf URL: https://arxiv.org/pdf/2509.24445
Copy Paste: [[2509.24445]] Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA(https://arxiv.org/abs/2509.24445)
Keywords: generative
Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.

Title: LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation

Authors: Heechang Kim, Gwanghyun Kim, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24469
Pdf URL: https://arxiv.org/pdf/2509.24469
Copy Paste: [[2509.24469]] LaMoGen: Laban Movement-Guided Diffusion for Text-to-Motion Generation(https://arxiv.org/abs/2509.24469)
Keywords: generation
Abstract: Diverse human motion generation is an increasingly important task, having various applications in computer vision, human-computer interaction and animation. While text-to-motion synthesis using diffusion models has shown success in generating high-quality motions, achieving fine-grained expressive motion control remains a significant challenge. This is due to the lack of motion style diversity in datasets and the difficulty of expressing quantitative characteristics in natural language. Laban movement analysis has been widely used by dance experts to express the details of motion, including motion quality, as consistently as possible. Inspired by that, this work aims for interpretable and expressive control of human motion generation by seamlessly integrating the quantification methods of Laban Effort and Shape components into the text-guided motion generation models. Our proposed zero-shot, inference-time optimization method guides the motion generation model to have desired Laban Effort and Shape components without any additional motion data by updating the text embedding of pretrained diffusion models during the sampling step. We demonstrate that our approach yields diverse expressive motion qualities while preserving motion identity by successfully manipulating motion attributes according to target Laban tags.

Title: CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models

Authors: Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24526
Pdf URL: https://arxiv.org/pdf/2509.24526
Copy Paste: [[2509.24526]] CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models(https://arxiv.org/abs/2509.24526)
Keywords: generation
Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.

Title: CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

Authors: Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24528
Pdf URL: https://arxiv.org/pdf/2509.24528
Copy Paste: [[2509.24528]] CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D(https://arxiv.org/abs/2509.24528)
Keywords: generation
Abstract: 3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.

Title: Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

Authors: Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24531
Pdf URL: https://arxiv.org/pdf/2509.24531
Copy Paste: [[2509.24531]] Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis(https://arxiv.org/abs/2509.24531)
Keywords: super-resolution
Abstract: Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at this https URL.
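Note: the interpolation coefficients $t$ and $1-t$ mentioned in the abstract refer to the straight-line path commonly used in Flow Matching. A minimal sketch follows, assuming $x_0$ is the source sample and $x_1$ the target sample (conventions differ between papers); the function names and toy values are illustrative only.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between source x0 and target x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Constant velocity of that straight path, the usual regression target."""
    return [b - a for a, b in zip(x0, x1)]

x0 = [0.0, 2.0]   # sample from the source distribution (toy values)
x1 = [4.0, -2.0]  # paired sample from the target distribution (toy values)
x_mid = interpolate(x0, x1, 0.25)
v = velocity_target(x0, x1)
```

With these fixed coefficients the path depends only on how source and target samples are paired, which is the property the abstract examines from the Optimal Transport perspective when training data is scarce.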

Title: Training-Free Multimodal Guidance for Video to Audio Generation

Authors: Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2509.24550
Pdf URL: https://arxiv.org/pdf/2509.24550
Copy Paste: [[2509.24550]] Training-Free Multimodal Guidance for Video to Audio Generation(https://arxiv.org/abs/2509.24550)
Keywords: generation
Abstract: Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite their excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.
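Note: the abstract does not say how the volume spanned by the modality embeddings is computed. One standard way, assumed here, is the square root of the determinant of the Gram matrix of the unit-normalized embeddings. The sketch below uses three toy vectors standing in for video, audio, and text embeddings; all names and values are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gram_volume(a, b, c):
    """Volume of the parallelepiped spanned by three unit-normalized vectors,
    computed as sqrt(det(G)) where G is their 3x3 Gram matrix."""
    vecs = [normalize(v) for v in (a, b, c)]
    g = [[sum(x * y for x, y in zip(u, w)) for w in vecs] for u in vecs]
    det = (g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
           - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
           + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0]))
    return math.sqrt(max(det, 0.0))  # clamp tiny negative values from rounding

video = [1.0, 0.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0, 0.0]
text = [0.0, 0.0, 1.0, 0.0]
vol_orthogonal = gram_volume(video, audio, text)  # mutually orthogonal embeddings
vol_aligned = gram_volume(video, video, text)     # two identical embeddings
```

Under this measure a smaller volume indicates that the three embeddings point in more similar directions, so a guidance signal could push sampling toward lower volume. Unlike pairwise similarities, the volume is a single quantity over all three modalities at once.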

Title: NeMo: Needle in a Montage for Video-Language Understanding

Authors: Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.24563
Pdf URL: https://arxiv.org/pdf/2509.24563
Copy Paste: [[2509.24563]] NeMo: Needle in a Montage for Video-Language Understanding(https://arxiv.org/abs/2509.24563)
Keywords: generation
Abstract: Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: this https URL.

Title: SAIP: A Plug-and-Play Scale-adaptive Module in Diffusion-based Inverse Problems

Authors: Lingyu Wang, Xiangming Meng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.24580
Pdf URL: https://arxiv.org/pdf/2509.24580
Copy Paste: [[2509.24580]] SAIP: A Plug-and-Play Scale-adaptive Module in Diffusion-based Inverse Problems(https://arxiv.org/abs/2509.24580)
Keywords: restoration
Abstract: Solving inverse problems with diffusion models has shown promise in tasks such as image restoration. A common approach is to formulate the problem in a Bayesian framework and sample from the posterior by combining the prior score with the likelihood score. Since the likelihood term is often intractable, estimators like DPS, DMPS, and $\pi$GDM are widely adopted. However, these methods rely on a fixed, manually tuned scale to balance prior and likelihood contributions. Such a static design is suboptimal, as the ideal balance varies across timesteps and tasks, limiting performance and generalization. To address this issue, we propose SAIP, a plug-and-play module that adaptively refines the scale at each timestep without retraining or altering the diffusion backbone. SAIP integrates seamlessly into existing samplers and consistently improves reconstruction quality across diverse image restoration tasks, including challenging scenarios.
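Note: the combination of prior score and likelihood score mentioned in the abstract follows from Bayes' rule. In the sketch below, the first line is the exact identity and the second shows where a scale enters in practice. The symbol $\zeta_t$ is notation assumed here, not taken from the paper: fixed-scale samplers would use a constant $\zeta$, whereas an adaptive scheme would let it vary with the timestep $t$.

```latex
\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)

\nabla_{x_t} \log p(x_t \mid y) \approx \nabla_{x_t} \log p(x_t) + \zeta_t \, \nabla_{x_t} \log p(y \mid x_t)
```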

Title: FreeRet: MLLMs as Training-Free Retrievers

Authors: Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24621
Pdf URL: https://arxiv.org/pdf/2509.24621
Copy Paste: [[2509.24621]] FreeRet: MLLMs as Training-Free Retrievers(https://arxiv.org/abs/2509.24621)
Keywords: generation, generative
Abstract: Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
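Note: the two-stage pattern described in the abstract (fast embedding search, then precise reranking) can be sketched generically. This is not FreeRet's implementation: the embeddings are toy vectors, and the reranker is a hypothetical stand-in where a stored relevance score plays the role of the MLLM's judgment.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, rerank_score, k=2):
    """Stage 1: keep the k candidates closest to the query in embedding space.
    Stage 2: reorder those k with a slower, more precise scorer."""
    shortlist = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)[:k]
    return sorted(shortlist, key=rerank_score, reverse=True)

corpus = [
    {"id": "a", "vec": [1.0, 0.0], "relevance": 0.2},
    {"id": "b", "vec": [0.9, 0.1], "relevance": 0.9},
    {"id": "c", "vec": [0.0, 1.0], "relevance": 1.0},
]
# Hypothetical reranker: a stored score stands in for the MLLM's judgment.
ranked = retrieve([1.0, 0.0], corpus, rerank_score=lambda item: item["relevance"])
ranked_ids = [item["id"] for item in ranked]
```

Item "c" never reaches the reranker even though its stored relevance is highest, which shows why the quality of the first-stage embeddings bounds what reranking can recover.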

Title: RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement

Authors: Libo Zhu, Zihan Zhou, Xiaoyang Liu, Weihang Zhang, Keyu Shi, Yifan Fu, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24644
Pdf URL: https://arxiv.org/pdf/2509.24644
Copy Paste: [[2509.24644]] RIFLE: Removal of Image Flicker-Banding via Latent Diffusion Enhancement(https://arxiv.org/abs/2509.24644)
Keywords: restoration
Abstract: Capturing screens is now routine in our everyday lives. But photographs of emissive displays are often affected by flicker-banding (FB): alternating bright–dark stripes that arise from temporal aliasing between a camera's rolling-shutter readout and the display's brightness modulation. Unlike moire degradation, which has been extensively studied, FB remains underexplored despite its frequent and severe impact on readability and perceived quality. We formulate FB removal as a dedicated restoration task and introduce Removal of Image Flicker-Banding via Latent Diffusion Enhancement (RIFLE), a diffusion-based framework designed to remove FB while preserving fine details. We propose the flicker-banding prior estimator (FPE), which predicts key banding attributes and injects them into the restoration network. Additionally, Masked Loss (ML) is proposed to concentrate supervision on banded regions without sacrificing global fidelity. To overcome data scarcity, we provide a simulation pipeline that synthesizes FB in the luminance domain with stochastic jitter in banding angle, banding spacing, and banding width. Feathered boundaries and sensor noise are also applied for a more realistic simulation. For evaluation, we collect a paired real-world FB dataset with pixel-aligned banding-free references captured via long exposure. Across quantitative metrics and visual comparisons on our real-world dataset, RIFLE consistently outperforms recent image reconstruction baselines from mild to severe flicker-banding. To the best of our knowledge, it is the first work to study the simulation and removal of FB. Our work establishes a strong foundation for subsequent research in both dataset construction and removal model design. Our dataset and code will be released soon.
摘要：捕获屏幕现在是我们日常生活中的常规。但是，发射显示器的照片通常受闪烁式（FB）的影响，闪光灯（FB）是由摄像机的滚动式读数和显示器的亮度调制而产生的明暗交替的条纹。与已进行了广泛研究的Moire降解不同，尽管FB频繁且严重影响了可读性和质量，但FB仍未得到充实。我们将FB删除作为专用的恢复任务，并通过潜在扩散增强，RIFLE（一种基于扩散的框架）引入图像闪烁束缚，旨在删除FB，同时保留细节。我们提出了闪烁式循环先验估计器（FPE），该估计器预测关键的束带属性并将其注入恢复网络。此外，提出了蒙面损失（ML），以将监督集中在带状区域，而无需牺牲全球保真度。为了克服数据稀缺性，我们提供了一个模拟管道，该管道将亮度域中的FB与随机抖动，带角，带间距和频带宽度合成。羽毛边界和传感器噪声也用于更逼真的模拟。为了进行评估，我们收集了一个配对的现实世界FB数据集，该数据集和通过长时间曝光捕获的像素对齐的无带谱带的参考。在我们的实际数据集中的定量指标和视觉比较中，RIFLE始终优于最新的图像重建基线，从轻度到重度闪烁。据我们所知，这是研究FB的模拟和删除的第一项工作。我们的工作为随后的数据集构建和拆卸模型设计奠定了良好的基础。我们的数据集和代码将很快发布。
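Illustrative sketch (not the authors' code): the abstract describes synthesizing FB in the luminance domain with random jitter in banding angle, spacing, and width, plus feathered boundaries and sensor noise. The Python below is one plausible way to do that with NumPy. The function name, the parameter ranges, the `strength` and `feather` values, and the Gaussian noise model are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def synthesize_flicker_banding(luma, rng, strength=0.4, feather=0.1, noise_sigma=0.01):
    """Darken periodic stripes in a luminance image with values in [0, 1].

    All ranges below are hypothetical; the paper's actual settings are not
    given in the abstract.
    """
    h, w = luma.shape
    angle = rng.uniform(-np.pi / 12, np.pi / 12)   # jittered banding angle
    spacing = rng.uniform(30.0, 80.0)              # jittered band period in pixels
    width_frac = rng.uniform(0.3, 0.6)             # jittered band width (fraction of period)

    yy, xx = np.mgrid[0:h, 0:w]
    # Position along the banding normal, measured in periods, with a random offset.
    t = (yy * np.cos(angle) + xx * np.sin(angle)) / spacing + rng.uniform()
    phase = t - np.floor(t)                        # position within one period, in [0, 1)
    dist = np.abs(phase - 0.5)                     # distance from the band centre

    # Feathered band mask: 1 inside the band, 0 outside, linear ramp at the edges.
    mask = np.clip((width_frac / 2 - dist) / feather + 0.5, 0.0, 1.0)

    banded = luma * (1.0 - strength * mask)        # attenuate luminance inside bands
    banded = banded + rng.normal(0.0, noise_sigma, size=luma.shape)  # sensor noise (assumed Gaussian)
    return np.clip(banded, 0.0, 1.0), mask
```

The returned mask could also serve as the region indicator for a masked loss of the kind the abstract mentions, though how the paper defines that loss is not stated here.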

Title: Learning Object-Centric Representations Based on Slots in Real World Scenarios

Authors: Adil Kaan Akan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24652
Pdf URL: https://arxiv.org/pdf/2509.24652
Copy Paste: [[2509.24652]] Learning Object-Centric Representations Based on Slots in Real World Scenarios(https://arxiv.org/abs/2509.24652)
Keywords: generation, generative
Abstract: A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine-grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object-level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot-conditioned modules for objects, reducing text-conditioning bias and achieving state-of-the-art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer-based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object-centric generative modeling for images and videos. By bridging human object-based perception and machine learning, it expands the design space for interactive, structured, and user-driven generative tools in creative, scientific, and practical domains.
摘要：AI中的一个核心目标是将场景表示为离散对象的组成，从而实现细粒度，可控的图像和视频生成。然而，领先的扩散模型可以整体处理图像并依赖文本调节，从而为对象级编辑创造了不匹配。该论文引入了一个框架，该框架适应了以对象为中心的合成的强大预验扩散模型，同时保持其生成能力。我们确定了一个核心挑战：平衡全局场景连贯性与分离的对象控制。我们的方法将基于轻巧的基于插槽的调节整合到预验证的模型中，在提供特定于对象的操作的同时保留其视觉先验。对于图像，SLOTADAPT增强了带有寄存器令牌的扩散模型，用于对象的背景/样式和插槽条件模块，减少文本条件偏置并实现最新的最先进，从而导致对象发现，分段，组成编辑和可控制的图像生成。我们进一步将框架扩展到视频。我们的方法使用不变的插槽注意（ISA）将对象身份与姿势和基于变压器的时间聚合器分开，我们的方法在跨帧之间保持一致的对象表示和动态。这将在无监督的视频对象分割和重建中产生新的基准测试，并支持高级编辑任务，例如删除对象，替换和插入，而无需明确的监督。总体而言，这项工作为图像和视频建立了一种以对象为中心的生成建模的方法。通过桥接基于对象的感知和机器学习，它扩展了在创意，科学和实用领域中的交互式，结构化和用户驱动的生成工具的设计空间。

Title: SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Authors: Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24695
Pdf URL: https://arxiv.org/pdf/2509.24695
Copy Paste: [[2509.24695]] SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer(https://arxiv.org/abs/2509.24695)
Keywords: generation
Abstract: We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
摘要：我们介绍了Sana-Video，这是一种小型扩散模型，可以有效地生成高达720x1280分辨率和微小长度持续时间的视频。 Sana-Video综合了高分辨率，高质量和长视频，具有强烈的文本视频对齐方式，以非常快的速度，可在RTX 5090 GPU上部署。两种核心设计可确保我们的高效，有效和长时间的视频生成：（1）线性DIT：我们利用线性注意作为核心操作，鉴于视频生成中处理了大量的标记，这比香草的注意力更有效。（2）用于块线性注意的恒定内存KV缓存：我们通过采用恒定内存状态来设计长时间视频生成的障碍自回归方法，该方法源自线性注意的累积属性。此KV缓存以固定的内存成本提供了线性DIT，以全局上下文，从而消除了对传统的KV缓存的需求，并实现了高效的，长时间的视频生成。此外，我们还探索了有效的数据过滤器和模型培训策略，将培训成本缩小到64 H100 GPU的12天，这仅是电影gen成本的1％。鉴于其低成本，Sana-Video与现代最先进的小型扩散模型（例如WAN 2.1-1.3B和Skyreel-V2-1.3B）相比，达到了竞争性能，而在测得的延迟中的速度也快16倍。此外，SANA-VIDEO可以用NVFP4精度部署在RTX 5090 GPU上，从而加速了从71s到29s（2.4倍速度）生成5秒720p视频的推理速度。总而言之，Sana-Video可实现低成本，高质量的视频生成。
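Illustrative sketch (not the SANA-Video implementation): the constant-memory state mentioned in the abstract relies on a general property of linear attention, namely that the sums over keys and values can be accumulated once and reused. The NumPy code below shows this property for block-wise generation, where each block attends to itself and all earlier blocks. The ReLU-based feature map `phi`, the function names, and the block layout are assumptions for illustration only.

```python
import numpy as np

def phi(x):
    # Positive feature map (assumed here; the paper's kernel may differ).
    return np.maximum(x, 0.0) + 1e-6

def blockwise_linear_attention(q_blocks, k_blocks, v_blocks):
    """Process blocks in order while keeping only a fixed-size running state."""
    d_k = q_blocks[0].shape[-1]
    d_v = v_blocks[0].shape[-1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k)^T v; size independent of video length
    z = np.zeros(d_k)          # running sum of phi(k), used for normalization
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        fk = phi(k)
        S += fk.T @ v
        z += fk.sum(axis=0)
        fq = phi(q)
        outputs.append((fq @ S) / (fq @ z)[:, None])
    return outputs

def reference_attention(q_blocks, k_blocks, v_blocks):
    """Same result computed naively by re-reading every past key and value."""
    outputs = []
    for b, q in enumerate(q_blocks):
        k_all = np.concatenate(k_blocks[: b + 1], axis=0)
        v_all = np.concatenate(v_blocks[: b + 1], axis=0)
        weights = phi(q) @ phi(k_all).T
        weights = weights / weights.sum(axis=1, keepdims=True)
        outputs.append(weights @ v_all)
    return outputs
```

The streaming version stores only `S` and `z`, whose sizes depend on the head dimensions rather than on the number of tokens seen so far, whereas the reference version must keep every past key and value.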

Title: T-POP: Test-Time Personalization with Online Preference Feedback

Authors: Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24696
Pdf URL: https://arxiv.org/pdf/2509.24696
Copy Paste: [[2509.24696]] T-POP: Test-Time Personalization with Online Preference Feedback(https://arxiv.org/abs/2509.24696)
Keywords: generation
Abstract: Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.
摘要：将大型语言模型（LLM）个性化为个别用户偏好是产生一般有用响应的关键步骤。但是，当前的个性化方法不适合新用户，因为它们通常需要缓慢，资源密集的微调或大量已有用户数据，从而造成了一个重大的冷启动问题。为了应对这一挑战，我们通过在文本生成期间收集的在线成对偏好反馈中学习，引入了一个新的实时个性化范式。我们提出了T-pop（通过在线偏好反馈的测试时间个性化），这是一种新型算法，可以协同结合测试时间对齐与决斗匪徒。在不更新LLM参数的情况下，T-Pop通过在线学习捕获用户偏好的奖励功能来引导冷冻LLM的解码过程。通过利用决斗匪徒，T-Pop智能地查询用户在探索他们的偏好和利用学习知识以生成个性化文本之间有效平衡。广泛的实验表明，T-Pop实现了快速和数据有效的个性化，显着优于现有基准，并随着更多的用户交互作用表现出一致的改进。

Title: Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

Authors: Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24702
Pdf URL: https://arxiv.org/pdf/2509.24702
Copy Paste: [[2509.24702]] Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility(https://arxiv.org/abs/2509.24702)
Keywords: generation
Abstract: Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.
摘要：扩散模型可以生成逼真的视频，但是现有的方法依赖于从大规模的文本视频数据集中隐含地学习物理推理，该数据集是代价高昂，难以扩展的，并且仍然容易产生违反基本物理定律的令人难以置信的动作。我们介绍了一个无训练的框架，该框架通过明确推理不可能的理由并指导一代人远离推理，从而提高了推理时间的身体合理性。具体来说，我们采用轻量级物理学的推理管道来构建故意编码物理侵入行为的反事实提示。然后，我们提出了一种新型同步的解次指导（SDG）策略，该策略通过同步方向归一化来利用这些提示，以抵消滞后的抑制和轨迹耦合的deno，以减轻累积轨迹偏见，从而确保立即抑制了不可能的含量在整个过程中抑制，并始终如一地抑制了整个DENOO。跨不同物理领域的实验表明，尽管不需要额外的培训，但我们的方法在维持光真相的同时会大大提高物理保真度。消融研究证实了物理感知推理成分和可持续发展目标的互补有效性。特别是，上述两种可持续发展目标的设计也可以单独验证，以促进不可行的内容的抑制和物理合理性的整体增长。这为视频生成建立了一个新的和插件的物理意识范式。

Title: IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

Authors: Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24709
Pdf URL: https://arxiv.org/pdf/2509.24709
Copy Paste: [[2509.24709]] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?(https://arxiv.org/abs/2509.24709)
Keywords: generation
Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at this https URL.
摘要：网页对代码任务需要模型来了解网页的视觉表示并生成相应的代码。但是，现有基准主要集中于静态屏幕截图任务，从而忽略了现实世界Web应用程序基本的动态交互。为了解决此限制，本文介绍了IWR Bench，这是一种新颖的基准，用于评估视频中交互式网页重建中大型视觉模型（LVLM）的功能。 IWR-BENCH包括来自100个现实世界网站的113个精心策划的任务，其中包含1,001个动作，并具有不同的交互复杂性（例如，网络游戏），视觉样式和域。与标准的Web开发实践保持一致，每个任务不仅包括用户交互视频，还包括所有爬行的静态资产（例如，图像，视频）。该基准测试评估了两个基本挑战的模型：从视频和资产中推断交互逻辑的全面多模式推理，以及将此逻辑转化为功能代码的高级代码生成。具有综合度量系统的代理商AS-A-A-A-Gudge框架会自动评估生成的网页的功能正确性和视觉保真度。在28个LVLM上进行的广泛实验表明了一个重大挑战：最佳模型仅达到36.35％的总分，因为功能正确性（24.39％IFS）显着落后于视觉保真度（64.25％VFS）。这些结果突出了当前模型推理时间动态并综合事件驱动逻辑的能力中的临界局限性，从而确立了IWR基础座位，成为视觉研究的挑战性边界。将公开提供基准和评估代码。代码可在此HTTPS URL上找到。

Title: Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Authors: Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24739
Pdf URL: https://arxiv.org/pdf/2509.24739
Copy Paste: [[2509.24739]] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation(https://arxiv.org/abs/2509.24739)
Keywords: generation
Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset comprising 1,567,062 paired CT-PET images and corresponding 2,757 full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks, including medical report generation and visual question answering. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, particularly in low-resource languages, and improving their clinical relevance in Vietnamese healthcare.
摘要：通过大规模多模式数据集培训的Vision语言基础模型（VLMS）通过实现丰富的跨模式推理来促进人工智能的重大进步。尽管在一般领域取得了成功，但由于各种成像方式和多语言临床数据的可用性有限，将这些模型应用于医学成像仍然具有挑战性。大多数现有的医学VLM都接受了成像方式的一部分培训，并主要关注高资源语言，从而限制了它们的普遍性和临床用途。为了解决这些局限性，我们引入了一个新型的越南语多模式数据集，其中包括1,567,062配对的CT-PET图像和相应的2,757个全长临床报告。该数据集旨在填补医疗AI开发中的两个压力差距：（1）现有VLMS培训语料库中缺乏PET/CT成像数据，这阻碍了能够处理功能成像任务的模型的开发；（2）在医学视觉研究中，低资源语言（尤其是越南语言）的代表性不足。据我们所知，这是第一个在越南提供全面的宠物/CT报告对的数据集。我们进一步介绍了一个培训框架，以增强VLMS的学习，包括数据增强和专家验证的测试集。我们进行全面的实验，对下游任务进行最新的VLM，包括医疗报告生成和视觉问题答案。实验结果表明，合并我们的数据集可显着提高现有VLM的性能。我们认为，该数据集和基准将是推进更强大的医学成像VLM的关键步骤，尤其是在低资源语言中，并提高其在越南医疗保健方面的临床意义。

Title: ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Authors: Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24758
Pdf URL: https://arxiv.org/pdf/2509.24758
Copy Paste: [[2509.24758]] ExGS: Extreme 3D Gaussian Compression with Diffusion Priors(https://arxiv.org/abs/2509.24758)
Keywords: restoration
Abstract: Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100× compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at this https URL.
摘要：神经场景表征，例如3D高斯泼溅（3DGS），已经实现了高质量的神经渲染。但是，它们的大量存储和传输成本阻碍了资源受限环境中的部署。现有的压缩方法要么依赖于昂贵的优化，该优化是缓慢且特定于场景的，要么采用无训练的修剪和量化，从而在高压缩比下降低了渲染质量。相比之下，最近以数据驱动的方法为克服这一权衡提供了一个有希望的方向，可以在保持高渲染质量的同时有效地压缩。我们介绍了ExGS，这是一个新颖的前馈框架，该框架将Universal Gaussian Compression（UGC）与GaussPainter相结合，用于极限（Extreme）3DGS压缩。UGC执行无需重新优化的修剪，以积极地减少高斯基元，同时仅保留基本信息，而GaussPainter利用强大的扩散先验和掩码引导的精炼，从大幅修剪的高斯场景中恢复高质量的渲染。与常规的图像修复不同，GaussPainter不仅填充缺失的区域，而且还可以增强可见像素，从而大幅改善降质的渲染结果。为了确保实用性，它采用了轻巧的VAE和一步扩散设计，从而实现了实时恢复。我们的框架甚至可以达到超过100×的压缩（将典型的354.77 MB模型降低至约3.31 MB），同时保留保真度并在挑战性条件下显着提高图像质量。这些结果突出了扩散先验在弥合极端压缩和高质量神经渲染之间差距中的核心作用。我们的代码存储库将在此HTTPS URL发布。
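Worked check of the compression figure quoted above, using only the two sizes given in the abstract (the variable names are illustrative):

```python
# Sizes quoted in the ExGS abstract, in megabytes.
original_mb = 354.77
compressed_mb = 3.31

# Compression ratio = original size / compressed size.
ratio = original_mb / compressed_mb
print(f"compression ratio: {ratio:.1f}x")
```

The ratio comes out at roughly 107, consistent with the "over 100×" claim.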

Title: MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Authors: Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2509.24779
Pdf URL: https://arxiv.org/pdf/2509.24779
Copy Paste: [[2509.24779]] MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models(https://arxiv.org/abs/2509.24779)
Keywords: generative
Abstract: Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark MarS-FM's ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.
摘要：分子动力学（MD）是用于探测蛋白质功能的强大计算显微镜。但是，需要细粒整合和生物分子事件的长时间尺度使MD计算上的昂贵。为了解决这个问题，已经提出了几种生成模型以较低的成本生成替代轨迹。然而，这些模型通常学习固定延迟的过渡密度，从而导致训练信号由频繁但非信息性过渡主导。我们介绍了一类新的生成模型MSM模拟器，相反，该模型学会了通过基础马尔可夫州模型（MSM）定义的离散状态进行采样过渡。我们与马尔可夫空间流量匹配（MARS-FM）实例化，其采样提供了两个以上的数量级速度，与隐式或显式溶剂或显式溶剂MD模拟相比。我们基于MARS-FM通过结构可观察物（例如RMSD，回旋半径和二级结构含量）重现MD统计的能力。我们的评估跨越具有明显的化学和结构多样性的蛋白质结构域（最多500个残基），包括展开事件，并在训练和测试集之间实施严格的序列差异以评估概括。在所有指标中，MARS-FM的表现均优于现有方法，通常要大幅度。

Title: Assessing the risk of future Dunkelflaute events for Germany using generative deep learning

Authors: Felix Strnad, Jonathan Schmidt, Fabian Mockert, Philipp Hennig, Nicole Ludwig
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2509.24788
Pdf URL: https://arxiv.org/pdf/2509.24788
Copy Paste: [[2509.24788]] Assessing the risk of future Dunkelflaute events for Germany using generative deep learning(https://arxiv.org/abs/2509.24788)
Keywords: generation, generative
Abstract: The European electricity power grid is transitioning towards renewable energy sources, characterized by an increasing share of off- and onshore wind and solar power. However, the weather dependency of these energy sources poses a challenge to grid stability, with so-called Dunkelflaute events -- periods of low wind and solar power generation -- being of particular concern due to their potential to cause electricity supply shortages. In this study, we investigate the impact of these events on the German electricity production in the years and decades to come. For this purpose, we adapt a recently developed generative deep learning framework to downscale climate simulations from the CMIP6 ensemble. We first compare their statistics to the historical record taken from ERA5 data. Next, we use these downscaled simulations to assess plausible future occurrences of Dunkelflaute events in Germany under the optimistic low (SSP2-4.5) and high (SSP5-8.5) emission scenarios. Our analysis indicates that both the frequency and duration of Dunkelflaute events in Germany in the ensemble mean are projected to remain largely unchanged compared to the historical period. This suggests that, under the considered climate scenarios, the associated risk is expected to remain stable throughout the century.
摘要：欧洲电力电网正在向可再生能源过渡，其特征是越来越多的陆上风能和太阳能份额。但是，这些能源的天气依赖性对电网稳定性构成了挑战，因为他们可能会引起电力供应短缺的潜力，因此所谓的dunkelflaute事件（低风能和太阳能发电的时期）特别关注。在这项研究中，我们研究了这些事件对未来几年和几十年来德国电力生产的影响。为此，我们将最近开发的生成深度学习框架调整为CMIP6集合的降低气候模拟。我们首先将他们的统计数据与从ERA5数据中获取的历史记录进行比较。接下来，我们使用这些缩放的模拟来评估德国在乐观的低（SSP2-4.5）和高（SSP5-8.5）发射方案下，德国事件发生合理的未来发生。我们的分析表明，与历史时期相比，整体平均值中德国事件的频率和持续时间都在很大程度上保持不变。这表明，在考虑到气候的情况下，相关风险预计在整个世纪将保持稳定。

Title: Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Authors: Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24798
Pdf URL: https://arxiv.org/pdf/2509.24798
Copy Paste: [[2509.24798]] Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation(https://arxiv.org/abs/2509.24798)
Keywords: generation
Abstract: We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method enables causal interventions on target attributes, consistently propagating their effects to causal dependents without altering the core identity of the image. In contrast to prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling augmented with two attribute regularization strategies: prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and a conditioned token contrastive loss to disentangle attribute factors and reduce spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, with up to 91% MAE reduction on Pendulum for accurate attribute control and 87% FID reduction on ADNI for high-fidelity MRI image generation. These results show that our approach enables robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation.
摘要：我们提出了因果适配器，这是一个模块化框架，可适应冷冻文本对图像扩散式主链以产生反事实图像。我们的方法可以使因果干预目标对目标属性进行，从而始终如一地传播其对因果关系依赖者的影响，而不会改变图像的核心身份。与先前依赖于没有明确因果结构的迅速工程的方法相反，因果适应器利用结构性因果建模增强了两种属性正则化策略的增强：迅速平衡的注入将因果属性与文本嵌入的因素与精确的语义控制和较低的对相反的因素和不合时宜的因素相吻合。因果适配器在合成和现实世界数据集上都达到了最先进的性能，在Pendulum上MAE降低高达91％以实现准确的属性控制，在ADNI上FID降低87％以实现高保真MRI图像生成。这些结果表明，我们的方法可以通过忠实的属性修改和强大的身份保存实现强大的，可推广的反事实编辑。

Title: Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data

Authors: Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2509.24840
Pdf URL: https://arxiv.org/pdf/2509.24840
Copy Paste: [[2509.24840]] Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data(https://arxiv.org/abs/2509.24840)
Keywords: generation, generative
Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
摘要：单细胞RNA测序通过在细胞分辨率下测量基因表达来改变生物学，从而为细胞类型，状态和疾病环境提供信息。最近，单细胞基础模型已成为直接从表达配置文件中学习可转移表示的强大工具，从而提高了分类和聚类任务的性能。但是，这些模型仅限于离散的预测头，这些预测头将细胞复杂性崩溃为预定义的标签，这些标签无法捕获生物学家所需的更丰富，上下文解释。我们介绍了Cell2Text，这是一种多模式生成框架，将Scrna-Seq概况转换为结构化的自然语言描述。通过将来自单细胞基础模型的基因级嵌入与预处理的大语言模型相结合，Cell2Text生成了相干的摘要，这些摘要捕获细胞身份，组织起源，疾病关联和途径活性，从而推广到看不见的细胞。从经验上讲，Cell2Text优于分类精度的基准，使用基于Pagerank的相似性指标表现出强大的本体论一致性，并且在文本生成中实现了很高的语义忠诚。这些结果表明，与自然语言耦合表达数据既具有更强的预测性能，又提供了可解释的输出，这指出了可扩展的不看到细胞的标签有效表征的路径。

Title: ELPG-DTFS: Prior-Guided Adaptive Time-Frequency Graph Neural Network for EEG Depression Diagnosis

Authors: Jingru Qiu, Jiale Liang, Xuanhan Fan, Mingda Zhang, Zhenli He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24860
Pdf URL: https://arxiv.org/pdf/2509.24860
Copy Paste: [[2509.24860]] ELPG-DTFS: Prior-Guided Adaptive Time-Frequency Graph Neural Network for EEG Depression Diagnosis(https://arxiv.org/abs/2509.24860)
Keywords: generation
Abstract: Timely and objective screening of major depressive disorder (MDD) is vital, yet diagnosis still relies on subjective scales. Electroencephalography (EEG) provides a low-cost biomarker, but existing deep models treat spectra as static images, fix inter-channel graphs, and ignore prior knowledge, limiting accuracy and interpretability. We propose ELPG-DTFS, a prior-guided adaptive time-frequency graph neural network that introduces: (1) channel-band attention with cross-band mutual information, (2) a learnable adjacency matrix for dynamic functional links, and (3) a residual knowledge-graph pathway injecting neuroscience priors. On the 128-channel MODMA dataset (53 subjects), ELPG-DTFS achieves 97.63% accuracy and 97.33% F1, surpassing the 2025 state-of-the-art ACM-GNN. Ablation shows that removing any module lowers F1 by up to 4.35, confirming their complementary value. ELPG-DTFS thus offers a robust and interpretable framework for next-generation EEG-based MDD diagnostics.
摘要：重大抑郁症（MDD）的及时和客观筛查至关重要，但诊断仍然依赖于主观尺度。脑电图（EEG）提供了低成本的生物标志物，但现有的深层模型将光谱视为静态图像，修复通道间图并忽略先验知识，限制准确性和解释性。我们提出了ELPG-DTF，这是一种先前指导的自适应时间频率图神经网络，引入：（1）带有跨波段共同信息的通道波段注意，（2）一个可学习的动态功能链接的可学习的邻接矩阵，（3）残基知识途径向神经科学注射神经科学途径。在128通道MODMA数据集（53个受试者）上，ELPG-DTFS的精度为97.63％，97.33％的F1超过了2025个最先进的ACM-GNN。消融表明，将任何模块删除降低了F1高达4.35，从而确认其互补值。因此，ELPG-DTF为下一代基于EEG的MDD诊断提供了强大且可解释的框架。

Title: Environment-Aware Satellite Image Generation with Diffusion Models

Authors: Nikos Kostagiolas, Pantelis Georgiades, Yannis Panagakis, Mihalis A. Nicolaou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24875
Pdf URL: https://arxiv.org/pdf/2509.24875
Copy Paste: [[2509.24875]] Environment-Aware Satellite Image Generation with Diffusion Models(https://arxiv.org/abs/2509.24875)
Keywords: generation, generative
Abstract: Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to the field of remote sensing signaled the first successful trials towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context, that is able to generate satellite images by conditioning from any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method is i) to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporating a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations measured using 6 different metrics) in the trials of single-image and temporal generation. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for usage in downstream tasks. The collected 3-modal dataset is to our knowledge, the first publicly-available dataset to combine data from these three different mediums.
摘要：基于扩散的基础模型最近由于生成高质量和忠诚度的图像的能力而引起了生成建模领域的广泛关注。尽管并不简单，但他们最近在遥感领域的应用表明，首次成功的试验旨在利用包含多模式信息的大量公开数据集。尽管他们成功了，但现有的方法面临着相当大的局限性：它们依靠有限的环境环境，与缺失或损坏的数据斗争，并且通常无法可靠地反映生成的输出中的用户意图。在这项工作中，我们提出了一个以环境环境为条件的新型扩散模型，该模型能够通过从三个不同的控制信号的任何组合调节来生成卫星图像：a）文本，b）元数据和c）视觉数据。与以前的作品相反，提出的方法是i）据我们的知识，这是在动态环境条件下以卫星图像生成作为控制信号的一部分的第一个来调节卫星图像的生成； ii）融合了元数据融合策略，该策略将嵌入相互作用归因于部分腐败和/或失踪观测值归因于嵌入相互作用。我们的方法在单图像和时间生成的试验中，我们的方法均优于先前的方法（对缺失元数据的鲁棒性（对输入的较高响应，对输入的响应性更高）和定量（使用6种不同指标测量的富裕性，准确性和世代质量）。报告的结果支持我们的假设，即在环境环境上的调节可以改善卫星图像基础模型的性能，并使我们的模型成为下游任务中使用的有前途的候选人。据我们所知，收集到的3型数据集是第一个结合这三种不同媒介的数据的公开数据集。

Title: ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation

Authors: Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.24878
Pdf URL: https://arxiv.org/pdf/2509.24878
Copy Paste: [[2509.24878]] ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation(https://arxiv.org/abs/2509.24878)
Keywords: generative
Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets--DJI-day, Bosonplus-day, and Bosonplus-night--captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: this http URL
摘要：配对的RGB热数据对于视觉热传感器融合和交叉模式任务至关重要，包括重要的应用，例如多模式图像对齐和检索。但是，同步和校准的RGB热图像的稀缺性呈现出这些领域进步的主要障碍。为了克服这一挑战，RGB至热（RGB-T）图像翻译已成为一种有前途的解决方案，从而实现了来自丰富的RGB数据集的热图像的合成，以供训练。在这项研究中，我们提出了Theylgen，这是一种基于自适应流量的生成模型，用于RGB-T图像翻译，结合了RGB图像调节体系结构和样式触发性机制。为了支持大规模培训，我们策划了八个公共卫星卫星，空中和地面RGB-T配对数据集，并推出了三个新的大型卫星卫星 - 卫星RGB-T数据集 - dji-day，bosonplus-day，bosonplus-day和Bosonplus-night and Bosonplus-nighter-bosonplus-night time times times times times跨不同的时代，感官类型和地理区域。对多个RGB-T基准测试的广泛评估表明，与现有基于GAN的基于GAN和基于扩散的方法相比，Thermal具有可比性或出色的翻译性能。据我们所知，Thermalgen是第一个能够合成热图像的RGB-T图像翻译模型，这些模型反映了观点，传感器特征和环境条件中的显着变化。项目页面：此HTTP URL

Title: MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment

Authors: Fankai Jia, Daisong Gan, Zhe Zhang, Zhaochi Wen, Chenchen Dan, Dong Liang, Haifeng Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.24888
Pdf URL: https://arxiv.org/pdf/2509.24888
Copy Paste: [[2509.24888]] MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment(https://arxiv.org/abs/2509.24888)
Keywords: quality assessment
Abstract: Magnetic resonance imaging (MRI) quality assessment is crucial for clinical decision-making, yet remains challenging due to data scarcity and protocol variability. Traditional approaches face fundamental trade-offs: signal-based methods like MRIQC provide quantitative metrics but lack semantic understanding, while deep learning approaches achieve high accuracy but sacrifice interpretability. To address these limitations, we introduce the Multimodal MRI Quality Assessment (MMRQA) framework, pioneering the integration of multimodal large language models (MLLMs) with acquisition-aware signal processing. MMRQA combines three key innovations: robust metric extraction via MRQy augmented with simulated artifacts, structured transformation of metrics into question-answer pairs using Qwen, and parameter-efficient fusion through Low-Rank Adaptation (LoRA) of LLaVA-OneVision. Evaluated on MR-ART, FastMRI, and MyConnectome benchmarks, MMRQA achieves state-of-the-art performance with strong zero-shot generalization, as validated by comprehensive ablation studies. By bridging quantitative analysis with semantic reasoning, our framework generates clinically interpretable outputs that enhance quality control in dynamic medical settings.
摘要：磁共振成像（MRI）质量评估对于临床决策至关重要，但由于数据稀缺和协议可变性，因此仍然具有挑战性。传统方法面临基本的权衡：基于信号的方法（例如MRIQC）提供了定量指标，但缺乏语义理解，而深度学习方法则具有很高的准确性，但牺牲了可解释性。为了解决这些局限性，我们介绍了多模式MRI质量评估（MMRQA）框架，开创了多模式大语言模型（MLLM）与收购感知信号处理的整合。 MMRQA结合了三个关键创新：通过MRQY通过模拟伪影增强的鲁棒指标提取，使用QWEN将指标变成问答对的结构化转换，以及通过LLAVA-ONEVISION的低秩适应性（Lora）通过低秩适应性（Lora）进行参数效率融合。 MMRQA对MR-ART，FASTMRI和MYCONNECNECTOME基准测试进行了评估，通过全面的消融研究验证了最先进的性能。通过用语义推理桥接定量分析，我们的框架产生了可解释的输出，从而增强了动态医疗环境中的质量控制。

Title: VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines

Authors: Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24891
Pdf URL: https://arxiv.org/pdf/2509.24891
Copy Paste: [[2509.24891]] VAGUEGAN: Stealthy Poisoning and Backdoor Attacks on Image Generative Pipelines(https://arxiv.org/abs/2509.24891)
Keywords: generation, generative
Abstract: Generative models such as GANs and diffusion models are widely used to synthesize photorealistic images and to support downstream creative and editing tasks. While adversarial attacks on discriminative models are well studied, attacks targeting generative pipelines where small, stealthy perturbations in inputs lead to controlled changes in outputs are less explored. This study introduces VagueGAN, an attack pipeline combining a modular perturbation network PoisonerNet with a Generator Discriminator pair to craft stealthy triggers that cause targeted changes in generated images. Attack efficacy is evaluated using a custom proxy metric, while stealth is analyzed through perceptual and frequency domain measures. The transferability of the method to a modern diffusion based pipeline is further examined through ControlNet guided editing. Interestingly, the experiments show that poisoned outputs can display higher visual quality compared to clean counterparts, challenging the assumption that poisoning necessarily reduces fidelity. Unlike conventional pixel level perturbations, latent space poisoning in GANs and diffusion pipelines can retain or even enhance output aesthetics, exposing a blind spot in pixel level defenses. Moreover, carefully optimized perturbations can produce consistent, stealthy effects on generator outputs while remaining visually inconspicuous, raising concerns for the integrity of image generation pipelines.
摘要：诸如gan和扩散模型之类的生成模型被广泛用于合成逼真的图像并支持下游的创意和编辑任务。虽然对对歧视模型的对抗性攻击进行了充分的研究，但针对生成管道的攻击，在输入中，小而隐秘的扰动导致输出的受控变化较少。这项研究介绍了Vaguegan，这是一种攻击管道，将模块化扰动网络毒药与发电机歧视对生成对制造隐形触发器，从而导致产生图像的目标变化。使用自定义代理度量评估攻击功效，而通过感知和频域度量分析隐形。该方法向基于现代扩散管道的转移性通过控制网的指导编辑进一步研究。有趣的是，实验表明，与清洁同行相比，中毒输出可以显示出更高的视觉质量，从而挑战了中毒一定会降低保真度的假设。与传统的像素级扰动不同，gan和扩散管道中的潜在空间中毒可以保留甚至增强输出美学，从而在像素级防御中暴露了盲点。此外，精心优化的扰动会对发电机输出产生一致的隐秘影响，同时视觉上不起眼，这引起了对图像生成管道的完整性的关注。

Title: Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Authors: Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24899
Pdf URL: https://arxiv.org/pdf/2509.24899
Copy Paste: [[2509.24899]] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer(https://arxiv.org/abs/2509.24899)
Keywords: generation
Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, mixing softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
摘要：基于变压器的视频扩散模型（VDMS）提供了最先进的视频生成质量，但受到自我注意力的二次成本的约束，使长序列和高分辨率在计算上昂贵。虽然线性注意力提供了次级的复杂性，但先前的尝试无法与软敏注意的表现力相匹配而无需昂贵的再训练。我们介绍了\ textit {注意手术}，这是\ textIt {线性化}或\ textit {杂交}的有效框架，而无需从scratch培训的情况下，请注意VDM的注意。受到语言模型的最新进展的启发，我们的方法结合了一种新型的混合注意机制，将软性蒸馏和线性代币混合使用，带有轻量级的蒸馏和微调管道，只需几个GPU即可。此外，我们结合了一种成本感知的扩展策略，以平衡各个层的表现力和效率。注意手术应用于最先进的VDM WAN2.1 1.3B，它实现了第一个竞争性的亚二次注意视频扩散模型，从而将注意力成本降低了40 \％，同时维持在标准VBench和VBENCH和VBENCH和VBENCH-2.0 BENCHMARKS上衡量的发电质量。
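
Illustration (not from the paper): the abstract above contrasts quadratic softmax attention with sub-quadratic linear attention. The sketch below shows the generic kernelized linear-attention identity only, assuming the common elu(x)+1 feature map; it does not reproduce the hybrid softmax/linear token mixing or the distillation pipeline of Attention Surgery, whose details are not given here. The function names are hypothetical.

```python
import math


def phi(x):
    """Feature map elu(x) + 1 (an assumed choice); always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal linear attention in O(N * d_k * d_v) time.

    Keys and values are summarized once into S = sum_j phi(k_j) v_j^T
    and z = sum_j phi(k_j); each query then reads from that summary.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """Same attention computed the O(N^2) way, for comparison only."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point error; the linear form never builds the N x N weight matrix, which is where the cost saving for long video token sequences would come from.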

Title: OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Authors: Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24900
Pdf URL: https://arxiv.org/pdf/2509.24900
Copy Paste: [[2509.24900]] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing(https://arxiv.org/abs/2509.24900)
Keywords: generation
Abstract: The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
摘要：统一的多模型用于图像生成和编辑的性能从根本上受到其培训数据的质量和全面性的限制。尽管现有数据集涵盖了基本任务，例如样式传输和简单的对象操纵，但它们通常缺乏现实世界应用所需的系统结构和具有挑战性的方案。为了解决这个瓶颈，我们介绍了使用新型方法构建的大规模数据集OpenGPT-4O图像，该方法将层次结构的任务分类法与自动数据生成结合在一起。我们的分类法不仅包括文本渲染和样式控制等基本功能，而且还引入了高度实用但具有挑战性的类别，例如化学插图的科学图像和复杂的教学编辑，需要同时执行多个操作。通过利用结构化资源池和GPT-4O的自动管道，我们生成具有控制多样性的80K高质量指令对，涵盖了11个主要域和51个子任务。广泛的实验表明，我们数据集中的微调领先模型可在多个基准测试中实现显着的性能增长，在编辑任务（IMGEDIT BENCON上的UNIWORLD-V1）上提高了18％，而生成任务的13％（Harmon on Geneval上）。我们的工作表明，系统数据构建是提高多模式AI功能的关键。

Title: Segmentor-Guided Counterfactual Fine-Tuning for Image Synthesis

Authors: Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.24913
Pdf URL: https://arxiv.org/pdf/2509.24913
Copy Paste: [[2509.24913]] Segmentor-Guided Counterfactual Fine-Tuning for Image Synthesis(https://arxiv.org/abs/2509.24913)
Keywords: generation
Abstract: Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient's age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: this https URL.
摘要：反事实图像生成是增强培训数据，偏见数据集和建模疾病的强大工具。当前的方法依靠外部分类器或回归剂来提高主题级干预措施的有效性（例如，改变患者的年龄）。对于特定于结构的干预措施（例如，更改胸部X光片中左肺的面积），我们表明这是不足的，并且可能导致整个图像域的不良全局效应。先前的工作使用像素级标签图作为指导，要求用户提供乏味且难以获得的假设分割。我们提出了分段引导的反事实微调（SEG-CFT），该调整保留了介入标量值，结构特异性变量的简单性，同时产生局部相干和有效的反事实。我们证明了产生逼真的胸部X光片的能力，并显示了对冠状动脉疾病进行建模的有希望的结果。代码：此HTTPS URL。

Title: Scalable GANs with Transformers

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.24935
Pdf URL: https://arxiv.org/pdf/2509.24935
Copy Paste: [[2509.24935]] Scalable GANs with Transformers(https://arxiv.org/abs/2509.24935)
Keywords: generation, generative
Abstract: Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines.
摘要：可伸缩性促进了生成建模的最新进展，但其原理仍未得到逆转学习。我们通过两个设计选择研究了生成对抗网络（GAN）的可伸缩性，这些设计在其他类型的生成模型中有效：紧凑的变异自动编码器潜在空间中的训练，并采用纯粹的基于变压器的生成器和歧视器。潜在空间中的培训可以在保持感知忠诚度的同时进行有效的计算，并且这种效率与普通变形金刚的效率搭配，其绩效与计算预算相比。在这些选择的基础上，我们分析了天真缩放gan时出现的故障模式。具体而言，我们发现发电机中早期层的利用不足，并且随着网络量表的优化不稳定性。因此，我们提供简单且规模友好的解决方案，作为轻量级中间监督和宽度感知的学习率调整。我们的实验表明，GAT是一种纯粹的基于变压器的潜在空间gan，可以在各种能力（s至XL）上可靠地可靠地训练。此外，GAT-XL/2在Imagenet-256上实现了最先进的单步，课堂生成性能（FID为2.96），仅40个时代，比强质基线少6倍。

Title: OAT-FM: Optimal Acceleration Transport for Improved Flow Matching

Authors: Angxiao Yue, Anqi Dong, Hongteng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.24936
Pdf URL: https://arxiv.org/pdf/2509.24936
Copy Paste: [[2509.24936]] OAT-FM: Optimal Acceleration Transport for Improved Flow Matching(https://arxiv.org/abs/2509.24936)
Keywords: generative
Abstract: As a powerful technique in generative modeling, Flow Matching (FM) aims to learn velocity fields from noise to data, which is often explained and implemented as solving Optimal Transport (OT) problems. In this study, we bridge FM and the recent theory of Optimal Acceleration Transport (OAT), developing an improved FM method called OAT-FM and exploring its benefits in both theory and practice. In particular, we demonstrate that the straightening objective hidden in existing OT-based FM methods is mathematically equivalent to minimizing the physical action associated with acceleration defined by OAT. Accordingly, instead of enforcing constant velocity, OAT-FM optimizes the acceleration transport in the product space of sample and velocity, whose objective corresponds to a necessary and sufficient condition of flow straightness. An efficient algorithm is designed to achieve OAT-FM with low complexity. OAT-FM motivates a new two-phase FM paradigm: Given a generative model trained by an arbitrary FM method, whose velocity information has been relatively reliable, we can fine-tune and improve it via OAT-FM. This paradigm eliminates the risk of data distribution drift and the need to generate a large number of noise data pairs, which consistently improves model performance in various generative tasks. Code is available at: this https URL
摘要：作为生成建模的强大技术，流匹配（FM）旨在学习从噪声到数据的速度字段，通常将其解释并实现为解决最佳传输（OT）问题。在这项研究中，我们桥接FM和最新的最佳加速运输理论（OAT），开发了一种改进的FM方法，称为OAT-FM，并在理论和实践中探索其益处。特别是，我们证明了隐藏在现有的基于OT的FM方法中的拉直目标在数学上与最大程度地减少与燕麦定义的加速度相关的物理动作。因此，OAT-FM没有执行恒定速度，而是优化了样品和速度产品空间中的加速器传输，其目标对应于流动直率的必要和充分条件。有效的算法旨在实现低复杂性的燕麦-FM。 OAT-FM激发了新的两阶段FM范式：给定一个由任意FM方法训练的生成模型，该模型的速度信息相对可靠，我们可以通过OAT-FM进行微调和改进。该范例消除了数据分布漂移的风险，以及需要生成大量噪声数据对的需要，从而始终如一地改善各种生成任务中的模型性能。代码可用：此HTTPS URL

Title: On-the-Fly Data Augmentation for Brain Tumor Segmentation

Authors: Ishika Jain, Siri Willems, Steven Latre, Tom De Schepper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24973
Pdf URL: https://arxiv.org/pdf/2509.24973
Copy Paste: [[2509.24973]] On-the-Fly Data Augmentation for Brain Tumor Segmentation(https://arxiv.org/abs/2509.24973)
Keywords: generative
Abstract: Robust segmentation across both pre-treatment and post-treatment glioma scans can be helpful for consistent tumor monitoring and treatment planning. BraTS 2025 Task 1 addresses this by challenging models to generalize across varying tumor appearances throughout the treatment timeline. However, training such generalized models requires access to diverse, high-quality annotated data, which is often limited. While data augmentation can alleviate this, storing large volumes of augmented 3D data is computationally expensive. To address these challenges, we propose an on-the-fly augmentation strategy that dynamically inserts synthetic tumors using pretrained generative adversarial networks (GliGANs) during training. We evaluate three nnU-Net-based models and their ensembles: (1) a baseline without external augmentation, (2) a regular on-the-fly augmented model, and (3) a model with customized on-the-fly augmentation. Built upon the nnU-Net framework, our pipeline leverages pretrained GliGAN weights and tumor insertion methods from prior challenge-winning solutions. An ensemble of the three models achieves lesion-wise Dice scores of 0.79 (ET), 0.749 (NETC), 0.872 (RC), 0.825 (SNFH), 0.79 (TC), and 0.88 (WT) on the online BraTS 2025 validation platform. This work ranked first in the BraTS Lighthouse Challenge 2025 Task 1- Adult Glioma Segmentation.
摘要：在治疗前和治疗后神经胶质瘤扫描中均可有助于一致的肿瘤监测和治疗计划。 Brats 2025任务1通过挑战模型在整个治疗时间表中跨越不同的肿瘤外观的概括来解决这一问题。但是，培训这种广义模型需要访问多种高质量的注释数据，这通常受到限制。尽管数据扩展可以减轻这一点，但存储大量增强的3D数据在计算上很昂贵。为了应对这些挑战，我们提出了一种直立的增强策略，该策略在训练过程中使用预验证的生成对抗网络（Gligans）动态插入合成肿瘤。我们评估了三个基于NNU的基于NNU的模型及其集团：（1）基线无外部增强，（2）常规的即时增强模型，以及（3）具有自定义的即时增强的模型。基于NNU-NET框架，我们的管道利用了先前挑战的解决方案的预读的Gligan重量和肿瘤插入方法。这三种模型的合奏在在线Brats 2025验证平台上的病变骰子得分为0.79（ET），0.749（NETC），0.872（RC），0.825（SNFH），0.79（TC）和0.88（WT）。这项工作在Brats Lighthouse Challenge 2025任务1-成人神经胶质瘤分段中排名第一。
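
Illustration (not from the paper): the abstract above reports Dice scores per tumor sub-region. As a reminder of what the underlying overlap measure computes, here is a minimal sketch of the plain Dice coefficient on flattened binary masks. It is an assumption-laden simplification: the challenge's lesion-wise Dice additionally matches and scores individual lesions, which this sketch does not do, and the function name is hypothetical.

```python
def dice_score(pred, target):
    """Plain Dice coefficient 2|P ∩ T| / (|P| + |T|) for binary masks.

    `pred` and `target` are equal-length sequences of 0/1 values
    (a flattened segmentation mask). Two empty masks count as a perfect match.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction that covers two voxels, one of which overlaps a one-voxel ground truth, scores 2 * 1 / (2 + 1), roughly 0.667.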

Title: Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel

Authors: Haotian Dong, Wenjing Wang, Chen Li, Di Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24979
Pdf URL: https://arxiv.org/pdf/2509.24979
Copy Paste: [[2509.24979]] Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel(https://arxiv.org/abs/2509.24979)
Keywords: generation
Abstract: RGBA video generation, which includes an alpha channel to represent transparency, is gaining increasing attention across a wide range of applications. However, existing methods often neglect visual quality, limiting their practical usability. In this paper, we propose Wan-Alpha, a new framework that generates transparent videos by learning both RGB and alpha channels jointly. We design an effective variational autoencoder (VAE) that encodes the alpha channel into the RGB latent space. Then, to support the training of our diffusion transformer, we construct a high-quality and diverse RGBA video dataset. Compared with state-of-the-art methods, our model demonstrates superior performance in visual quality, motion realism, and transparency rendering. Notably, our model can generate a wide variety of semi-transparent objects, glowing effects, and fine-grained details such as hair strands. The released model is available on our website: this https URL.
摘要：RGBA视频生成包括代表透明度的Alpha通道，在广泛的应用中引起了人们的关注。但是，现有方法通常会忽略视觉质量，从而限制其实际可用性。在本文中，我们建议\ textit {wan-alpha}，这是一个新框架，通过共同学习RGB和Alpha渠道来生成透明的视频。我们设计了一个有效的变异自动编码器（VAE），该变量编码器（VAE）将alpha通道编码为RGB潜在空间。然后，为了支持我们扩散变压器的训练，我们构建了高质量和多样化的RGBA视频数据集。与最先进的方法相比，我们的模型在视觉质量，运动现实主义和透明度渲染方面表现出了卓越的性能。值得注意的是，我们的模型可以生成各种半透明的物体，发光的效果和细粒细节，例如发束。已发布的模型可在我们的网站上找到：\ href {this HTTPS url} {this HTTPS url}。
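
Illustration (not from the paper): the abstract above concerns video with an alpha channel. The sketch below shows only how an alpha value is conventionally used when a generated RGBA pixel is placed over a background (the standard "over" operator with straight, non-premultiplied alpha). It says nothing about how Wan-Alpha encodes alpha in its VAE; the function name is hypothetical.

```python
def composite_over(fg_rgb, alpha, bg_rgb):
    """Blend one foreground pixel over a background pixel.

    Colors are (r, g, b) tuples with channels in [0, 1]; `alpha` in [0, 1]
    is the foreground opacity (0 = fully transparent, 1 = fully opaque).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg_rgb, bg_rgb))
```

A half-transparent red pixel over blue, `composite_over((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0))`, yields `(0.5, 0.0, 0.5)`; semi-transparent objects and glow effects correspond to intermediate alpha values like this one.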

Title: SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation

Authors: Shuang Liang, Jing He, Chuanmeizhi Wang, Lejun Liao, Guo Zhang, Yingcong Chen, Yuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24980
Pdf URL: https://arxiv.org/pdf/2509.24980
Copy Paste: [[2509.24980]] SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation(https://arxiv.org/abs/2509.24980)
Keywords: generation, generative
Abstract: Pre-trained diffusion models provide rich multi-scale latent features and are emerging as powerful vision backbones. While recent works such as Marigold [ke2024repurposing] and Lotus [he2024lotus] adapt diffusion priors for dense prediction with strong cross-domain generalization, their potential for structured outputs (e.g., human pose estimation) remains underexplored. In this paper, we propose SDPose, a fine-tuning framework built upon Stable Diffusion to fully exploit pre-trained diffusion priors for human pose estimation. First, rather than modifying cross-attention modules or introducing learnable embeddings, we directly predict keypoint heatmaps in the SD U-Net's image latent space to preserve the original generative priors. Second, we map these latent features into keypoint heatmaps through a lightweight convolutional pose head, which avoids disrupting the pre-trained backbone. Finally, to prevent overfitting and enhance out-of-distribution robustness, we incorporate an auxiliary RGB reconstruction branch that preserves domain-transferable generative semantics. To evaluate robustness under domain shift, we further construct COCO-OOD, a style-transferred variant of COCO with preserved annotations. With just one-fifth of the training schedule used by Sapiens on COCO, SDPose attains parity with Sapiens-1B/2B on the COCO validation set and establishes a new state of the art on the cross-domain benchmarks HumanArt and COCO-OOD. Furthermore, we showcase SDPose as a zero-shot pose annotator for downstream controllable generation tasks, including ControlNet-based image synthesis and video generation, where it delivers qualitatively superior pose guidance.
摘要：预训练的扩散模型提供了丰富的多尺度潜在特征，并成为强大的视觉骨架。虽然最近的作品，例如Marigold〜 \ citep {ke2024 repurposing}和lotus〜 \ citep {He2024lotus}适应了通过强烈的跨域概括进行密集预测的扩散率，但它们的强烈交叉概括，它们的潜在结构化输出的潜力（例如，人类的姿势估计）仍然不受影响。在本文中，我们提出了\ textbf {sdpose}，这是一个基于稳定扩散的微调框架，以完全利用预训练的扩散先验进行人体姿势估计。首先，我们直接预测SD U-NET图像潜在空间中的关键点热图，而不是修改跨意义模块或引入可学习的嵌入方式，以保留原始的生成先验。其次，我们通过轻巧的卷积姿势头将这些潜在特征映射到关键点热图中，从而避免破坏预训练的主链。最后，为了防止过度拟合和增强分布的鲁棒性，我们结合了一个辅助RGB重建分支，该分支可保留可转移域的生成语义。为了评估域移动下的鲁棒性，我们进一步构建了\ textbf {可可-OOD}，这是一种带有保留注释的可可的样式转移变体。 SDPOSE仅在Coco上使用的培训时间表中只有五分之一，因此在可可验证集中与Sapiens-1b/2b达到了均等，并在跨域基准HumanArt和Coco-OOD上建立了新的最新技术。此外，我们将SDPOSE展示为用于下游可控生成任务的零拍姿势注释器，包括基于控制网络的图像综合和视频生成，它在质量上提供了优越的姿势指导。

Title: PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

Authors: Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.24997
Pdf URL: https://arxiv.org/pdf/2509.24997
Copy Paste: [[2509.24997]] PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion(https://arxiv.org/abs/2509.24997)
Keywords: generation
Abstract: Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
摘要：生成一个完整且可探索的360度视觉世界可实现广泛的下游应用程序。尽管先前的作品已经提高了该领域，但它们仍受到狭窄的视野限制的限制，这阻碍了连续和整体场景的综合，或者摄像机可控性不足，从而限制了用户或自主代理的自由探索。为了解决这个问题，我们提出了Panoworld-X，这是一个具有多种相机轨迹的高保真和可控全景的新型框架。具体而言，我们首先通过通过虚幻引擎在虚拟3D环境中模拟摄像头轨迹来构建一个大型全景视频探索路线对。随着传统视频扩散的感应先验的全景数据未对准球形几何形状，然后我们引入了一个球体意识到的扩散变压器结构，该构建体将等效的特征重新投影到球形表面上，以模拟潜在空间的几何邻接，从而显着增强了视觉速度和斑点的连续性。广泛的实验表明，我们的panoworld-X在各个方面都取得了卓越的性能，包括运动范围，控制精度和视觉质量，强调了其对现实世界应用的潜力。
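
Illustration (not from the paper): the abstract above mentions reprojecting equirectangular features onto the sphere to model geometric adjacency. The sketch below shows only the standard pixel-to-sphere mapping that motivates this, under an assumed axis convention (y up, z forward at the image center). It is not the Sphere-Aware Diffusion Transformer, and the function name is hypothetical.

```python
import math


def equirect_to_unit_sphere(u, v, width, height):
    """Map equirectangular pixel coordinates (u, v) to a unit 3D direction.

    u in [0, width] spans longitude -pi..pi; v in [0, height] spans
    latitude +pi/2 (top row) down to -pi/2 (bottom row).
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

Under this mapping the left and right image borders (u = 0 and u = width) land on the same point of the sphere, and every pixel of the top row collapses to the pole. Tokens that look far apart on the flat image grid can therefore be neighbors on the sphere, which is the mismatch with conventional video diffusion priors that the abstract refers to.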

Title: Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

Authors: Spyros Kondylatos, Gustau Camps-Valls, Ioannis Papoutsis
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.25017
Pdf URL: https://arxiv.org/pdf/2509.25017
Copy Paste: [[2509.25017]] Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting(https://arxiv.org/abs/2509.25017)
Keywords: generation
Abstract: Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In the next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, showing greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.
摘要：野火是最严重的自然危害之一，对人类和自然生态系统构成了重大威胁。野火的风险不断增长，增加了对不仅准确而且可靠的预测模型的需求。深度学习（DL）在预测野火危险方面表现出了希望。但是，对其预测的可靠性的担忧阻碍了其采用，其中一些源于缺乏不确定性量化。为了应对这一挑战，我们提出了一个不确定性感知的DL框架，该框架共同捕获了认知（模型）和Aleatoric（数据）不确定性，以增强短期野火危险预测。在第二天的预测中，与确定性基线相比，我们表现最佳的模型将F1分数提高了2.3％，并将预期的校准误差降低了2.1％，从而提高了预测性技能和校准。我们的实验证实了不确定性估计的可靠性，并说明了其对决策支持的实际实用性，包括识别不确定性阈值，用于拒绝低信心预测以及生成良好的野火危险图，并伴有不确定性层。将预测范围延长至十天，我们观察到，随着时间的流逝，不确定性增加，在环境条件下显示出更大的变化，而认知不确定性仍然稳定。最后，我们表明，尽管在低确定性案例中这两种不确定性类型可能是多余的，但它们在更具挑战性的条件下提供了互补的见解，突显了其对强大的野火危险预测的联合建模的价值。总而言之，我们的方法大大提高了野火危险预测的准确性和可靠性，从而推进了值得信赖的野火DL系统的发展。
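
Illustration (not from the paper): the abstract above reports a reduction in Expected Calibration Error (ECE). The sketch below shows the common equal-width-binning form of ECE, assuming each prediction comes with a confidence in [0, 1] and a correct/incorrect flag. The paper's exact binning and its treatment of the binary danger task are not stated here, so the bin count and function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.

    `confidences` are predicted probabilities of the chosen class in [0, 1];
    `correct` are booleans saying whether each prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

A model that is always 100% confident but right only half the time has an ECE of 0.5, while one whose 90%-confidence predictions are right 90% of the time has an ECE near 0. Thresholding on uncertainty to reject low-confidence predictions, as the abstract describes, is meaningful only when this gap is small.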

Title: MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts

Authors: Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, Ning Miao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.25020
Pdf URL: https://arxiv.org/pdf/2509.25020
Copy Paste: [[2509.25020]] MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts(https://arxiv.org/abs/2509.25020)
Keywords: generation
Abstract: The current paradigm for reasoning in large language models (LLMs) involves models "thinking out loud" via a sequence of tokens, known as chain-of-thought (CoT). This approach, while effective, has several significant drawbacks. Firstly, inference requires autoregressive generation of often thousands of CoT tokens, which is slow and computationally expensive. Secondly, it constrains reasoning to the discrete space of tokens, creating an information bottleneck across reasoning steps. Thirdly, it fundamentally entangles reasoning with token generation, forcing LLMs to "think while speaking," which causes potentially short-sighted reasoning. In light of these limitations, we re-imagine reasoning in LLMs and present a new paradigm: MARCOS. In our approach, rather than autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". Each reasoning step involves a transition of the internal thoughts, where explicit reasoning steps (which may consist of hundreds of tokens) serve as observable variables, which are windows to peek into the implicit thoughts. Since this latent process is incompatible with the standard supervised learning, we further propose a two-phase variational training scheme. Our experiments on three benchmarks demonstrate that MARCOS outperforms existing continuous reasoning methods and, for the first time, achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference. Beyond this, MARCOS offers additional advantages, such as step-level instead of token-level control over randomness, opening significant opportunities for reinforcement learning and reasoning in LLMs.

Title: STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation

Authors: Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25027
Pdf URL: https://arxiv.org/pdf/2509.25027
Copy Paste: [[2509.25027]] STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation(https://arxiv.org/abs/2509.25027)
Keywords: generation
Abstract: Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting: similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward: an entropy-based reward corresponding to the reference model to stabilize learning. By alleviating conflicts between tokens and stabilizing training with an entropy reward, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfers better to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
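For readers unfamiliar with the baseline that STAGE revisits, the following is a minimal sketch of the group-relative advantage used in standard GRPO: rewards for several samples of the same prompt are normalized by the group mean and standard deviation. It does not implement STAGE's similarity-aware reweighting or entropy reward; the function name, the `eps` constant, and the example rewards are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced; the rest are pushed down. In an AR image model the same scalar is typically shared by every token of a sample, which is where conflicting per-token gradients can arise.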

Title: Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

Authors: Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.25050
Pdf URL: https://arxiv.org/pdf/2509.25050
Copy Paste: [[2509.25050]] Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models(https://arxiv.org/abs/2509.25050)
Keywords: generation
Abstract: Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objective, the score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at this https URL.
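The core idea described above, reusing the pretraining flow-matching loss and reweighting each sample by its advantage, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a linear interpolation path from noise `x0` to a generated sample `x1`, and `velocity_model` is a hypothetical one-parameter stand-in for the network. How negative advantages, clipping, or regularization are handled is not specified in the abstract and is left out.

```python
import random


def velocity_model(x_t, t, theta):
    """Hypothetical stand-in for the network: scales each coordinate by theta."""
    return [theta * x for x in x_t]


def advantage_weighted_fm_loss(samples, advantages, theta, rng):
    """Mean over samples of advantage * squared flow-matching error."""
    total = 0.0
    for x1, adv in zip(samples, advantages):
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]            # noise endpoint
        t = rng.random()                                   # time in [0, 1)
        x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]           # velocity of the linear path
        pred = velocity_model(x_t, t, theta)
        sq_err = sum((p - u) ** 2 for p, u in zip(pred, target))
        total += adv * sq_err                              # advantage reweights the sample
    return total / len(samples)


# Two hypothetical generated samples with their advantages
loss = advantage_weighted_fm_loss(
    samples=[[1.0, 2.0], [0.5, -1.0]],
    advantages=[1.0, 0.25],
    theta=0.5,
    rng=random.Random(0),
)
```

With all advantages equal to 1 this reduces to the ordinary flow-matching loss, which is the sense in which the RL objective stays identical in form to pretraining.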

Title: BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation

Authors: Dingning Liu, Haoyu Guo, Jingyi Zhou, Tong He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25077
Pdf URL: https://arxiv.org/pdf/2509.25077
Copy Paste: [[2509.25077]] BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation Engine for Monocular Depth Estimation(https://arxiv.org/abs/2509.25077)
Keywords: generation
Abstract: Monocular Depth Estimation (MDE) is a foundational task for computer vision. Traditional methods are limited by data scarcity and quality, hindering their robustness. To overcome this, we propose BRIDGE, an RL-optimized depth-to-image (D2I) generation framework that synthesizes over 20M realistic and geometrically accurate RGB images, each intrinsically paired with its ground truth depth, from diverse source depth maps. Then we train our depth estimation model on this dataset, employing a hybrid supervision strategy that integrates teacher pseudo-labels with ground truth depth for comprehensive and robust training. This innovative data generation and training paradigm enables BRIDGE to achieve breakthroughs in scale and domain diversity, consistently outperforming existing state-of-the-art approaches quantitatively and in complex scene detail capture, thereby fostering general and robust depth features. Code and models are available at this https URL.

Title: UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation

Authors: Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Taoran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, Qi Tian
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2509.25079
Pdf URL: https://arxiv.org/pdf/2509.25079
Copy Paste: [[2509.25079]] UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation(https://arxiv.org/abs/2509.25079)
Keywords: generation
Abstract: High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos and code are available at this https URL.

Title: Towards generalizable deep ptychography neural networks

Authors: Albert Vong, Steven Henke, Oliver Hoidn, Hanna Ruth, Junjing Deng, Alexander Hexemer, Apurva Mehta, Arianna Gleason, Levi Hancock, Nicholas Schwarz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.25104
Pdf URL: https://arxiv.org/pdf/2509.25104
Copy Paste: [[2509.25104]] Towards generalizable deep ptychography neural networks(https://arxiv.org/abs/2509.25104)
Keywords: generation
Abstract: X-ray ptychography is a data-intensive imaging technique expected to become ubiquitous at next-generation light sources delivering many-fold increases in coherent flux. The need for real-time feedback under accelerated acquisition rates motivates surrogate reconstruction models like deep neural networks, which offer orders-of-magnitude speedup over conventional methods. However, existing deep learning approaches lack robustness across diverse experimental conditions. We propose an unsupervised training workflow emphasizing probe learning by combining experimentally-measured probes with synthetic, procedurally generated objects. This probe-centric approach enables a single physics-informed neural network to reconstruct unseen experiments across multiple beamlines, which is among the first demonstrations of multi-probe generalization. We find probe learning is as important as in-distribution learning; models trained using this synthetic workflow achieve reconstruction fidelity comparable to those trained exclusively on experimental data, even when changing the type of synthetic training object. The proposed approach enables training of experiment-steering models that provide real-time feedback under dynamic experimental conditions.

Title: Score Distillation of Flow Matching Models

Authors: Mingyuan Zhou, Yi Gu, Huangjie Zheng, Liangchen Song, Guande He, Yizhe Zhang, Wenze Hu, Yinfei Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25127
Pdf URL: https://arxiv.org/pdf/2509.25127
Copy Paste: [[2509.25127]] Score Distillation of Flow Matching Models(https://arxiv.org/abs/2509.25127)
Keywords: generation
Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. We will make the PyTorch implementation publicly available.
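The equivalence mentioned above can be illustrated with a standard identity. The following is a sketch under an assumed Gaussian interpolation $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; the symbols $\alpha_t$, $\sigma_t$, and $v$ are notation chosen here, and the paper's own derivation may be organized differently.

```latex
\begin{align}
  \nabla_{x_t} \log p_t(x_t)
    &= -\frac{\mathbb{E}[\epsilon \mid x_t]}{\sigma_t}, \\
  v(x_t, t)
    &= \mathbb{E}\!\left[\dot{\alpha}_t x_0 + \dot{\sigma}_t \epsilon \,\middle|\, x_t\right]
     = \dot{\alpha}_t\, \mathbb{E}[x_0 \mid x_t] + \dot{\sigma}_t\, \mathbb{E}[\epsilon \mid x_t], \\
  \mathbb{E}[x_0 \mid x_t]
    &= \frac{x_t - \sigma_t\, \mathbb{E}[\epsilon \mid x_t]}{\alpha_t}, \\
  v(x_t, t)
    &= \frac{\dot{\alpha}_t}{\alpha_t}\, x_t
     - \sigma_t \left( \dot{\sigma}_t - \frac{\dot{\alpha}_t \sigma_t}{\alpha_t} \right)
       \nabla_{x_t} \log p_t(x_t).
\end{align}
```

The last line follows by substituting the third line into the second and then using the first. It shows that, under these assumptions, the flow-matching velocity and the diffusion score are affine functions of each other at every $t$, so a method expressed in terms of one can be rewritten in terms of the other.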

Title: TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Authors: Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.25143
Pdf URL: https://arxiv.org/pdf/2509.25143
Copy Paste: [[2509.25143]] TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models(https://arxiv.org/abs/2509.25143)
Keywords: generation
Abstract: Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than either no retrieval or textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded in real-world clinical practice; it reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a potentially promising direction for addressing this challenge.

Title: Chance-constrained Flow Matching for High-Fidelity Constraint-aware Generation

Authors: Jinhao Liang, Yixuan Sun, Anirban Samaddar, Sandeep Madireddy, Ferdinando Fioretto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25157
Pdf URL: https://arxiv.org/pdf/2509.25157
Copy Paste: [[2509.25157]] Chance-constrained Flow Matching for High-Fidelity Constraint-aware Generation(https://arxiv.org/abs/2509.25157)
Keywords: generation, generative
Abstract: Generative models excel at synthesizing high-fidelity samples from complex data distributions, but they often violate hard constraints arising from physical laws or task specifications. A common remedy is to project intermediate samples onto the feasible set; however, repeated projection can distort the learned distribution and induce a mismatch with the data manifold. Thus, recent multi-stage procedures attempt to defer projection to clean samples during sampling, but they increase algorithmic complexity and accumulate errors across steps. This paper addresses these challenges by proposing a novel training-free method, Chance-constrained Flow Matching (CCFM), that integrates stochastic optimization into the sampling process, enabling effective enforcement of hard constraints while maintaining high-fidelity sample generation. Importantly, CCFM guarantees feasibility in the same manner as conventional repeated projection, yet, despite operating directly on noisy intermediate samples, it is theoretically equivalent to projecting onto the feasible set defined by clean samples. This yields a sampler that mitigates distributional distortion. Empirical experiments show that CCFM outperforms current state-of-the-art constrained generative models in modeling complex physical systems governed by partial differential equations and molecular docking problems, delivering higher feasibility and fidelity.
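To make the "project onto the feasible set" remedy concrete, here is a minimal sketch of a Euclidean projection onto a single linear equality constraint, the kind of step that repeated-projection samplers apply to intermediate samples. It illustrates the conventional baseline only, not CCFM itself; the constraint (coordinates summing to 1) and the sample values are made up for illustration.

```python
def project_onto_hyperplane(x, a, b):
    """Euclidean projection of x onto the set {z : a . z = b}."""
    a_dot_x = sum(ai * xi for ai, xi in zip(a, x))
    a_norm_sq = sum(ai * ai for ai in a)
    scale = (a_dot_x - b) / a_norm_sq
    # Move x along the constraint normal a until the constraint holds exactly
    return [xi - scale * ai for xi, ai in zip(x, a)]


# Hypothetical intermediate sample that violates the constraint sum(z) = 1
x = [0.2, 0.5, 0.9]
projected = project_onto_hyperplane(x, a=[1.0, 1.0, 1.0], b=1.0)
```

The projected point satisfies the constraint, but it is generally not where the learned flow would have taken the sample, which is the distributional distortion the abstract refers to when such projections are applied repeatedly to noisy intermediates.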

Title: GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts

Authors: Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.25160
Pdf URL: https://arxiv.org/pdf/2509.25160
Copy Paste: [[2509.25160]] GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts(https://arxiv.org/abs/2509.25160)
Keywords: generation
Abstract: Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.
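The size of the gap reported for Gemini-2.5-Pro can be worked out directly from the figures in the abstract. The sketch below is plain arithmetic; the implied counts of correct answers assume that both accuracies are computed over all 1,319 samples, which the abstract does not state explicitly.

```python
text_acc = 95.22     # accuracy (%) on text-based GSM8K, from the abstract
visual_acc = 46.93   # accuracy (%) on GSM8K-V, from the abstract
num_samples = 1319   # benchmark size, from the abstract

# Absolute gap in percentage points and drop relative to the text score
gap_points = round(text_acc - visual_acc, 2)
relative_drop = round(100 * (text_acc - visual_acc) / text_acc, 1)

# Implied numbers of correct answers (assumption: same 1,319 items for both scores)
implied_text_correct = round(text_acc / 100 * num_samples)
implied_visual_correct = round(visual_acc / 100 * num_samples)

print(gap_points, relative_drop, implied_text_correct, implied_visual_correct)
```

Under that assumption, moving the same problems from text to images costs the best model roughly half of its correct answers.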

Title: Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Authors: Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25161
Pdf URL: https://arxiv.org/pdf/2509.25161
Copy Paste: [[2509.25161]] Rolling Forcing: Autoregressive Long Video Diffusion in Real Time(https://arxiv.org/abs/2509.25161)
Keywords: generation
Abstract: Streaming video generation, as one fundamental component in interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated stream videos over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
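The joint denoising scheme with progressively increasing noise levels can be pictured as a sliding window of frames. The following is a toy schedule simulation only, with no model and no pixels, and it makes several assumptions not stated in the abstract: integer noise levels, a window that starts already staggered, and one level removed per joint step. The attention sink and the distillation procedure are not modeled.

```python
from collections import deque


def rolling_denoise(num_output_frames, window=4):
    """Simulate which frame is emitted at each joint denoising step."""
    # Each entry is (frame_id, noise_level); `window` means pure noise, 0 means clean.
    frames = deque((i, i + 1) for i in range(window))
    next_id = window
    emitted = []
    while len(emitted) < num_output_frames:
        # One joint step lowers the noise level of every frame in the window.
        frames = deque((fid, level - 1) for fid, level in frames)
        fid, level = frames.popleft()       # the head frame is now clean
        assert level == 0
        emitted.append(fid)
        frames.append((next_id, window))    # a fresh pure-noise frame enters at the tail
        next_id += 1
    return emitted, list(frames)


emitted, remaining = rolling_denoise(num_output_frames=3)
```

Each step emits exactly one clean frame while later frames are still partially noisy, so neighboring frames are denoised together rather than strictly one after another.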

Title: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Authors: Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25162
Pdf URL: https://arxiv.org/pdf/2509.25162
Copy Paste: [[2509.25162]] Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models(https://arxiv.org/abs/2509.25162)
Keywords: generation
Abstract: In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256×256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.

Title: GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models

Authors: Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.25170
Pdf URL: https://arxiv.org/pdf/2509.25170
Copy Paste: [[2509.25170]] GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models(https://arxiv.org/abs/2509.25170)
Keywords: generation
Abstract: The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms have been proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms need to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a "flow matching model within a flow matching model" to sample Markov transitions. As we show in this work, this "inner" flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making them a simple, drop-in solution for inference-time scaling of flow and diffusion models.

Title: TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion

Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2509.25171
Pdf URL: https://arxiv.org/pdf/2509.25171
Copy Paste: [[2509.25171]] TR2-D2: Tree Search Guided Trajectory-Aware Fine-Tuning for Discrete Diffusion(https://arxiv.org/abs/2509.25171)
Keywords: generation
Abstract: Reinforcement learning with stochastic optimal control offers a promising framework for diffusion fine-tuning, where a pre-trained diffusion model is optimized to generate paths that lead to a reward-tilted distribution. While these approaches enable optimization without access to explicit samples from the optimal distribution, they require training on rollouts under the current fine-tuned model, making them susceptible to reinforcing sub-optimal trajectories that yield poor rewards. To overcome this challenge, we introduce TRee Search Guided TRajectory-Aware Fine-Tuning for Discrete Diffusion (TR2-D2), a novel framework that optimizes reward-guided discrete diffusion trajectories with tree search to construct replay buffers for trajectory-aware fine-tuning. These buffers are generated using Monte Carlo Tree Search (MCTS) and subsequently used to fine-tune a pre-trained discrete diffusion model under a stochastic optimal control objective. We validate our framework on single- and multi-objective fine-tuning of biological sequence diffusion models, highlighting the overall effectiveness of TR2-D2 for reliable reward-guided fine-tuning in discrete sequence generation.
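As background for the tree-search component, the sketch below shows the classic UCT rule commonly used in Monte Carlo Tree Search to choose which child to expand. It is a generic illustration: the abstract does not say which selection rule TR2-D2 uses, and the dictionary layout, the exploration constant `c`, and the example statistics are assumptions made here.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """Classic UCT: mean reward plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits),
    )


# Hypothetical statistics for three partial denoising trajectories
children = [
    {"name": "a", "reward": 3.0, "visits": 5},
    {"name": "b", "reward": 1.0, "visits": 1},
    {"name": "c", "reward": 2.0, "visits": 4},
]
best = select_child(children, parent_visits=10)
```

Trajectories collected by such a search can then be stored in a replay buffer, so fine-tuning is not limited to rollouts from the current, possibly sub-optimal, model.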

Title: Personalized Vision via Visual In-Context Learning

Authors: Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25172
Pdf URL: https://arxiv.org/pdf/2509.25172
Copy Paste: [[2509.25172]] Personalized Vision via Visual In-Context Learning(https://arxiv.org/abs/2509.25172)
Keywords: generation
Abstract: Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

Title: GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Authors: Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.25178
Pdf URL: https://arxiv.org/pdf/2509.25178
Copy Paste: [[2509.25178]] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs(https://arxiv.org/abs/2509.25178)
Keywords: generation
Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

Title: DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

Authors: Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25180
Pdf URL: https://arxiv.org/pdf/2509.25180
Copy Paste: [[2509.25180]] DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space(https://arxiv.org/abs/2509.25180)
Keywords: generation
Abstract: Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: this https URL.

Title: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder

Authors: Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.25182
Pdf URL: https://arxiv.org/pdf/2509.25182
Copy Paste: [[2509.25182]] DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder(https://arxiv.org/abs/2509.25182)
Keywords: generation
Abstract: We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: this https URL.

Title: PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos

Authors: Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25183
Pdf URL: https://arxiv.org/pdf/2509.25183
Copy Paste: [[2509.25183]] PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos(https://arxiv.org/abs/2509.25183)
Keywords: generative
Abstract: We present PAD3R, a method for reconstructing deformable 3D objects from casually captured, unposed monocular videos. Unlike existing approaches, PAD3R handles long video sequences featuring substantial object deformation, large-scale camera movement, and limited view coverage that typically challenge conventional systems. At its core, our approach trains a personalized, object-centric pose estimator, supervised by a pre-trained image-to-3D model. This guides the optimization of deformable 3D Gaussian representation. The optimization is further regularized by long-term 2D point tracking over the entire input video. By combining generative priors and differentiable rendering, PAD3R reconstructs high-fidelity, articulated 3D representations of objects in a category-agnostic way. Extensive qualitative and quantitative results show that PAD3R is robust and generalizes well across challenging scenarios, highlighting its potential for dynamic scene understanding and 3D content creation.

Title: FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

Authors: Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25187
Pdf URL: https://arxiv.org/pdf/2509.25187
Copy Paste: [[2509.25187]] FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation(https://arxiv.org/abs/2509.25187)
Keywords: generation
Abstract: In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: this https URL
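
The FlashI2V abstract mentions "high-frequency magnitude features obtained by the Fourier Transform" without giving details. As a rough illustration of that general idea only, the sketch below computes a 1D discrete Fourier transform and keeps the magnitudes of bins above a cutoff. The function names, the 1D setting, and the hard `cutoff` mask are assumptions made for this example; they are not taken from the paper, which operates on image or video latents.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def high_freq_magnitude(signal, cutoff):
    """Return per-bin magnitudes, zeroing bins at or below `cutoff`.

    `cutoff` is a hypothetical parameter: a bin's frequency is measured as
    its distance from the DC bin, min(k, n - k), so that positive and
    negative frequencies are treated symmetrically.
    """
    n = len(signal)
    spectrum = dft(signal)
    magnitudes = []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)
        magnitudes.append(abs(coeff) if freq > cutoff else 0.0)
    return magnitudes


if __name__ == "__main__":
    flat = [1.0] * 8            # no detail: only the DC bin is non-zero
    detail = [1.0, -1.0] * 4    # fastest possible oscillation for n = 8
    print(high_freq_magnitude(flat, cutoff=1))
    print(high_freq_magnitude(detail, cutoff=1))
```

A constant input yields (numerically) zero high-frequency magnitude, while a rapidly alternating input concentrates its energy in the highest bin, which is the sense in which such features track fine detail.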

Title: Visual Jigsaw Post-Training Improves MLLMs

Authors: Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.25190
Pdf URL: https://arxiv.org/pdf/2509.25190
Copy Paste: [[2509.25190]] Visual Jigsaw Post-Training Improves MLLMs(https://arxiv.org/abs/2509.25190)
Keywords: generative
Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned and shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: this https URL
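
The ordering task described in the Visual Jigsaw abstract (partition, shuffle, have the model state the permutation, score it automatically) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: pieces are stand-in labels rather than image patches, the reply format ("2, 0, 1") is hypothetical, and the exact-match 0/1 reward is one simple choice of verifiable reward; the paper may use a different answer format or partial credit.

```python
import random


def make_jigsaw(pieces, seed):
    """Shuffle `pieces` and return (shuffled, answer).

    answer[i] is the position in `shuffled` that holds original piece i,
    so reading `shuffled` in the order given by `answer` restores the input.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)  # order[j] = original index of the piece in slot j
    shuffled = [pieces[i] for i in order]
    answer = [order.index(i) for i in range(len(pieces))]
    return shuffled, answer


def parse_permutation(text, n):
    """Parse a reply such as '2, 0, 1'; return None if it is not a permutation of 0..n-1."""
    try:
        perm = [int(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return None
    return perm if sorted(perm) == list(range(n)) else None


def reward(text, answer):
    """Verifiable reward: 1.0 for an exactly correct permutation, else 0.0."""
    perm = parse_permutation(text, len(answer))
    return 1.0 if perm == answer else 0.0


if __name__ == "__main__":
    shuffled, answer = make_jigsaw(["top-left", "top-right", "bottom-left", "bottom-right"], seed=0)
    print(shuffled, answer)
    print(reward(", ".join(str(i) for i in answer), answer))
```

Because the ground-truth permutation is generated together with the puzzle, no human annotation is needed to compute the reward, which is the property the abstract highlights.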