2025-10-17

Title: CoLoR-GAN: Continual Few-Shot Learning with Low-Rank Adaptation in Generative Adversarial Networks

Authors: Munsif Ali, Leonardo Rossi, Massimo Bertozzi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13869
Pdf URL: https://arxiv.org/pdf/2510.13869
Copy Paste: [[2510.13869]] CoLoR-GAN: Continual Few-Shot Learning with Low-Rank Adaptation in Generative Adversarial Networks(https://arxiv.org/abs/2510.13869)
Keywords: generative
Abstract: Continual learning (CL) in the context of Generative Adversarial Networks (GANs) remains a challenging problem, particularly when it comes to learning from few-shot (FS) samples without catastrophic forgetting. The most effective current state-of-the-art (SOTA) methods, like LFS-GAN, introduce a non-negligible quantity of new weights at each training iteration, which becomes significant in the long term. For this reason, this paper introduces continual few-shot learning with low-rank adaptation in GANs, named CoLoR-GAN, a framework designed to handle both FS and CL together, leveraging low-rank tensors to efficiently adapt the model to target tasks while further reducing the number of parameters required. Applying a vanilla LoRA implementation already yields good results. To reduce the size of the adapters even further, we push the limits of LoRA by introducing a LoRA in LoRA (LLoRA) technique for convolutional layers. Finally, aware of how critical the choice of LoRA hyperparameters is, we provide an empirical study to easily find the best ones. We demonstrate the effectiveness of CoLoR-GAN through experiments on several benchmark CL and FS tasks and show that our model is efficient, reaching SOTA performance with vastly reduced resources. Source code is available on GitHub.
摘要：生成对抗网络（GAN）背景下的持续学习（CL）仍然是一个具有挑战性的问题，特别是在需要从少样本（FS）数据中学习而不发生灾难性遗忘时。当前最有效的最先进（SOTA）方法，例如 LFS-GAN，在每次训练迭代中引入了不可忽略的新权重数量，从长远来看，这一数量将变得相当可观。为此，本文提出了 GAN 中基于低秩自适应的持续少样本学习（continual few-shot learning with low-rank adaptation），名为 CoLoR-GAN，这是一个旨在同时处理 FS 和 CL 的框架，利用低秩张量有效地使模型适应目标任务，同时进一步减少所需参数的数量。应用普通的 LoRA 实现已经让我们获得了相当好的结果。为了进一步优化适配器的大小，我们挑战了 LoRA 的限制，为卷积层引入了 LoRA in LoRA (LLoRA) 技术。最后，意识到 LoRA 超参数选择的关键性，我们提供了一项实证研究，以便轻松找到最佳参数。我们通过几个基准 CL 和 FS 任务的实验证明了 CoLoR-GAN 的有效性，并表明我们的模型是高效的，达到了 SOTA 性能，但资源数量大大减少。源代码可在 GitHub 上找到。
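To make the parameter-saving argument concrete, here is a minimal sketch of a vanilla LoRA update on a single dense weight matrix. It is not the paper's LLoRA scheme for convolutional layers; the function names, shapes, and initialisation are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, scale=1.0):
    """Return W + scale * (B @ A); the base weight W itself stays frozen."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)]
            for wr, dr in zip(w, delta)]


def lora_param_count(d_out, d_in, rank):
    """Trainable adapter parameters: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical sizes chosen only for illustration.
d_out, d_in, rank = 8, 6, 2
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable factor A
b = [[0.0] * rank for _ in range(d_out)]                              # trainable factor B, zero-initialised

w_eff = lora_effective_weight(w, a, b)
full_params = d_out * d_in
adapter_params = lora_param_count(d_out, d_in, rank)
print(full_params, adapter_params)
```

Because B starts at zero, the adapted weight equals the frozen weight before any training, and only the two small factors would need to be stored per task.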

Title: Joint Discriminative-Generative Modeling via Dual Adversarial Training

Authors: Xuwang Yin, Claire Zhang, Julie Steele, Nir Shavit, Tony T. Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13872
Pdf URL: https://arxiv.org/pdf/2510.13872
Copy Paste: [[2510.13872]] Joint Discriminative-Generative Modeling via Dual Adversarial Training(https://arxiv.org/abs/2510.13872)
Keywords: generation, generative
Abstract: Simultaneously achieving robust classification and high-fidelity generative modeling within a single framework presents a significant challenge. Hybrid approaches, such as Joint Energy-Based Models (JEM), interpret classifiers as EBMs but are often limited by the instability and poor sample quality inherent in SGLD-based training. We address these limitations by proposing a novel training framework that integrates adversarial training (AT) principles for both discriminative robustness and stable generative learning. The proposed method introduces three key innovations: (1) the replacement of SGLD-based JEM learning with a stable, AT-based approach that optimizes the energy function by discriminating between real data and PGD-generated contrastive samples using the BCE loss; (2) synergistic adversarial training for the discriminative component that enhances classification robustness while eliminating the need for explicit gradient penalties; and (3) a two-stage training procedure to resolve the incompatibility between batch normalization and EBM training. Experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that our method substantially improves adversarial robustness over existing hybrid models while maintaining competitive generative performance. On ImageNet, when optimized for generative modeling, our model's generative fidelity surpasses that of BigGAN and approaches diffusion models, representing the first MCMC-based EBM approach to achieve high-quality generation on complex, high-resolution datasets. Our approach addresses key stability issues that have limited JEM scaling and demonstrates that adversarial training can serve as an effective foundation for unified frameworks capable of generating and robustly classifying visual data.
摘要：在单一框架内同时实现稳健的分类和高保真生成建模提出了重大挑战。混合方法，例如基于联合能量的模型 (JEM)，将分类器解释为 EBM，但通常受到基于 SGLD 的训练固有的不稳定性和较差的样本质量的限制。我们通过提出一种新颖的训练框架来解决这些局限性，该框架集成了对抗性训练（AT）原则，以实现判别鲁棒性和稳定的生成学习。所提出的方法引入了三个关键创新：（1）用基于 AT 的稳定方法取代基于 SGLD 的 JEM 学习，该方法通过使用 BCE 损失区分真实数据和 PGD 生成的对比样本来优化能量函数；（2）对判别部分进行协同对抗训练，增强分类鲁棒性，同时消除显式梯度惩罚的需要； (3) 两阶段训练过程，以解决批量归一化和 EBM 训练之间的不兼容性。 CIFAR-10、CIFAR-100 和 ImageNet 上的实验表明，我们的方法比现有混合模型显着提高了对抗鲁棒性，同时保持了有竞争力的生成性能。在 ImageNet 上，当针对生成建模进行优化时，我们的模型的生成保真度超越了 BigGAN 并接近扩散模型，代表了第一个基于 MCMC 的 EBM 方法，可在复杂、高分辨率的数据集上实现高质量生成。我们的方法解决了限制 JEM 扩展的关键稳定性问题，并证明对抗性训练可以作为能够生成和稳健分类视觉数据的统一框架的有效基础。

Title: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Authors: Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13999
Pdf URL: https://arxiv.org/pdf/2510.13999
Copy Paste: [[2510.13999]] REAP the Experts: Why Pruning Prevails for One-Shot MoE compression(https://arxiv.org/abs/2510.13999)
Keywords: generation, generative
Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a "functional subspace collapse", due to the loss of the router's independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
摘要：稀疏激活的专家混合 (SMoE) 模型提供高效的预训练和低延迟，但其大量参数会产生大量内存开销，从而激发了对专家压缩的研究。与最近支持专家在歧视性基准上合并的研究结果相反，我们证明专家剪枝是生成任务的优越策略。我们证明，由于路由器失去了对专家的独立的、依赖于输入的控制，合并会导致“功能子空间崩溃”，从而引入不可约的错误。利用这一见解，我们提出了路由器加权专家激活修剪（REAP），这是一种新颖的修剪标准，考虑了路由器门值和专家激活规范。在从 20B 到 1T 参数的各种 SMoE 模型中，REAP 在生成基准上始终优于合并和其他修剪方法，尤其是在 50% 压缩时。值得注意的是，我们的方法使用 Qwen3-Coder-480B 和 Kimi-K2 实现了代码生成和工具调用任务的近乎无损压缩，即使在修剪了 50% 的专家之后也是如此。
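The abstract states only that REAP combines router gate values with expert activation norms. The sketch below shows one way such a criterion could be computed; the exact score (mean of gate value times output norm over the tokens routed to an expert) and all names are assumptions for illustration, not the authors' definition.

```python
import math


def expert_saliency(gates, outputs):
    """Score each expert from router gates and expert output norms.

    gates[t][e]   -- gate value of expert e for token t (0.0 if not routed)
    outputs[t][e] -- output vector of expert e for token t
    Assumed score: mean over routed tokens of gate * L2 norm of the output.
    """
    n_experts = len(gates[0])
    scores = []
    for e in range(n_experts):
        terms = [g[e] * math.sqrt(sum(v * v for v in o[e]))
                 for g, o in zip(gates, outputs) if g[e] > 0.0]
        scores.append(sum(terms) / len(terms) if terms else 0.0)
    return scores


def experts_to_prune(scores, fraction):
    """Return the indices of the lowest-scoring experts, sorted ascending."""
    n_prune = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda e: scores[e])
    return sorted(order[:n_prune])


# Toy data: 3 tokens, 4 experts, top-2 routing; every expert emits a fixed vector.
gates = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
expert_vectors = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0], [0.0, 1.0]]
outputs = [expert_vectors for _ in gates]

scores = expert_saliency(gates, outputs)
pruned = experts_to_prune(scores, fraction=0.5)
print(scores, pruned)
```

Pruning removes whole experts and leaves the router's per-input weighting of the survivors untouched, which is the property the abstract contrasts with merging.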

Title: NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Authors: Junjie Nan, Jianing Li, Wei Chen, Mingkun Zhang, Xueqi Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14025
Pdf URL: https://arxiv.org/pdf/2510.14025
Copy Paste: [[2510.14025]] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations(https://arxiv.org/abs/2510.14025)
Keywords: generation
Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
摘要：对抗性净化在对抗对抗性图像扰动方面取得了巨大成功，这些扰动通常被认为是可加的。然而，模糊、遮挡和扭曲等非加性对抗性扰动在现实世界中也很常见。在这种扰动下，现有的对抗性纯化方法的效果要差得多，因为它们是为适应加性性质而设计的。在本文中，我们提出了一种名为 NAPPure 的扩展对抗性净化框架，它可以进一步处理非加性扰动。具体来说，我们首先建立对抗性图像的生成过程，然后通过似然最大化来解开底层的干净图像和扰动参数。 GTSRB 和 CIFAR-10 数据集上的实验表明，NAPPure 显着提高了图像分类模型针对非加性扰动的鲁棒性。

Title: Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Authors: Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14032
Pdf URL: https://arxiv.org/pdf/2510.14032
Copy Paste: [[2510.14032]] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding(https://arxiv.org/abs/2510.14032)
Keywords: generation
Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond the context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0% to 5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at this https URL.
摘要：由于难以处理超出上下文窗口的密集视频标记并保留长期顺序信息，因此对长视频的理解和推理对大型视频语言模型（LVLM）提出了重大挑战。检索增强生成 (RAG) 已证明在处理大型语言模型 (LLM) 的长上下文方面的有效性；然而，将 RAG 应用于长视频面临着挑战，例如时间依赖性被破坏以及包含可能妨碍准确推理的不相关信息。为了解决这些限制，我们提出了 Vgent，一种新颖的基于图的检索推理增强生成框架，用于增强 LVLM 以实现长视频理解。我们的方法引入了两个关键创新：（i）它通过结构化图来表示视频，并保留视频剪辑之间的语义关系以提高检索效率。 (ii) 它引入了一个中间推理步骤来减轻 LVLM 的推理限制，该步骤利用结构化验证来减少检索噪声并促进跨剪辑的相关信息的显式聚合，从而产生更准确和上下文感知的响应。我们在三个长视频理解基准上使用各种开源 LVLM 全面评估我们的框架。与 MLVU 上的基本模型相比，我们的方法的整体性能提高了 $3.0\%\sim 5.4\%$，并且比最先进的视频 RAG 方法提高了 $8.6\%$。我们的代码可通过此 https URL 公开获取。

Title: CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

Authors: Guangyi Chen, Yunlong Deng, Peiyuan Zhu, Yan Li, Yifan Sheng, Zijian Li, Kun Zhang
Subjects: cs.LG, cs.MS
Abstract URL: https://arxiv.org/abs/2510.14049
Pdf URL: https://arxiv.org/pdf/2510.14049
Copy Paste: [[2510.14049]] CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations(https://arxiv.org/abs/2510.14049)
Keywords: generation
Abstract: Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally facing a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Project page: this https URL. Dataset: this https URL.
摘要：因果表示学习（CRL）旨在揭示数据生成过程并识别潜在的因果变量和关系，由于需要已知的真实因果变量和因果结构，其评估本质上仍然具有挑战性。现有的评估通常依赖于简单的合成数据集或现实世界任务的下游性能，通常在现实性和评估精度之间陷入困境。在本文中，我们使用高保真模拟视觉数据引入了 CRL 的新基准，该数据既保留了真实的视觉复杂性，更重要的是，保留了对真实因果生成过程的访问。该数据集包含约 20 万张图像和 300 万个视频帧，涉及四个领域的 24 个子场景：静态图像生成、动态物理模拟、机器人操作和交通状况分析。这些场景范围从静态到动态设置，从简单到复杂的结构，以及单代理到多代理交互，提供了一个全面的测试平台，有望弥合严格评估和实际适用性之间的差距。此外，我们提供对底层因果结构的灵活访问，允许用户修改或配置它们以符合 CRL 中所需的假设，例如可用的域标签、时间依赖性或干预历史。利用这一基准，我们跨不同范式评估了代表性的 CRL 方法，并提供了实证见解，以帮助从业者和新手选择或扩展适当的 CRL 框架，以正确解决可以从 CRL 角度受益的特定类型的实际问题。欢迎访问我们的：项目页面：此 https URL，数据集：此 https URL。

Title: Synchronization of Multiple Videos

Authors: Avihai Naaman, Ron Shapira Weber, Oren Freifeld
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14051
Pdf URL: https://arxiv.org/pdf/2510.14051
Copy Paste: [[2510.14051]] Synchronization of Multiple Videos(https://arxiv.org/abs/2510.14051)
Keywords: generative
Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at this https URL
摘要：同步从同一场景中的多个摄像机同时捕获的视频通常很容易，并且通常只需要简单的时间平移。然而，由于不同的主题、背景和非线性时间错位，同步来自不同场景的视频或最近的生成人工智能视频带来了更加复杂的挑战。我们提出了时间原型学习（TPL），这是一种基于原型的框架，它从由任何各种预训练模型提取的高维嵌入构建共享的、紧凑的一维表示。 TPL 通过学习锚定关键动作阶段的统一原型序列来稳健地对齐视频，从而避免详尽的成对匹配。我们的实验表明，TPL 提高了跨不同数据集的同步准确性、效率和鲁棒性，包括细粒度帧检索和相位分类任务。重要的是，TPL 是第一种缓解描述相同动作的多个生成 AI 视频中的同步问题的方法。我们的代码和新的多视频同步数据集可从此 https URL 获取

Title: Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Authors: Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.14081
Pdf URL: https://arxiv.org/pdf/2510.14081
Copy Paste: [[2510.14081]] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images(https://arxiv.org/abs/2510.14081)
Keywords: generative
Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
摘要：我们提出了一种新颖的零镜头管道，用于从一些非结构化手机图像创建超现实、保留身份的 3D 头像。现有的方法面临着几个挑战：单视图方法存在几何不一致和幻觉，降低了身份保存的性能，而基于合成数据训练的模型无法捕获皮肤皱纹和细发等高频细节，限制了真实性。我们的方法引入了两个关键贡献：(1) 生成规范化模块，将多个非结构化视图处理为标准化、一致的表示形式；(2) 基于变压器的模型，该模型在新的大规模数据集上进行训练，该数据集是从真人的圆顶捕获中获得的高保真高斯喷溅化身。这种“捕获、规范化、Splat”流程可以从非结构化照片中生成具有引人注目的真实感和强大身份保留的静态四分之一身头像。

Title: Bridging Diffusion Posterior Sampling and Monte Carlo methods: a survey

Authors: Yazid Janati, Alain Durmus, Jimmy Olsson, Eric Moulines
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.14114
Pdf URL: https://arxiv.org/pdf/2510.14114
Copy Paste: [[2510.14114]] Bridging Diffusion Posterior Sampling and Monte Carlo methods: a survey(https://arxiv.org/abs/2510.14114)
Keywords: generative
Abstract: Diffusion models enable the synthesis of highly accurate samples from complex distributions and have become foundational in generative modeling. Recently, they have demonstrated significant potential for solving Bayesian inverse problems by serving as priors. This review offers a comprehensive overview of current methods that leverage pre-trained diffusion models alongside Monte Carlo methods to address Bayesian inverse problems without requiring additional training. We show that these methods primarily employ a twisting mechanism for the intermediate distributions within the diffusion process, guiding the simulations toward the posterior distribution. We describe how various Monte Carlo methods are then used to aid in sampling from these twisted distributions.
摘要：扩散模型能够从复杂的分布中合成高度准确的样本，并已成为生成建模的基础。最近，他们展示了通过充当先验来解决贝叶斯逆问题的巨大潜力。这篇综述全面概述了当前的方法，这些方法利用\emph{预训练}扩散模型和蒙特卡罗方法来解决贝叶斯逆问题，而无需额外的训练。我们证明这些方法主要对扩散过程中的中间分布采用 \emph{扭曲} 机制，引导模拟走向后验分布。我们描述了如何使用各种蒙特卡罗方法来帮助从这些扭曲分布中进行采样。
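As a reading aid, the twisting idea mentioned in the abstract can be written out in commonly used notation. The symbols are assumptions introduced here, not taken from the survey: $p_t$ is the marginal of the pre-trained diffusion at step $t$, $g(y \mid x_0)$ is the likelihood of the observation $y$, and $p_{0 \mid t}$ is the denoising distribution.

```latex
\begin{align}
  \pi_0(x_0) &\propto p_0(x_0)\, g(y \mid x_0), \\
  \pi_t(x_t) &\propto p_t(x_t)\, g_t(x_t), \qquad
  g_t(x_t) = \int g(y \mid x_0)\, p_{0 \mid t}(x_0 \mid x_t)\, \mathrm{d}x_0 .
\end{align}
```

Under these assumptions the twisting function $g_t$ is generally intractable, so it has to be approximated, and Monte Carlo methods are one way to correct for the approximation error when sampling.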

Title: Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Authors: Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel, Yiwei Zhao, Ryan Burgert, Mingming He, Oliver Hermann, Oliver Pilarski, Rahul Garg, Paul Debevec, Ning Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14179
Pdf URL: https://arxiv.org/pdf/2510.14179
Copy Paste: [[2510.14179]] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures(https://arxiv.org/abs/2510.14179)
Keywords: generation
Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), with lighting variability obtained from a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: this https URL.
摘要：我们引入了一个框架，通过新颖的定制数据管道，在视频扩散模型中实现多视图角色一致性和 3D 摄像机控制。我们使用记录的体积捕捉性能来训练角色一致性组件，这些性能通过 4D 高斯泼溅 (4DGS) 使用不同的相机轨迹重新渲染，通过视频重新照明模型获得照明变化。我们根据这些数据微调最先进的开源视频扩散模型，以提供强大的多视图身份保存、精确的摄像机控制和照明适应性。我们的框架还支持虚拟生产的核心功能，包括使用两种方法的多主题生成：联合训练和噪声混合，后者能够在推理时有效组合独立定制的模型；它还实现了场景和现实生活视频的定制以及定制过程中对运动和空间布局的控制。大量的实验表明，视频质量得到了改善，个性化准确性更高，摄像机控制和灯光适应性也得到了增强，从而推动了视频生成与虚拟制作的集成。我们的项目页面位于：此 https URL。

Title: Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation

Authors: Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.14190
Pdf URL: https://arxiv.org/pdf/2510.14190
Copy Paste: [[2510.14190]] Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation(https://arxiv.org/abs/2510.14190)
Keywords: generation
Abstract: Diffusion models excel at generation, but their latent spaces are not explicitly organized for interpretable control. We introduce ConDA (Contrastive Diffusion Alignment), a framework that applies contrastive learning within diffusion embeddings to align latent geometry with system dynamics. Motivated by recent advances showing that contrastive objectives can recover more disentangled and structured representations, ConDA organizes diffusion latents such that traversal directions reflect underlying dynamical factors. Within this contrastively structured space, ConDA enables nonlinear trajectory traversal that supports faithful interpolation, extrapolation, and controllable generation. Across benchmarks in fluid dynamics, neural calcium imaging, therapeutic neurostimulation, and facial expression, ConDA produces interpretable latent representations with improved controllability compared to linear traversals and conditioning-based baselines. These results suggest that diffusion latents encode dynamics-relevant structure, but exploiting this structure requires latent organization and traversal along the latent manifold.
摘要：扩散模型擅长生成，但它们的潜在空间没有明确组织以进行可解释的控制。我们引入了 ConDA（对比扩散对齐），这是一个在扩散嵌入中应用对比学习来将潜在几何与系统动力学对齐的框架。最近的进展表明对比目标可以恢复更加解缠结和结构化的表示，ConDA 组织了扩散潜伏，使得遍历方向反映了潜在的动态因素。在这个对比结构的空间中，ConDA 能够实现非线性轨迹遍历，支持忠实插值、外推和可控生成。在流体动力学、神经钙成像、治疗性神经刺激和面部表情等基准方面，ConDA 产生可解释的潜在表示，与线性遍历和基于条件的基线相比，其可控性得到了改善。这些结果表明，扩散潜伏编码了动力学相关的结构，但利用这种结构需要潜在的组织和沿着潜在流形的遍历。

Title: LOTA: Bit-Planes Guided AI-Generated Image Detection

Authors: Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, Jie Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14230
Pdf URL: https://arxiv.org/pdf/2510.14230
Copy Paste: [[2510.14230]] LOTA: Bit-Planes Guided AI-Generated Image Detection(https://arxiv.org/abs/2510.14230)
Keywords: generation
Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (11.9% ↑) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at this https URL.
摘要：GAN 和 Diffusion 模型的快速发展使得区分 AI 生成的图像和真实图像变得更加困难。最近的研究经常使用基于图像的重建误差作为确定图像是否是人工智能生成的重要特征。然而，这些方法通常会产生较高的计算成本，并且也无法捕获原始图像中存在的固有噪声特征。为了解决这些问题，我们通过使用基于位平面的图像处理创新地改进了错误提取，因为较低的位平面确实代表了图像中的噪声模式。我们引入了有效的位平面引导噪声图像生成，并利用各种图像归一化策略，包括缩放和阈值化。然后，为了放大噪声信号以便更容易地进行人工智能生成的图像检测，我们设计了一个最大梯度补丁选择，它应用多方向梯度来计算噪声分数并选择分数最高的区域。最后，我们提出了一种轻量级且有效的分类头，并探索了两种不同的结构：基于噪声的分类器和噪声引导的分类器。 GenImage 基准上的大量实验证明了我们的方法的出色性能，达到了 \textbf{98.9\%} (\textbf{11.9}\%~$\uparrow$) 的平均精度，并显示出出色的跨生成器泛化能力。特别是，我们的方法从 GAN 到 Diffusion 的准确率超过 98.2%，从 Diffusion 到 GAN 的准确率超过 99.2%。此外，它以毫秒级执行错误提取，比现有方法快近一百倍。代码位于此 https URL。
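The bit-plane idea in the abstract can be illustrated on a tiny 8-bit grayscale patch. This is a minimal sketch, not the LOTA pipeline: keeping the two lowest bits and rescaling them to the 0-255 range are assumptions chosen for illustration, and the function names are hypothetical.

```python
def bit_plane(image, k):
    """Return the k-th bit plane of an 8-bit image (k=0 is the least significant bit)."""
    return [[(px >> k) & 1 for px in row] for row in image]


def low_bit_residual(image, n_low=2):
    """Keep only the n_low least significant bits of every pixel."""
    mask = (1 << n_low) - 1
    return [[px & mask for px in row] for row in image]


def scale_to_8bit(residual, n_low=2):
    """Stretch the low-bit residual to the full 0..255 range (scaling normalisation)."""
    max_val = (1 << n_low) - 1
    return [[px * 255 // max_val for px in row] for row in residual]


# A nearly flat patch: the high bits agree, the low bits carry the noise.
image = [
    [200, 201, 203],
    [198, 202, 200],
    [199, 200, 204],
]

lsb = bit_plane(image, 0)
noise = scale_to_8bit(low_bit_residual(image))
print(lsb)
print(noise)
```

Only shifts and masks are involved, which is consistent with the abstract's point that this kind of extraction is far cheaper than reconstruction-based error maps.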

Title: Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Authors: Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.14232
Pdf URL: https://arxiv.org/pdf/2510.14232
Copy Paste: [[2510.14232]] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models(https://arxiv.org/abs/2510.14232)
Keywords: generation
Abstract: Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model, gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
摘要：竞争性编程已成为评估大型语言模型（LLM）推理和解决问题能力的严格基准。国际信息学奥林匹克竞赛 (IOI) 是竞争性编程领域最负盛名的年度竞赛之一，已成为比较人类和人工智能编程能力的重要基准。虽然一些专有模型据称在 IOI 上实现了金牌级别的性能（通常采用未公开的方法），但使用开放重量模型实现类似的结果仍然是一个重大挑战。在本文中，我们提出了 \gencluster，一个可扩展且可重复的测试时计算框架，它使用开放权重模型获得了 IOI 黄金级性能。它结合了大规模生成、行为聚类、排名和循环提交策略，可在有限的验证预算下有效地探索多样化的解决方案空间。我们的实验表明，我们提出的方法的性能与可用计算一致，缩小了开放系统和封闭系统之间的差距。值得注意的是，我们将证明 GenCluster 可以通过开放权重模型 gpt-oss-120b 首次在 IOI 2025 上获得金牌，为法学硕士推理的透明和可重复评估树立新基准。
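The abstract names behavioral clustering, ranking, and round-robin submission without giving details. The following is a rough sketch of how those pieces could fit together: candidates are grouped by their outputs on shared test inputs, clusters are ranked by size (a stand-in assumption for the paper's ranking step), and submissions cycle through the clusters. All names are hypothetical.

```python
from collections import defaultdict


def cluster_by_behavior(solutions, test_inputs):
    """Group candidate solutions that produce identical outputs on the test inputs."""
    clusters = defaultdict(list)
    for name, fn in solutions.items():
        signature = tuple(fn(x) for x in test_inputs)
        clusters[signature].append(name)
    return list(clusters.values())


def round_robin(ranked_clusters, budget):
    """Pick one candidate per cluster in turn until the submission budget is spent."""
    order = []
    depth = 0
    while len(order) < budget and any(depth < len(c) for c in ranked_clusters):
        for cluster in ranked_clusters:
            if depth < len(cluster) and len(order) < budget:
                order.append(cluster[depth])
        depth += 1
    return order


# Five toy "generated programs"; three double the input, two square it.
solutions = {
    "a": lambda x: x * 2,
    "b": lambda x: x + x,
    "c": lambda x: x ** 2,
    "d": lambda x: 2 * x,
    "e": lambda x: x * x,
}

clusters = cluster_by_behavior(solutions, test_inputs=[1, 2, 3])
ranked = sorted(clusters, key=len, reverse=True)  # assumed ranking: largest cluster first
submissions = round_robin(ranked, budget=4)
print(ranked, submissions)
```

Cycling across clusters spends a limited submission budget on behaviorally distinct candidates before trying near-duplicates of one already submitted.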

Title: PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Authors: Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14241
Pdf URL: https://arxiv.org/pdf/2510.14241
Copy Paste: [[2510.14241]] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis(https://arxiv.org/abs/2510.14241)
Keywords: generative
Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis (PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at this https URL.
摘要：受操纵媒体的兴起使深度伪造成为一种特别阴险的威胁，涉及各种生成操纵，例如口型同步修改、面部交换和头像驱动的面部合成。传统的检测方法主要依赖于手动设计的音素-视位对齐阈值、基本的帧级一致性检查或单峰检测策略，无法充分识别由 GAN、扩散模型和神经渲染技术等先进生成模型生成的现代深度赝品。这些先进技术生成近乎完美的单个帧，但无意中产生了传统检测器经常忽视的微小时间差异。我们提出了一种新颖的多模态视听框架，音素时间和身份动态分析（PIA），结合了语言、动态面部运动和面部识别线索来解决这些限制。我们利用音素序列、嘴唇几何数据和高级面部身份嵌入。这种集成方法通过识别多种互补模式之间的不一致，显着提高了对细微深度伪造更改的检测。代码可在此 https URL 获取

Title: Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Authors: Liao Shen, Wentao Jiang, Yiran Zhu, Tiezheng Ge, Zhiguo Cao, Bo Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14255
Pdf URL: https://arxiv.org/pdf/2510.14255
Copy Paste: [[2510.14255]] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization(https://arxiv.org/abs/2510.14255)
Keywords: generation
Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation accounts for a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at this https URL.

Title: Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Authors: Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14256
Pdf URL: https://arxiv.org/pdf/2510.14256
Copy Paste: [[2510.14256]] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning(https://arxiv.org/abs/2510.14256)
Keywords: generation
Abstract: While recent methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
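
Illustrative sketch (not taken from the paper): the abstract mentions a GRPO variant without giving details. The snippet below shows only the group-relative advantage normalization commonly associated with standard GRPO, where each sampled output's reward is compared with the mean and spread of its own group. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration; the multi-human-specific changes in Identity-GRPO are not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: uses the population standard deviation and a small
    eps to avoid division by zero when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three videos sampled for one prompt, scored by some reward model.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Samples scoring above the group mean get positive advantages and those below get negative ones, so no separate value network is needed in this formulation.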

Title: Nonparametric Data Attribution for Diffusion Models

Authors: Yutian Zhao, Chao Du, Xiaosen Zheng, Tianyu Pang, Min Lin
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.14269
Pdf URL: https://arxiv.org/pdf/2510.14269
Copy Paste: [[2510.14269]] Nonparametric Data Attribution for Diffusion Models(https://arxiv.org/abs/2510.14269)
Keywords: generative
Abstract: Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at this https URL.
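
Illustrative sketch (not taken from the paper): for a finite training set perturbed by Gaussian noise, x_t = alpha * x_0 + sigma * eps, the optimal denoiser is a softmax-weighted average of the training points, with weights proportional to exp(-||x_t - alpha * x_i||^2 / (2 * sigma^2)). The snippet below computes those weights on whole flattened images as a stand-in for a data-only attribution score. The paper works at patch level and across scales, which is not reproduced here; the function name and the toy inputs are hypothetical.

```python
import math


def posterior_weights(noisy, training_set, alpha, sigma):
    """Softmax weights of each training sample given a noisy observation.

    noisy: flat list of floats; training_set: list of flat lists of floats.
    The weights sum to 1 and can be read as per-sample influence scores.
    """
    logits = [
        -sum((n - alpha * x) ** 2 for n, x in zip(noisy, sample)) / (2 * sigma ** 2)
        for sample in training_set
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    train = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
    print(posterior_weights([0.9, 1.1], train, alpha=1.0, sigma=0.5))
```

The nearest training sample dominates the weights at low noise levels, while larger `sigma` spreads influence over more samples.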

Title: A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Authors: Shivangi Yadav, Arun Ross
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14314
Pdf URL: https://arxiv.org/pdf/2510.14314
Copy Paste: [[2510.14314]] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection(https://arxiv.org/abs/2510.14314)
Keywords: generative
Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.

Title: Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Authors: Jaewan Park, Solbee Cho, Jay-Yoon Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14337
Pdf URL: https://arxiv.org/pdf/2510.14337
Copy Paste: [[2510.14337]] Stop-RAG: Value-Based Retrieval Control for Iterative RAG(https://arxiv.org/abs/2510.14337)
Keywords: generation
Abstract: Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q($\lambda$) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.
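
Illustrative sketch (not taken from the paper): the abstract refers to forward-view Q($\lambda$) targets computed from complete trajectories. The snippet below shows a generic forward-view lambda-return computed by backward recursion over one finished episode, which interpolates between one-step bootstrapped targets (lambda = 0) and Monte Carlo returns (lambda = 1). The exact "full-width" target used by Stop-RAG is not specified in the abstract, so the function name, the argument layout, and the choice of bootstrap values are assumptions.

```python
def lambda_returns(rewards, bootstrap_values, gamma=1.0, lam=0.9):
    """Forward-view lambda-returns for one complete, finite trajectory.

    rewards[t]: reward received after step t.
    bootstrap_values[t]: value estimate of the state reached after step t
    (for example max over actions of Q); ignored for the final step.
    """
    targets = [0.0] * len(rewards)
    next_return = 0.0
    last = len(rewards) - 1
    for t in range(last, -1, -1):
        if t == last:
            targets[t] = rewards[t]  # terminal step: nothing to bootstrap from
        else:
            mixed = (1 - lam) * bootstrap_values[t] + lam * next_return
            targets[t] = rewards[t] + gamma * mixed
        next_return = targets[t]
    return targets


if __name__ == "__main__":
    # Toy episode: two retrieval rounds, then a final answer with reward 1.
    print(lambda_returns([0.0, 0.0, 1.0], [0.5, 0.5, 0.0], lam=0.5))
```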

Title: DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Authors: Dongnam Byun, Jungwon Park, Jumgmin Ko, Changin Choi, Wonjong Rhee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14376
Pdf URL: https://arxiv.org/pdf/2510.14376
Copy Paste: [[2510.14376]] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation(https://arxiv.org/abs/2510.14376)
Keywords: generation, generative
Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

Title: Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits

Authors: Guillaume Rongier, Luk Peeters
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2510.14445
Pdf URL: https://arxiv.org/pdf/2510.14445
Copy Paste: [[2510.14445]] Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits(https://arxiv.org/abs/2510.14445)
Keywords: generative
Abstract: The distribution of resources in the subsurface is deeply linked to the variations of its physical properties. Generative modeling has long been used to predict those physical properties while quantifying the associated uncertainty. But current approaches struggle to properly reproduce geological structures, and fluvial deposits in particular, because of their continuity. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - can be trained to reproduce fluvial deposits simulated by a process-based model - a more expensive model that mimics geological processes. An ablation study shows that developments from the deep-learning community to generate large 2D images are directly transferable to 3D images of fluvial deposits. Training remains stable, and the generated samples reproduce the non-stationarity and details of the deposits without mode collapse or pure memorization of the training data. Using a process-based model to generate those training data allows us to include valuable properties other than the usual physical properties. We show how the deposition time lets us monitor and validate the performance of a GAN by checking that its samples honor the law of superposition. Our work joins a series of previous studies suggesting that GANs are more robust than they are given credit for, at least for training datasets targeting specific geological structures. Whether this robustness transfers to larger 3D images and multimodal datasets remains to be seen. Exploring how deep generative models can leverage geological principles like the law of superposition shows a lot of promise.
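
Illustrative sketch (not taken from the paper): one simple way to check that a generated sample honors the law of superposition is to verify that, in every vertical column, deposition time never decreases upward. The snippet below returns the fraction of vertically adjacent cell pairs that satisfy this. The grid layout (`deposition_time[z][y][x]`, with `z` increasing upward), the convention that larger values mean later deposition, and the function name are assumptions; the paper's actual validation procedure may differ.

```python
def superposition_score(deposition_time):
    """Fraction of vertically adjacent cell pairs where the upper cell is not older.

    deposition_time: nested lists indexed as [z][y][x], z increasing upward.
    Returns 1.0 when every column honors the law of superposition.
    """
    honored = 0
    total = 0
    for lower, upper in zip(deposition_time, deposition_time[1:]):
        for row_lower, row_upper in zip(lower, upper):
            for t_lower, t_upper in zip(row_lower, row_upper):
                total += 1
                if t_upper >= t_lower:
                    honored += 1
    return honored / total if total else 1.0


if __name__ == "__main__":
    sample = [[[1, 1]], [[2, 3]], [[4, 3]]]  # three layers, one row, two columns
    print(superposition_score(sample))
```

Tracking this score over training iterations would give a physics-based monitor alongside the usual image-quality metrics.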

Title: Coder as Editor: Code-driven Interpretable Molecular Optimization

Authors: Wenyu Zhu, Chengzhu Li, Xiaohe Tian, Yifan Wang, Yinjun Jia, Jianhui Wang, Bowen Gao, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2510.14455
Pdf URL: https://arxiv.org/pdf/2510.14455
Copy Paste: [[2510.14455]] Coder as Editor: Code-driven Interpretable Molecular Optimization(https://arxiv.org/abs/2510.14455)
Keywords: generation
Abstract: Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications, particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

Title: Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Authors: Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galassi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14462
Pdf URL: https://arxiv.org/pdf/2510.14462
Copy Paste: [[2510.14462]] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review(https://arxiv.org/abs/2510.14462)
Keywords: generative
Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 and 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.

Title: Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Authors: Thomas Katraouras, Dimitrios Rafailidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14463
Pdf URL: https://arxiv.org/pdf/2510.14463
Copy Paste: [[2510.14463]] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration(https://arxiv.org/abs/2510.14463)
Keywords: restoration
Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model's optimization, effectively uncovering "winning tickets" that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at this https URL.
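
Illustrative sketch (not taken from the paper): the abstract describes iterative pruning that removes low-magnitude weights over several rounds and resets the surviving weights to their original initialization. The snippet below shows that loop on a flat list of weights. The function names, the per-round pruning fraction, and the `toy_train` stand-in for a real training run are all hypothetical; MIR-L's actual schedule and layer-wise details are not given in the abstract.

```python
def prune_lowest(weights, mask, fraction):
    """Drop the given fraction of still-active weights with the smallest magnitude."""
    active = sorted((abs(w), i) for i, (w, m) in enumerate(zip(weights, mask)) if m)
    n_drop = int(fraction * len(active))
    new_mask = list(mask)
    for _, index in active[:n_drop]:
        new_mask[index] = False
    return new_mask


def iterative_magnitude_pruning(init_weights, train_fn, rounds, fraction):
    """Train, prune by magnitude, then rewind survivors to their initial values."""
    mask = [True] * len(init_weights)
    for _ in range(rounds):
        # Rewind: every round starts again from the original initialization.
        start = [w if m else 0.0 for w, m in zip(init_weights, mask)]
        trained = train_fn(start, mask)
        mask = prune_lowest(trained, mask, fraction)
    final = [w if m else 0.0 for w, m in zip(init_weights, mask)]
    return final, mask


def toy_train(weights, mask):
    """Hypothetical stand-in for training: scales active weights, keeps pruned ones at zero."""
    return [2.0 * w if m else 0.0 for w, m in zip(weights, mask)]


if __name__ == "__main__":
    init = [0.1, -0.9, 0.5, -0.3, 0.7, 0.2, -0.8, 0.4]
    print(iterative_magnitude_pruning(init, toy_train, rounds=2, fraction=0.5))
```

Pruning a fixed fraction of the remaining weights each round means sparsity compounds geometrically, so a few rounds are enough to reach high sparsity levels.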

Title: Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Authors: Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14526
Pdf URL: https://arxiv.org/pdf/2510.14526
Copy Paste: [[2510.14526]] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models(https://arxiv.org/abs/2510.14526)
Keywords: generation
Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs a small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

Title: Exploring Image Representation with Decoupled Classical Visual Descriptors

Authors: Chenyuan Qu, Hao Chen, Jianbo Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14536
Pdf URL: https://arxiv.org/pdf/2510.14536
Copy Paste: [[2510.14536]] Exploring Image Representation with Decoupled Classical Visual Descriptors(https://arxiv.org/abs/2510.14536)
Keywords: generation
Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: this https URL.

Title: Consistent text-to-image generation via scene de-contextualization

Authors: Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14553
Pdf URL: https://arxiv.org/pdf/2510.14553
Copy Paste: [[2510.14553]] Consistent text-to-image generation via scene de-contextualization(https://arxiv.org/abs/2510.14553)
Keywords: generation
Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I's built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt's embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

Title: STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Authors: Zhifei Chen, Tianshuo Xu, Leyi Wu, Luozhou Wang, Dongyu Yan, Zihan You, Wenting Luo, Guo Zhang, Yingcong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14588
Pdf URL: https://arxiv.org/pdf/2510.14588
Copy Paste: [[2510.14588]] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding(https://arxiv.org/abs/2510.14588)
Keywords: generation
Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB $+$ auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.

Title: Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval

Authors: Rashmi R, Vidyadhar Upadhya
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14592
Pdf URL: https://arxiv.org/pdf/2510.14592
Copy Paste: [[2510.14592]] Multimodal RAG for Unstructured Data: Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval(https://arxiv.org/abs/2510.14592)
Keywords: generation
Abstract: Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA's ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.

Title: Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Authors: Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14605
Pdf URL: https://arxiv.org/pdf/2510.14605
Copy Paste: [[2510.14605]] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering(https://arxiv.org/abs/2510.14605)
Keywords: generation
Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at this https URL

Title: LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Authors: Zhuo Cao, Xuan Zhao, Lena Krieger, Hanno Scharr, Ira Assent
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14623
Pdf URL: https://arxiv.org/pdf/2510.14623
Copy Paste: [[2510.14623]] LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching(https://arxiv.org/abs/2510.14623)
Keywords: generation
Abstract: The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model's prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.

Title: Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Authors: Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14630
Pdf URL: https://arxiv.org/pdf/2510.14630
Copy Paste: [[2510.14630]] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation(https://arxiv.org/abs/2510.14630)
Keywords: generation, generative
Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
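
The abstract mentions a cosine-similarity loss that keeps the fine-tuned token close to the original SSL embedding. A minimal sketch of such a regularizer, assuming the common `1 - cos` form (the abstract does not state the exact formulation or its weighting, and the function name is hypothetical):

```python
import math

def cosine_similarity_loss(adapted, original, eps=1e-8):
    """1 - cos(adapted, original): 0 when directions match, 2 when opposite."""
    dot = sum(a * o for a, o in zip(adapted, original))
    norm_a = math.sqrt(sum(a * a for a in adapted))
    norm_o = math.sqrt(sum(o * o for o in original))
    return 1.0 - dot / max(norm_a * norm_o, eps)

# The loss ignores magnitude and only penalizes a change of direction.
print(cosine_similarity_loss([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity_loss([1.0, 0.0], [0.0, 1.0]))
```

Because the loss depends only on direction, the adapted token remains free to change its norm while staying aligned with the original embedding.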

Title: In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Authors: Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, Guosheng Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14648
Pdf URL: https://arxiv.org/pdf/2510.14648
Copy Paste: [[2510.14648]] In-Context Learning with Unpaired Clips for Instruction-based Video Editing(https://arxiv.org/abs/2510.14648)
Keywords: generation
Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

Title: Leveraging Learned Image Prior for 3D Gaussian Compression

Authors: Seungjoo Shin, Jaesik Park, Sunghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14705
Pdf URL: https://arxiv.org/pdf/2510.14705
Copy Paste: [[2510.14705]] Leveraging Learned Image Prior for 3D Gaussian Compression(https://arxiv.org/abs/2510.14705)
Keywords: restoration
Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.

Title: Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Authors: Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14751
Pdf URL: https://arxiv.org/pdf/2510.14751
Copy Paste: [[2510.14751]] Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries(https://arxiv.org/abs/2510.14751)
Keywords: generation
Abstract: Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.
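
The handcrafted FSP variant is described as a bag-of-words summary of the future of the sequence. One way to picture such a target, as a sketch under assumptions (the multi-hot encoding, the fixed `window` length, and the function name are illustrative choices that the abstract does not specify):

```python
def future_bag_of_words(tokens, vocab_size, window):
    """For each position t, mark which token ids occur in the next `window` tokens."""
    targets = []
    for t in range(len(tokens)):
        target = [0] * vocab_size
        for tok in tokens[t + 1 : t + 1 + window]:
            target[tok] = 1  # order and counts are discarded: bag of words
        targets.append(target)
    return targets

# Toy sequence over a 3-token vocabulary.
for row in future_bag_of_words([0, 2, 1, 2], vocab_size=3, window=2):
    print(row)
```

An auxiliary head trained on targets like these would have to predict what appears later in the sequence rather than only the next token.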

Title: LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement

Authors: Xu Wu, Zhihui Lai, Xianxu Hou, Jie Zhou, Ya-nan Zhang, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14753
Pdf URL: https://arxiv.org/pdf/2510.14753
Copy Paste: [[2510.14753]] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement(https://arxiv.org/abs/2510.14753)
Keywords: restoration
Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifacts. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.

Title: FraQAT: Quantization Aware Training with Fractional bits

Authors: Luca Morreale, Alberto Gil C. P. Ramos, Malcolm Chadwick, Mehid Noroozi, Ruchika Chavhan, Abhinav Mehrotra, Sourav Bhattacharya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14823
Pdf URL: https://arxiv.org/pdf/2510.14823
Copy Paste: [[2510.14823]] FraQAT: Quantization Aware Training with Fractional bits(https://arxiv.org/abs/2510.14823)
Keywords: generation, generative
Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, e.g., in INT8. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality under aggressive quantization, we propose a new fractional bits quantization (FraQAT) approach. The novelty is a simple yet effective idea: we progressively reduce the model's precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that FraQAT yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, PixArt, and FLUX.1-schnell, while achieving 4-7% lower FID than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).
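
The abstract describes progressively lowering precision from 32 to 4 bits and using fractional bit widths along the way. The sketch below is one possible reading, not the paper's algorithm: it assumes a linear schedule and a uniform symmetric quantizer whose number of levels is `round(2 ** bits)`, so that a non-integer `bits` is meaningful. Both function names are hypothetical.

```python
def bit_schedule(step_idx, total_steps, start_bits=32.0, end_bits=4.0):
    """Linearly interpolate the (possibly fractional) bit width over training."""
    frac = step_idx / max(1, total_steps - 1)
    return start_bits + frac * (end_bits - start_bits)

def fake_quantize(values, bits):
    """Snap values to a uniform grid with round(2**bits) levels (assumption)."""
    levels = max(2, round(2 ** bits))
    max_abs = max(abs(v) for v in values) or 1.0
    step = 2 * max_abs / (levels - 1)
    return [-max_abs + round((v + max_abs) / step) * step for v in values]

weights = [-1.0, -0.37, 0.0, 0.42, 1.0]
for i in range(5):
    bits = bit_schedule(i, 5)
    print(bits, fake_quantize(weights, bits))
```

With this reading, intermediate widths such as 4.5 bits simply give a grid between the 16-level and 32-level cases, which smooths the transition between integer precisions.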

Title: To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models

Authors: Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, Etai Littwin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.14826
Pdf URL: https://arxiv.org/pdf/2510.14826
Copy Paste: [[2510.14826]] To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models(https://arxiv.org/abs/2510.14826)
Keywords: generation
Abstract: State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.

Title: ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Authors: Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14847
Pdf URL: https://arxiv.org/pdf/2510.14847
Copy Paste: [[2510.14847]] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints(https://arxiv.org/abs/2510.14847)
Keywords: generation
Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.

Title: Benchmarking Multimodal Large Language Models for Face Recognition

Authors: Hatef Otroshi Shahreza, Sébastien Marcel
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.14866
Pdf URL: https://arxiv.org/pdf/2510.14866
Copy Paste: [[2510.14866]] Benchmarking Multimodal Large Language Models for Face Recognition(https://arxiv.org/abs/2510.14866)
Keywords: generation
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocols. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available on the project page.

Title: TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Authors: Guangyi Han, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14874
Pdf URL: https://arxiv.org/pdf/2510.14874
Copy Paste: [[2510.14874]] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions(https://arxiv.org/abs/2510.14874)
Keywords: generation
Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is available at this https URL.

Title: ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention

Authors: Keli Liu, Zhendong Wang, Wengang Zhou, Shaodong Xu, Ruixiao Dong, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14882
Pdf URL: https://arxiv.org/pdf/2510.14882
Copy Paste: [[2510.14882]] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention(https://arxiv.org/abs/2510.14882)
Keywords: generation, generative
Abstract: Text-to-image generation with visual autoregressive (VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within the VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
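
The abstract credits a zero-initialized linear projection with letting control signals enter without disturbing the base model. The toy sketch below illustrates why that holds at initialization. It is not ScaleWeaver's module: the additive injection, the shapes, and the helper names are assumptions made for illustration.

```python
def linear(weights, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def inject_control(base_features, control_features, projection):
    """Add projected control features to the base features (assumed additive)."""
    projected = linear(projection, control_features)
    return [b + p for b, p in zip(base_features, projected)]

base = [0.5, -1.0]
control = [1.0, 2.0, 3.0]

# At initialization the projection is all zeros, so the output equals the
# base features and the pretrained behaviour is untouched.
zero_projection = [[0.0] * 3 for _ in range(2)]
print(inject_control(base, control, zero_projection))

# After training, the projection becomes non-zero and control takes effect.
trained_projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(inject_control(base, control, trained_projection))
```

Starting from an exact identity means fine-tuning begins at the base model's quality and only gradually learns how much control signal to mix in.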

Title: 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

Authors: JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14945
Pdf URL: https://arxiv.org/pdf/2510.14945
Copy Paste: [[2510.14945]] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation(https://arxiv.org/abs/2510.14945)
Keywords: generation
Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: this https URL

Title: OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Authors: Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong, Chang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.14954
Pdf URL: https://arxiv.org/pdf/2510.14954
Copy Paste: [[2510.14954]] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression(https://arxiv.org/abs/2510.14954)
Keywords: generation
Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.
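Summary: Whole-body multimodal human motion generation faces two main challenges: building an effective motion generation mechanism and integrating modalities such as text, speech, and music into one cohesive framework. Unlike previous methods, which typically use discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer in which causal attention accounts for the sequential nature of human motion. Within this transformer, we introduce gated linear attention and an RMSNorm module, which drive the transformer to attend to key actions and suppress the instability caused by abnormal movements or by heterogeneous distributions across modalities. To further improve motion generation and multimodal generalization, we adopt the DiT structure to diffuse the conditions from the transformer toward the targets. To fuse the different modalities, AdaLN and cross-attention are used to inject the text, speech, and music signals. Experimental results show that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.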
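
The abstract names an RMSNorm module as one of the stabilizing components. As a rough illustration of what RMS normalization computes in general (not the authors' implementation), here is a minimal sketch in plain Python. The function name `rms_norm`, the `gain` parameter, and the `eps` value are assumptions chosen for this example.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale a feature vector so its root-mean-square is close to 1.

    `gain` is an optional per-feature scale (hypothetical name); it
    defaults to all ones. `eps` guards against division by zero.
    """
    if gain is None:
        gain = [1.0] * len(x)
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

# Two inputs that differ only in scale map to nearly the same output,
# which is why this kind of layer can damp unusually large activations.
small = rms_norm([3.0, 4.0])
large = rms_norm([30.0, 40.0])
```

Unlike LayerNorm, this form does not subtract the mean; it only divides by the RMS, so the direction of the vector is preserved.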

Title: RealDPO: Real or Not Real, that is the Preference

Authors: Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si, Ziwei Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14955
Pdf URL: https://arxiv.org/pdf/2510.14955
Copy Paste: [[2510.14955]] RealDPO: Real or Not Real, that is the Preference(https://arxiv.org/abs/2510.14955)
Keywords: generative
Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
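Summary: Video generative models have recently made notable progress in synthesis quality. However, generating complex motion remains a serious challenge, because existing models often struggle to produce natural, smooth, and contextually consistent movement. The gap between generated and real-world motion limits their practical applicability. To address this, we introduce RealDPO, a novel alignment paradigm that uses real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO uses Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction that progressively refines motion quality. To support post-training for complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos that capture daily human activities with rich and precise motion details. Extensive experiments show that RealDPO significantly improves video quality, text alignment, and motion realism compared with state-of-the-art models and existing preference optimization techniques.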
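
For readers unfamiliar with DPO, the sketch below computes the standard DPO objective for a single preference pair. It is not RealDPO's tailored loss, which the abstract does not specify. The assumption here is that the real video plays the role of the preferred ("win") sample and an erroneous model output the rejected ("lose") sample. All argument names and the default `beta` are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-likelihoods under the model being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    beta       : strength of the implicit KL constraint (assumed value)
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss falls below log(2) once the trained model favors the preferred
# sample more strongly than the reference model does.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

For diffusion-based video models the log-likelihoods are not directly available, so practical variants usually substitute a denoising-error surrogate; that substitution is outside the scope of this sketch.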

Title: MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Authors: Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.14958
Pdf URL: https://arxiv.org/pdf/2510.14958
Copy Paste: [[2510.14958]] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning(https://arxiv.org/abs/2510.14958)
Keywords: generation
Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: this https URL
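Summary: Although Large Language Models (LLMs) excel at textual reasoning, they struggle in mathematical domains such as geometry that inherently rely on visual aids. Existing Visual Chain-of-Thought (VCoT) approaches are often constrained by rigid external tools, or cannot generate the high-fidelity, strategically timed diagrams that complex problem solving requires. To close this gap, we introduce MathCanvas, a comprehensive framework designed to give unified Large Multimodal Models (LMMs) intrinsic VCoT capabilities for mathematics. Our approach has two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), so that it masters diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to use visual aids. For rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark of 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench and generalizes well to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) for unlocking complex, human-like visual-aided reasoning in LMMs. Project page: this https URL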

Title: Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Authors: Jonas Geiping, Xinyu Yang, Guinan Su
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.14961
Pdf URL: https://arxiv.org/pdf/2510.14961
Copy Paste: [[2510.14961]] Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models(https://arxiv.org/abs/2510.14961)
Keywords: generation
Abstract: Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
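Summary: Language models with recurrent depth, also called universal or looped models in the transformer setting, are defined by their ability to increase computation by repeating layers. Recent pretraining efforts have shown that these architectures scale to modern language modeling tasks while showing advantages on reasoning tasks. In this work, we study the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of those tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than baseline autoregressive generation under the same time budget on modern hardware. Moreover, the sampler, which is based on principles from the diffusion literature, can be applied directly to existing 3.5B recurrent-depth transformers without any tuning, yielding up to a 5x speedup. Our findings therefore not only provide an efficient mechanism for parallelizing the extra computation of recurrent-depth models at inference, but also suggest that such models can naturally be viewed as strong continuous, though causal, diffusion language models.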

Title: pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Authors: Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.14974
Pdf URL: https://arxiv.org/pdf/2510.14974
Copy Paste: [[2510.14974]] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation(https://arxiv.org/abs/2510.14974)
Keywords: generation, generative
Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.
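Summary: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut toward denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model so that it predicts a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration over those substeps without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach that matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow with the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods while maintaining teacher-level quality.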
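
The imitation distillation idea described above (match the policy's velocity to the teacher's at states visited by the policy's own trajectory, using an $\ell_2$ loss) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's training code: `policy_v` and `teacher_v` are hypothetical callables `(x, t) -> velocity`, integration uses plain Euler substeps, the time convention (integrating from `t0` to `t1`) is assumed, and gradient handling is not modeled.

```python
def imitation_l2_loss(policy_v, teacher_v, x0, t0, t1, n_sub=4):
    """Mean squared velocity mismatch along the policy's own trajectory.

    The state is advanced with the policy's velocity only; the teacher is
    queried at each visited state but never drives the rollout.
    """
    x = list(x0)
    dt = (t1 - t0) / n_sub
    total = 0.0
    for i in range(n_sub):
        t = t0 + i * dt
        v_pi = policy_v(x, t)
        v_te = teacher_v(x, t)
        total += sum((a - b) ** 2 for a, b in zip(v_pi, v_te))
        # Euler step using the policy velocity (on-policy rollout)
        x = [xi + dt * vi for xi, vi in zip(x, v_pi)]
    return total / n_sub

# Toy teacher: velocity field pointing toward the origin.
teacher = lambda x, t: [-xi for xi in x]
# A policy identical to the teacher incurs zero loss.
perfect = imitation_l2_loss(teacher, teacher, [1.0, 2.0], 1.0, 0.0)
```

Evaluating the mismatch on the policy's trajectory, rather than the teacher's, is what makes this an imitation-style objective: the student is corrected at the states it actually reaches.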

Title: WithAnyone: Towards Controllable and ID Consistent Image Generation

Authors: Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14975
Pdf URL: https://arxiv.org/pdf/2510.14975
Copy Paste: [[2510.14975]] WithAnyone: Towards Controllable and ID Consistent Image Generation(https://arxiv.org/abs/2510.14975)
Keywords: generation
Abstract: Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
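Summary: Identity-consistent generation has become an important focus of text-to-image research, and recent models have achieved notable success in producing images aligned with a reference identity. However, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we call copy-paste, in which the model directly replicates the reference face instead of preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct MultiID-2M, a large-scale paired dataset tailored to multi-person scenarios that provides diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that uses paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments show that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further confirm that our method achieves high identity fidelity while enabling expressive, controllable generation.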

Title: Terra: Explorable Native 3D World Model with Point Latents

Authors: Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14977
Pdf URL: https://arxiv.org/pdf/2510.14977
Copy Paste: [[2510.14977]] Terra: Explorable Native 3D World Model with Point Latents(https://arxiv.org/abs/2510.14977)
Keywords: generation
Abstract: World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.
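Summary: World models have attracted growing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherently 3D nature of the physical world. This can undermine 3D consistency and reduce the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is then decoded into 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) to generate the latent point representation, denoising the positions and features of the point latents simultaneously. Terra enables exact multi-view consistency through its native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.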