2025-12-05

Title: ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Authors: Kerry Luo, Michael Fu, Joshua Peguero, Husnain Malik, Anvay Patil, Joyce Lin, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04125
Pdf URL: https://arxiv.org/pdf/2512.04125
Copy Paste: [[2512.04125]] ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text(https://arxiv.org/abs/2512.04125)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium where characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories, yielding chance-level performance even for low-variance classes. In contrast, classes with high internal mean similarity exhibit clear discriminability, revealing that the bottleneck lies in representation rather than generational variance. These findings position ASCII art as a stress test for multimodal representations and motivate the development of new embedding methods or evaluation metrics tailored to symbolic visual modalities. All resources are available at this https URL.
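Illustration: a minimal sketch, not the authors' code, of the kind of similarity analysis the abstract describes: mean cosine similarity within a class compared with mean similarity across classes. The 3-D vectors, class names, and helper names are hypothetical stand-ins for CLIP embeddings of ASCII images.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_class_similarity(embeddings):
    """Average cosine similarity over all unordered pairs inside one class."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_inter_class_similarity(class_a, class_b):
    """Average cosine similarity over all pairs drawn from two different classes."""
    sims = [cosine(u, v) for u in class_a for v in class_b]
    return sum(sims) / len(sims)


# Hypothetical toy embeddings (assumption: real CLIP vectors are much larger).
cats = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
trees = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]

intra_cats = mean_intra_class_similarity(cats)
inter = mean_inter_class_similarity(cats, trees)
```

In this toy case the classes separate cleanly because `intra_cats` is far above `inter`; the abstract reports that most real ASCII categories do not separate this way.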

Title: Decoding Large Language Diffusion Models with Foreseeing Movement

Authors: Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, Yisen Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04135
Pdf URL: https://arxiv.org/pdf/2512.04135
Copy Paste: [[2512.04135]] Decoding Large Language Diffusion Models with Foreseeing Movement(https://arxiv.org/abs/2512.04135)
Keywords: generation
Abstract: Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generations over autoregressive models. Yet such flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens in the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work may provide a principled step toward more powerful decoding methods for LLDMs.

Title: MechDetect: Detecting Data-Dependent Errors

Authors: Philipp Jung, Nicholas Chandler, Sebastian Jäger, Felix Biessmann
Subjects: cs.LG, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2512.04138
Pdf URL: https://arxiv.org/pdf/2512.04138
Copy Paste: [[2512.04138]] MechDetect: Detecting Data-Dependent Errors(https://arxiv.org/abs/2512.04138)
Keywords: generation
Abstract: Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
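Illustration: a rough sketch of the question the abstract poses, namely whether an error mask depends on the data. This is not MechDetect itself: instead of a trained machine learning model it uses a single column as the predictor, scores it with a rank-based AUC, and compares that score against shuffled masks. All names and the toy column are assumptions made for the example.

```python
import random


def auc(scores, labels):
    """Chance that a random erroneous row outranks a random clean row (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dependence_p_value(feature, error_mask, n_permutations=500, seed=0):
    """Permutation test: is the mask more predictable from the feature than by chance?"""
    rng = random.Random(seed)
    observed = abs(auc(feature, error_mask) - 0.5)
    shuffled = list(error_mask)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between the data and the mask
        if abs(auc(feature, shuffled) - 0.5) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical column in which errors occur only on the largest values.
values = list(range(20))
mask_dependent = [v >= 15 for v in values]

p_dependent = dependence_p_value(values, mask_dependent)
```

A small p-value suggests the errors are not completely at random with respect to this column; a fuller treatment would use all columns and a proper classifier, as the abstract indicates.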

Title: Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Authors: Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04175
Pdf URL: https://arxiv.org/pdf/2512.04175
Copy Paste: [[2512.04175]] Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection(https://arxiv.org/abs/2512.04175)
Keywords: generation
Abstract: Generalizing deepfake detection to unseen manipulations remains a key challenge. A recent approach to tackle this issue is to train a network with pristine face images that have been manipulated with hand-crafted artifacts to extract more generalizable clues. While effective for static images, extending this to the video domain is an open issue. Existing methods model temporal artifacts as frame-to-frame instabilities, overlooking a key vulnerability: the violation of natural motion dependencies between different facial regions. In this paper, we propose a synthetic video generation method that creates training data with subtle kinematic inconsistencies. We train an autoencoder to decompose facial landmark configurations into motion bases. By manipulating these bases, we selectively break the natural correlations in facial movements and introduce these artifacts into pristine videos via face morphing. A network trained on our data learns to spot these sophisticated biomechanical flaws, achieving state-of-the-art generalization results on several popular benchmarks.

Title: MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Authors: Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04221
Pdf URL: https://arxiv.org/pdf/2512.04221
Copy Paste: [[2512.04221]] MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis(https://arxiv.org/abs/2512.04221)
Keywords: generation
Abstract: While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

Title: ReasonX: MLLM-Guided Intrinsic Image Decomposition

Authors: Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04222
Pdf URL: https://arxiv.org/pdf/2512.04222
Copy Paste: [[2512.04222]] ReasonX: MLLM-Guided Intrinsic Image Decomposition(https://arxiv.org/abs/2512.04222)
Keywords: generative
Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.
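Illustration: the abstract reports gains in WHDR (Weighted Human Disagreement Rate) on IIW. The sketch below shows how that metric is commonly computed from pairwise "which point is darker" judgments. The label convention ("1", "2", "E"), the threshold `delta=0.1`, and the sample values are assumptions for the example, not details taken from this paper.

```python
def predicted_relation(r1, r2, delta=0.1):
    """Classify a pair of predicted reflectances (both > 0) as darker/equal."""
    if r2 / r1 > 1.0 + delta:
        return "1"  # point 1 is darker
    if r1 / r2 > 1.0 + delta:
        return "2"  # point 2 is darker
    return "E"      # roughly equal reflectance


def whdr(comparisons, delta=0.1):
    """comparisons: list of (r1, r2, human_label, weight) tuples."""
    total = sum(weight for _, _, _, weight in comparisons)
    wrong = sum(weight for r1, r2, label, weight in comparisons
                if predicted_relation(r1, r2, delta) != label)
    return wrong / total


# Hypothetical judgments: the third prediction contradicts the annotators.
sample = [
    (0.2, 0.5, "1", 1.0),
    (0.5, 0.5, "E", 0.5),
    (0.6, 0.3, "1", 0.5),
]
score = whdr(sample)
```

Lower is better: the score is the weighted share of human judgments that the predicted albedo contradicts.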

Title: ActVAE: Modelling human activity schedules with a deep conditional generative approach

Authors: Fred Shone, Tim Hillel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04223
Pdf URL: https://arxiv.org/pdf/2512.04223
Copy Paste: [[2512.04223]] ActVAE: Modelling human activity schedules with a deep conditional generative approach(https://arxiv.org/abs/2512.04223)
Keywords: generation, generative
Abstract: Modelling the complexity and diversity of human activity scheduling behaviour is inherently challenging. We demonstrate a deep conditional-generative machine learning approach for the modelling of realistic activity schedules depending on input labels such as an individual's age, employment status, or other information relevant to their scheduling. We combine (i) a structured latent generative approach, with (ii) a conditional approach, through a novel Conditional VAE architecture. This allows for the rapid generation of precise and realistic schedules for different input labels. We extensively evaluate model capabilities using a joint density estimation framework and several case studies. We additionally show that our approach has practical data and computational requirements, and can be deployed within new and existing demand modelling frameworks. We evaluate the importance of generative capability more generally, by comparing our combined approach to (i) a purely generative model without conditionality, and (ii) a purely conditional model which outputs the most likely schedule given the input labels. This comparison highlights the usefulness of explicitly modelling the randomness of complex and diverse human behaviours using deep generative approaches.

Title: MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Authors: Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04248
Pdf URL: https://arxiv.org/pdf/2512.04248
Copy Paste: [[2512.04248]] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models(https://arxiv.org/abs/2512.04248)
Keywords: generation
Abstract: We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.

Title: UniLight: A Unified Representation for Lighting

Authors: Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04267
Pdf URL: https://arxiv.org/pdf/2512.04267
Copy Paste: [[2512.04267]] UniLight: A Unified Representation for Lighting(https://arxiv.org/abs/2512.04267)
Keywords: generation
Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

Title: Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

Authors: Tasmiah Haque, Srinjoy Das
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04282
Pdf URL: https://arxiv.org/pdf/2512.04282
Copy Paste: [[2512.04282]] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer(https://arxiv.org/abs/2512.04282)
Keywords: generative
Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts, we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.

Title: Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

Authors: Fan Jia, Yuhao Huang, Shih-Hsin Wang, Cristina Garcia-Cardona, Andrea L. Bertozzi, Bao Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04283
Pdf URL: https://arxiv.org/pdf/2512.04283
Copy Paste: [[2512.04283]] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint(https://arxiv.org/abs/2512.04283)
Keywords: restoration, super-resolution, generative
Abstract: Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.

Title: Learning Single-Image Super-Resolution in the JPEG Compressed Domain

Authors: Sruthi Srinivasan, Elham Shakibapour, Rajy Rawther, Mehdi Saeedi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04284
Pdf URL: https://arxiv.org/pdf/2512.04284
Copy Paste: [[2512.04284]] Learning Single-Image Super-Resolution in the JPEG Compressed Domain(https://arxiv.org/abs/2512.04284)
Keywords: restoration, super-resolution
Abstract: Deep learning models have grown increasingly complex, with input data sizes scaling accordingly. Despite substantial advances in specialized deep learning hardware, data loading continues to be a major bottleneck that limits training and inference speed. To address this challenge, we propose training models directly on encoded JPEG features, reducing the computational overhead associated with full JPEG decoding and significantly improving data loading efficiency. While prior works have focused on recognition tasks, we investigate the effectiveness of this approach for the restoration task of single-image super-resolution (SISR). We present a lightweight super-resolution pipeline that operates on JPEG discrete cosine transform (DCT) coefficients in the frequency domain. Our pipeline achieves a 2.6x speedup in data loading and a 2.5x speedup in training, while preserving visual quality comparable to standard SISR approaches.
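Illustration: for readers unfamiliar with the representation the pipeline consumes, this sketch computes the 8x8 two-dimensional DCT-II with the normalization used by baseline JPEG. It is a plain reference implementation written for clarity, not the authors' data loader; JPEG's level shift and quantization are omitted, and the function name is hypothetical.

```python
import math

N = 8  # JPEG transforms the image in 8x8 blocks


def dct2_8x8(block):
    """2-D DCT-II of one 8x8 block (list of 8 rows of 8 numbers)."""
    def c(k):
        # Scale factor that makes the zero-frequency basis function orthonormal.
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * acc
    return out


# A flat block has energy only in the DC coefficient at position [0][0].
flat = [[10.0] * N for _ in range(N)]
coeffs = dct2_8x8(flat)
```

A model trained in the compressed domain reads coefficients like these directly from the file, which is why the inverse transform in a full JPEG decode can be skipped.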

Title: Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

Authors: Rui Fonseca, Bruno Martins, Gil Rocha
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.04309
Pdf URL: https://arxiv.org/pdf/2512.04309
Copy Paste: [[2512.04309]] Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction(https://arxiv.org/abs/2512.04309)
Keywords: generation
Abstract: Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any humanly-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, i.e., an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after undergoing a process to reduce the modality gap. We specifically tested the combined use of retrieved examples of captions, and latent vector representations, to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation and modality gap reduction components.

Title: Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks

Authors: Biao Chen, Zhenhua Lei, Yahui Zhang, Tongzhi Niu
Subjects: cs.CV, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04323
Pdf URL: https://arxiv.org/pdf/2512.04323
Copy Paste: [[2512.04323]] Bayes-DIC Net: Estimating Digital Image Correlation Uncertainty with Bayesian Neural Networks(https://arxiv.org/abs/2512.04323)
Keywords: generation
Abstract: This paper introduces a novel method for generating high-quality Digital Image Correlation (DIC) datasets based on non-uniform B-spline surfaces. By randomly generating control point coordinates, we construct displacement fields that encompass a variety of realistic displacement scenarios, which are subsequently used to generate speckle pattern datasets. This approach enables the generation of a large-scale dataset that captures real-world displacement field situations, thereby enhancing the training and generalization capabilities of deep learning-based DIC algorithms. Additionally, we propose a novel network architecture, termed Bayes-DIC Net, which extracts information at multiple levels during the down-sampling phase and facilitates the aggregation of information across various levels through a single skip connection during the up-sampling phase. Bayes-DIC Net incorporates a series of lightweight convolutional blocks designed to expand the receptive field and capture rich contextual information while minimizing computational costs. Furthermore, by integrating appropriate dropout modules into Bayes-DIC Net and activating them during the network inference stage, Bayes-DIC Net is transformed into a Bayesian neural network. This transformation allows the network to provide not only predictive results but also confidence levels in these predictions when processing real unlabeled datasets. This feature significantly enhances the practicality and reliability of our network in real-world displacement field prediction tasks. Through these innovations, this paper offers new perspectives and methods for dataset generation and algorithm performance enhancement in the field of DIC.
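Illustration: keeping dropout active at inference and averaging several stochastic passes is usually called Monte Carlo dropout. The sketch below shows the idea on a hypothetical single linear unit, which is an assumption made to keep the example self-contained and is in no way the Bayes-DIC Net architecture.

```python
import random
import statistics


def forward(x, weights, drop_prob, rng):
    """One stochastic pass of a linear unit with inverted dropout on its inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:          # keep this input with probability `keep`
            total += wi * xi / keep      # rescale so the expected output is unchanged
    return total


def mc_dropout_predict(x, weights, drop_prob=0.2, n_samples=200, seed=0):
    """Return (mean, standard deviation) over repeated dropout-enabled passes."""
    rng = random.Random(seed)
    samples = [forward(x, weights, drop_prob, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical input and weights.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.5]

mean_mc, std_mc = mc_dropout_predict(x, w)
mean_det, std_det = mc_dropout_predict(x, w, drop_prob=0.0)
```

The mean serves as the prediction and the spread as a confidence indicator; with `drop_prob=0.0` every pass is identical, so the spread collapses to zero.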

Title: A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Authors: Waleed Khalid, Dmitry Ignatov, Radu Timofte
Subjects: cs.CV, cs.SE
Abstract URL: https://arxiv.org/abs/2512.04329
Pdf URL: https://arxiv.org/pdf/2512.04329
Copy Paste: [[2512.04329]] A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks(https://arxiv.org/abs/2512.04329)
Keywords: generation
Abstract: Reusing existing neural-network components is central to research efficiency, yet discovering, extracting, and validating such modules across thousands of open-source repositories remains difficult. We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorch codebases into a searchable and executable library of validated neural modules. Unlike conventional code search or clone-detection tools, NN-RAG performs scope-aware dependency resolution, import-preserving reconstruction, and validator-gated promotion -- ensuring that every retrieved block is scope-closed, compilable, and runnable. Applied to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique. Through multi-level de-duplication (exact, lexical, structural), we find that NN-RAG contributes the overwhelming majority of unique architectures to the LEMUR dataset, supplying approximately 72% of all novel network structures. Beyond quantity, NN-RAG uniquely enables cross-repository migration of architectural patterns, automatically identifying reusable modules in one project and regenerating them, dependency-complete, in another context. To our knowledge, no other open-source system provides this capability at scale. The framework's neutral specifications further allow optional integration with language models for synthesis or dataset registration without redistributing third-party code. Overall, NN-RAG transforms fragmented vision code into a reproducible, provenance-tracked substrate for algorithmic discovery, offering a first open-source solution that both quantifies and expands the diversity of executable neural architectures across repositories.
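Illustration: the abstract names three de-duplication levels (exact, lexical, structural) without defining them. The sketch below is one plausible reading, built only on Python's standard library: exact compares raw text, lexical ignores comments and spacing, and structural also ignores identifier names. The key functions and the normalization choices are assumptions, not NN-RAG's implementation.

```python
import ast
import hashlib
import io
import tokenize


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_key(source):
    """Identical only if the source text matches byte for byte."""
    return _digest(source)


def lexical_key(source):
    """Identical if the token streams match after dropping comments and spacing."""
    layout = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            continue
        # Layout tokens are recorded by kind so indentation width does not matter.
        parts.append(tokenize.tok_name[tok.type] if tok.type in layout else tok.string)
    return _digest(" ".join(parts))


class _Normalize(ast.NodeTransformer):
    """Replace every function, argument, and variable name with a placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node


def structural_key(source):
    """Identical if the syntax trees match once names are normalized."""
    tree = _Normalize().visit(ast.parse(source))
    return _digest(ast.dump(tree))


a = "def f(x):\n    return x + 1\n"
b = "def f(x):  # add one\n    return x+1\n"
c = "def g(y):\n    return y + 1\n"
```

Here `a` and `b` differ only at the exact level, while `a` and `c` collapse only at the structural level, which is the kind of layering that lets a pipeline report how many blocks are unique in structure.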

Title: Open Set Face Forgery Detection via Dual-Level Evidence Collection

Authors: Zhongyi Cai, Bryce Gernon, Wentao Bao, Yifan Li, Matthew Wright, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04331
Pdf URL: https://arxiv.org/pdf/2512.04331
Copy Paste: [[2512.04331]] Open Set Face Forgery Detection via Dual-Level Evidence Collection(https://arxiv.org/abs/2512.04331)
Keywords: generation
Abstract: The proliferation of face forgeries has increasingly undermined confidence in the authenticity of online content. Given the rapid development of face forgery generation algorithms, new fake categories are likely to keep appearing, posing a major challenge to existing face forgery detection methods. Despite recent advances in face forgery detection, existing methods are typically limited to binary Real-vs-Fake classification or the identification of known fake categories, and are incapable of detecting the emergence of novel types of forgeries. In this work, we study the Open Set Face Forgery Detection (OSFFD) problem, which demands that the detection model recognize novel fake categories. We reformulate the OSFFD problem and address it through uncertainty estimation, enhancing its applicability to real-world scenarios. Specifically, we propose the Dual-Level Evidential face forgery Detection (DLED) approach, which collects and fuses category-specific evidence on the spatial and frequency levels to estimate prediction uncertainty. Extensive evaluations conducted across diverse experimental settings demonstrate that the proposed DLED method achieves state-of-the-art performance, outperforming various baseline models by an average of 20% in detecting forgeries from novel fake categories. Moreover, on the traditional Real-versus-Fake face forgery detection task, our DLED method concurrently exhibits competitive performance.

Title: Data-regularized Reinforcement Learning for Diffusion Models at Scale

Authors: Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04332
Pdf URL: https://arxiv.org/pdf/2512.04332
Copy Paste: [[2512.04332]] Data-regularized Reinforcement Learning for Diffusion Models at Scale(https://arxiv.org/abs/2512.04332)
Keywords: generation, generative
Abstract: Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.

Title: Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Authors: Manh Nguyen, Sunil Gupta, Hung Le
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04351
Pdf URL: https://arxiv.org/pdf/2512.04351
Copy Paste: [[2512.04351]] Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models(https://arxiv.org/abs/2512.04351)
Keywords: generation
Abstract: Detecting when large language models (LLMs) are uncertain is critical for building reliable systems, yet existing methods are overly complicated, relying on brittle semantic clustering or internal states. We introduce the Radial Dispersion Score (RDS), a simple, parameter-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. A lightweight probability-weighted variant further incorporates the model's own token probabilities when available, outperforming nine strong baselines. Moreover, RDS naturally extends to per-sample scoring, enabling applications such as best-of-$N$ selection and confidence-based filtering. Across four challenging free-form QA datasets and multiple LLMs, our metrics achieve state-of-the-art hallucination detection and answer selection performance, while remaining robust and scalable with respect to sample size and embedding choice.
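Editor's sketch for the RDS entry above: the abstract describes the score only as the radial dispersion of sampled generations in embedding space and does not give a formula. The snippet below takes one plausible reading, the mean Euclidean distance of each sample embedding from the centroid. The function name `radial_dispersion`, the centroid-based definition, and the toy vectors are all assumptions for illustration, not the authors' metric.

```python
import math


def radial_dispersion(embeddings):
    """Mean Euclidean distance of each embedding from the centroid.

    Assumed definition for illustration only; the paper's exact RDS
    formula is not stated in the abstract.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    dim = len(embeddings[0])
    # Centroid of the sampled-answer embeddings.
    centroid = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]
    # Average radius around the centroid: larger means the samples disagree more.
    return sum(math.dist(vec, centroid) for vec in embeddings) / n


# Hypothetical 2-D "embeddings" of sampled answers.
tight = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]                 # identical answers
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # scattered answers

print(radial_dispersion(tight), radial_dispersion(spread))
```

Under this reading, identical samples score zero and scattered samples score higher, which matches the intuition that dispersion signals uncertainty.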

Title: FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring

Authors: Geunhyuk Youk, Jihyong Oh, Munchurl Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04390
Pdf URL: https://arxiv.org/pdf/2512.04390
Copy Paste: [[2512.04390]] FMA-Net++: Motion- and Exposure-Aware Real-World Joint Video Super-Resolution and Deblurring(https://arxiv.org/abs/2512.04390)
Keywords: restoration, super-resolution
Abstract: Real-world video restoration is plagued by complex degradations from motion coupled with dynamically varying exposure - a key challenge largely overlooked by prior works and a common artifact of auto-exposure or low-light capture. We present FMA-Net++, a framework for joint video super-resolution and deblurring that explicitly models this coupled effect of motion and dynamically varying exposure. FMA-Net++ adopts a sequence-level architecture built from Hierarchical Refinement with Bidirectional Propagation blocks, enabling parallel, long-range temporal modeling. Within each block, an Exposure Time-aware Modulation layer conditions features on per-frame exposure, which in turn drives an exposure-aware Flow-Guided Dynamic Filtering module to infer motion- and exposure-aware degradation kernels. FMA-Net++ decouples degradation learning from restoration: the former predicts exposure- and motion-aware priors to guide the latter, improving both accuracy and efficiency. To evaluate under realistic capture conditions, we introduce REDS-ME (multi-exposure) and REDS-RE (random-exposure) benchmarks. Trained solely on synthetic data, FMA-Net++ achieves state-of-the-art accuracy and temporal consistency on our new benchmarks and GoPro, outperforming recent methods in both restoration quality and inference speed, and generalizes well to challenging real-world videos.

Title: Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Authors: Sidan Zhu, Hongteng Xu, Dixin Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04426
Pdf URL: https://arxiv.org/pdf/2512.04426
Copy Paste: [[2512.04426]] Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation(https://arxiv.org/abs/2512.04426)
Keywords: generation
Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: this https URL.
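Editor's sketch for the SSMP entry above: the abstract says that at generation time the model fills the high-confidence shot positions at each step and re-masks the rest for the next prediction. The loop below shows that general pattern with a stand-in predictor. The names `iterative_fill` and `toy_predict`, the equal-share commit schedule, and the confidence values are hypothetical; the actual model, schedule, and scoring are not given in the abstract.

```python
import math

MASK = None  # marker for a position that has not been committed yet


def iterative_fill(length, predict, steps):
    """Fill `length` slots over at most `steps` rounds.

    Each round, `predict(slots, masked)` returns {position: (candidate, confidence)}
    for the masked positions. Only the most confident candidates are committed;
    the others stay masked and are predicted again in the next round.
    """
    slots = [MASK] * length
    for step in range(steps):
        masked = [i for i, s in enumerate(slots) if s is MASK]
        if not masked:
            break
        guesses = predict(slots, masked)
        # Commit an equal share of what is left; the final round commits everything.
        keep = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:keep]:
            slots[i] = guesses[i][0]
    return slots


def toy_predict(slots, masked):
    """Stand-in for a trained model: earlier positions get higher confidence."""
    return {i: (f"shot{i}", 1.0 / (i + 1)) for i in masked}


trailer = iterative_fill(4, toy_predict, steps=2)
print(trailer)
```

With a real model, the predictor would condition on the already committed slots, so later rounds can revise earlier low-confidence guesses instead of inheriting them.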

Title: MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Authors: Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04441
Pdf URL: https://arxiv.org/pdf/2512.04441
Copy Paste: [[2512.04441]] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving(https://arxiv.org/abs/2512.04441)
Keywords: generation, generative
Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

Title: StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

Authors: Yifei Wang, Zhenkai Li, Tianwen Qian, Huanran Zheng, Zheng Wang, Yuqian Fu, Xiaoling Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04451
Pdf URL: https://arxiv.org/pdf/2512.04451
Copy Paste: [[2512.04451]] StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios(https://arxiv.org/abs/2512.04451)
Keywords: generation
Abstract: As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend the interactions with surrounding entities, and dynamically plan actions informed by past observations, current contexts, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.

Title: GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Authors: Changjin Kim, HyeokJun Lee, YoungJoon Yoo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04456
Pdf URL: https://arxiv.org/pdf/2512.04456
Copy Paste: [[2512.04456]] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis(https://arxiv.org/abs/2512.04456)
Keywords: generation, generative
Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate these prerequisites, we propose GuidNoise, a Single-Pair Guided Diffusion for generalized noise synthesis, which uses a single noisy/clean pair as the guidance; such a pair is often easily obtained from within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at this https URL.

Title: dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Authors: Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04459
Pdf URL: https://arxiv.org/pdf/2512.04459
Copy Paste: [[2512.04459]] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning(https://arxiv.org/abs/2512.04459)
Keywords: generation
Abstract: The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

Title: UniTS: Unified Time Series Generative Model for Remote Sensing

Authors: Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04461
Pdf URL: https://arxiv.org/pdf/2512.04461
Copy Paste: [[2512.04461]] UniTS: Unified Time Series Generative Model for Remote Sensing(https://arxiv.org/abs/2512.04461)
Keywords: generation, generative
Abstract: One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.
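Editor's sketch for the UniTS entry above: the abstract says the model follows the flow matching paradigm and builds a deterministic path from noise to targets. A common choice in flow matching is a straight-line path with constant velocity, integrated with Euler steps at sampling time. The snippet below illustrates only that generic idea: the linear path, the oracle velocity (a trained network would predict it, conditioned on the task), and every name here are assumptions, not details from the paper.

```python
def linear_path(noise, target, t):
    """Point at time t in [0, 1] on the straight line from noise to target."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def velocity(noise, target):
    """Constant velocity of the straight-line path (the regression target in training)."""
    return [x - n for n, x in zip(noise, target)]


def euler_integrate(noise, target, steps):
    """Deterministic sampling: start at the noise and follow the velocity field.

    The true velocity is used as an oracle here so the example runs without
    a trained network.
    """
    x = list(noise)
    v = velocity(noise, target)
    dt = 1.0 / steps
    for _ in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2]   # hypothetical starting sample
target = [1.0, 2.0]   # hypothetical clean value
sample = euler_integrate(noise, target, steps=10)
print(sample)
```

Because the path has no stochastic term, the same starting noise always maps to the same output, which is the sense in which such a path is deterministic.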

Title: GraphBench: Next-generation graph learning benchmarking

Authors: Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris
Subjects: cs.LG, cs.AI, cs.NE, stat.ML
Abstract URL: https://arxiv.org/abs/2512.04475
Pdf URL: https://arxiv.org/pdf/2512.04475
Copy Paste: [[2512.04475]] GraphBench: Next-generation graph learning benchmarking(https://arxiv.org/abs/2512.04475)
Keywords: generation, generative
Abstract: Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See this http URL for further details.

Title: DeRA: Decoupled Representation Alignment for Video Tokenization

Authors: Pengbo Guo, Junke Wang, Zhen Xing, Chengxu Liu, Daoguo Dong, Xueming Qian, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04483
Pdf URL: https://arxiv.org/pdf/2512.04483
Copy Paste: [[2512.04483]] DeRA: Decoupled Representation Alignment for Video Tokenization(https://arxiv.org/abs/2512.04483)
Keywords: generation
Abstract: This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
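Editor's sketch for the DeRA entry above: the abstract describes SACP as reformulating gradients by suppressing the components along conflicting directions, without giving the rule. The snippet below shows a generic symmetric gradient projection in the style of PCGrad: when two gradients have a negative dot product, each one has its component along the other removed. This is a swapped-in, well-known technique used for illustration, not the authors' SACP module, and the function names and toy gradients are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_conflict(grad, other):
    """Drop the component of `grad` that points against `other`.

    If the two gradients do not conflict (non-negative dot product),
    `grad` is returned unchanged.
    """
    d = dot(grad, other)
    if d >= 0:
        return list(grad)
    scale = d / dot(other, other)  # `other` is non-zero whenever d < 0
    return [g - scale * o for g, o in zip(grad, other)]


def symmetric_projection(grad_a, grad_b):
    """Apply the projection in both directions, using the original gradients."""
    return remove_conflict(grad_a, grad_b), remove_conflict(grad_b, grad_a)


# Hypothetical gradients from an appearance loss and a motion loss.
g_appearance = [1.0, 0.0]
g_motion = [-1.0, 1.0]
p_appearance, p_motion = symmetric_projection(g_appearance, g_motion)
print(p_appearance, p_motion)
```

After projection, neither gradient has a component that opposes the other original gradient, so a step along one no longer directly undoes progress on the other.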

Title: Not All Birds Look The Same: Identity-Preserving Generation For Birds

Authors: Aaron Sun, Oindrila Saha, Subhransu Maji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04485
Pdf URL: https://arxiv.org/pdf/2512.04485
Copy Paste: [[2512.04485]] Not All Birds Look The Same: Identity-Preserving Generation For Birds(https://arxiv.org/abs/2512.04485)
Keywords: generation
Abstract: Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.

Title: Controllable Long-term Motion Generation with Extended Joint Targets

Authors: Eunjong Lee, Eunhee Kim, Sanghoon Hong, Eunho Jung, Jihoon Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04487
Pdf URL: https://arxiv.org/pdf/2512.04487
Copy Paste: [[2512.04487]] Controllable Long-term Motion Generation with Extended Joint Targets(https://arxiv.org/abs/2512.04487)
Keywords: generation
Abstract: Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.

Title: Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

Authors: Yuduo Jin, Brandon Haworth
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.04499
Pdf URL: https://arxiv.org/pdf/2512.04499
Copy Paste: [[2512.04499]] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model(https://arxiv.org/abs/2512.04499)
Keywords: generation, generative
Abstract: Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
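Editor's sketch for the motion-representation entry above: the abstract states that v is a weighted sum of motion data and noise. In the standard v-prediction parameterization for diffusion models, with a noisy sample x_t = alpha * x0 + sigma * eps and alpha^2 + sigma^2 = 1, the target is v = alpha * eps - sigma * x0, and the clean sample is recovered as x0 = alpha * x_t - sigma * v. The snippet checks that identity numerically. Whether the paper uses exactly these weights is an assumption, and the vectors and schedule angle are toy values.

```python
import math


def noisy_sample(x0, eps, alpha, sigma):
    """Forward process: x_t = alpha * x0 + sigma * eps."""
    return [alpha * x + sigma * e for x, e in zip(x0, eps)]


def v_target(x0, eps, alpha, sigma):
    """Standard v-prediction target: v = alpha * eps - sigma * x0."""
    return [alpha * e - sigma * x for x, e in zip(x0, eps)]


def recover_x0(xt, v, alpha, sigma):
    """Invert the parameterization: x0 = alpha * x_t - sigma * v (needs alpha^2 + sigma^2 = 1)."""
    return [alpha * a - sigma * b for a, b in zip(xt, v)]


# Toy motion features and noise; the angle picks a point on a variance-preserving schedule.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.2, -0.3]
alpha, sigma = math.cos(0.3), math.sin(0.3)

xt = noisy_sample(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)
x0_hat = recover_x0(xt, v, alpha, sigma)
print(x0_hat)
```

A network trained with a v loss regresses `v` from `xt`; the recovery step shows that a v prediction carries the same information as predicting the clean motion or the noise.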

Title: UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

Authors: Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04504
Pdf URL: https://arxiv.org/pdf/2512.04504
Copy Paste: [[2512.04504]] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers(https://arxiv.org/abs/2512.04504)
Keywords: generation
Abstract: Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K*6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at this https URL.
摘要：最近的图像扩散转换器实现了高保真度生成，但难以生成超出这些比例的图像，从而遭受内容重复和质量下降的困扰。在这项工作中，我们提出了 UltraImage，一个解决这两个问题的原则性框架。通过对位置嵌入的频率分析，我们发现重复是由主频率的周期性引起的，其周期与训练分辨率一致。我们引入递归主频率校正来将其限制在外推后的单个周期内。此外，我们发现质量下降源于注意力稀释，因此提出了熵引导的自适应注意力集中，它分配较高的焦点因子来增强对细节的局部注意力，并分配较低的焦点因子到全局注意力模式以保持结构一致性。实验表明，在三代场景中，UltraImage 在 Qwen-Image 和 Flux（大约 4K）上始终优于先前的方法，减少了重复并提高了视觉保真度。此外，UltraImage可以从1328p的训练分辨率生成高达6K*6K的图像，无需低分辨率指导，展现了其极致的外推能力。项目页面位于 \href{此 https URL}{此 https URL}。

Title: EgoLCD: Egocentric Video Generation with Long Context Diffusion

Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04515
Pdf URL: https://arxiv.org/pdf/2512.04515
Copy Paste: [[2512.04515]] EgoLCD: Egocentric Video Generation with Long Context Diffusion(https://arxiv.org/abs/2512.04515)
Keywords: generation, generative
Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: this https URL. Website: this https URL.
摘要：生成长而连贯的以自我为中心的视频很困难，因为手部物体交互和程序任务需要可靠的长期记忆。现有的自回归模型会受到内容漂移的影响，其中对象身份和场景语义会随着时间的推移而退化。为了应对这一挑战，我们引入了 EgoLCD，这是一种用于以自我为中心的长上下文视频生成的端到端框架，它将长视频合成视为高效稳定的内存管理问题。 EgoLCD 将用于稳定全局上下文的长期稀疏 KV 缓存与基于注意力的短期记忆相结合，并通过 LoRA 进行扩展以进行本地适应。记忆调节损失强制执行一致的内存使用，结构化叙事提示提供明确的时间指导。 EgoVid-5M 基准的大量实验表明，EgoLCD 在感知质量和时间一致性方面都实现了最先进的性能，有效地减轻了生成性遗忘，并代表着朝着构建可扩展的人工智能世界模型迈出的重要一步。代码：此 https URL。网站：此 https URL。

Title: VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Authors: Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04519
Pdf URL: https://arxiv.org/pdf/2512.04519
Copy Paste: [[2512.04519]] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory(https://arxiv.org/abs/2512.04519)
Keywords: generation
Abstract: Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
摘要：自回归 (AR) 扩散通过因果地生成帧来实现流式传输、交互式长视频生成，但由于累积的错误、运动漂移和内容重复，保持分钟尺度范围内的连贯性仍然具有挑战性。我们从记忆的角度来处理这个问题，将视频合成视为一个需要协调的短期和长期上下文的循环动态过程。我们提出了 VideoSSM，一种将 AR 扩散与混合状态空间存储器相结合的长视频模型。状态空间模型（SSM）充当整个序列中场景动态的不断发展的全局记忆，而上下文窗口则为运动线索和精细细节提供本地记忆。这种混合设计保持了全局一致性，没有冻结、重复的模式，支持提示自适应交互，并随序列长度在线性时间内扩展。短程和长程基准测试的实验证明了自回归视频生成器具有最先进的时间一致性和运动稳定性，尤其是在分钟尺度范围内，从而实现了内容多样性和基于交互式提示的控制，从而为长视频生成建立了可扩展的、内存感知的框架。

Title: X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Authors: Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04537
Pdf URL: https://arxiv.org/pdf/2512.04537
Copy Paste: [[2512.04537]] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale(https://arxiv.org/abs/2512.04537)
Keywords: generative
Abstract: The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos and cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
摘要：嵌入式人工智能的进步释放了智能人形机器人的巨大潜力。然而，视觉-语言-动作（VLA）模型和世界模型的进展都因大规模、多样化训练数据的稀缺而受到严重阻碍。一个有前途的解决方案是“机器人化”网络规模的人类视频，这已被证明对于政策培训有效。然而，这些解决方案主要是将机器人手臂“叠加”到以自我为中心的视频上，无法处理第三人称视频中复杂的全身运动和场景遮挡，使得它们不适合将人类机器人化。为了弥补这一差距，我们引入了 X-Humanoid，这是一种生成视频编辑方法，可将强大的 Wan 2.2 模型改编成视频到视频的结构，并针对人机翻译任务对其进行微调。这种微调需要配对的人形视频，因此我们设计了一个可扩展的数据创建管道，使用虚幻引擎将社区资产转化为 17 多个小时的配对合成视频。然后，我们将经过训练的模型应用于 60 小时的 Ego-Exo4D 视频，生成并发布了包含超过 360 万个“机器人化”人形视频帧的新大规模数据集。定量分析和用户研究证实了我们的方法相对于现有基线的优越性：69% 的用户认为其运动一致性最佳，62.1% 的用户认为其实施正确性最佳。

Title: VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Authors: Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04540
Pdf URL: https://arxiv.org/pdf/2512.04540
Copy Paste: [[2512.04540]] VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management(https://arxiv.org/abs/2512.04540)
Keywords: generation
Abstract: Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules. Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
摘要：超长视频理解仍然是一个开放的挑战，因为现有的视觉语言模型（VLM）由于有限的上下文长度和低效的长期记忆保留而在此类内容上表现不佳。为了解决这个问题，最近的工作试图构建外部知识库和相应的检索增强生成（RAG）系统，但这些会产生巨大的存储和计算开销。在本文中，我们提出了 VideoMem，这是一种新颖的框架，它率先通过自适应内存管理将长视频理解建模为顺序生成任务。具体来说，VideoMem 动态更新全局内存缓冲区，该缓冲区自适应地保留关键信息，同时丢弃视频时间线上的冗余内容。为了有效地训练 VLM 来完成此类长期任务，VideoMem 集成了渐进式分组相对策略优化 (PRPO) 算法，该算法配备了两个核心模块：渐进式状态传播 (PSP) 自适应地保留有效的当前状态，将它们传播到下一个 rollout 步骤，并逐渐缩小模型探索空间。时间级联奖励（TCR）进一步缓解奖励稀疏性，提高样本利用率并加速收敛。大量实验表明，VideoMem 在超长视频理解任务的各种基准测试中显着优于现有开源模型。

Title: On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

Authors: Yue Yu, Qiwei Di, Quanquan Gu, Dongruo Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04558
Pdf URL: https://arxiv.org/pdf/2512.04558
Copy Paste: [[2512.04558]] On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference(https://arxiv.org/abs/2512.04558)
Keywords: generation
Abstract: Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-$n$ (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and observe consistent improvements over widely used approaches, demonstrating the practical effectiveness of our framework.
摘要：测试时计算 (TTC) 已成为增强大型语言模型 (LLM) 的日益重要的范例。尽管最佳 $n$ (BoN) 抽样和顺序修正等方法在实证上取得了成功，但它们的基本局限性仍不清楚。我们通过分析混合参考政策模型并证明标准 BoN 本质上是次优的来解决这一差距。为了更接近最佳边界，我们研究了奖励过滤顺序推理，这是一个简单的过程，有选择地仅将高奖励世代纳入上下文中。这种机制将计算集中在优秀的候选策略上，并抑制较差的候选策略。在理论方面，我们表明，奖励过滤的顺序推理比标准 TTC 范式产生严格更强的保证。在实证方面，我们在不同的基准上评估了这种推理策略，并观察到广泛使用的方法的一致改进，证明了我们框架的实际有效性。
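
Sketch (not from the paper): the contrast between best-of-n sampling and reward-filtered sequential inference described in the abstract can be illustrated with a toy loop. The names generate, reward, threshold, and make_toy_generator are hypothetical stand-ins; the paper's actual procedure, filtering rule, and guarantees are not reproduced here.

```python
def best_of_n(generate, reward, prompt, n):
    """Draw n independent candidates (empty context) and keep the best."""
    candidates = [generate(prompt, []) for _ in range(n)]
    return max(candidates, key=reward)

def reward_filtered_sequential(generate, reward, prompt, n, threshold):
    """Generate sequentially; only high-reward candidates enter the context."""
    context = []
    best, best_reward = None, float("-inf")
    for _ in range(n):
        candidate = generate(prompt, context)
        r = reward(candidate)
        if r > best_reward:
            best, best_reward = candidate, r
        if r >= threshold:  # the filter: low-reward generations are dropped
            context.append(candidate)
    return best, context

def make_toy_generator(draws):
    """Toy model: each call takes the next fixed draw, but never does worse
    than the best attempt currently kept in the context."""
    it = iter(draws)
    def generate(prompt, context):
        return max([next(it)] + context)
    return generate

# Toy run: candidates are integers and the reward is the integer itself.
best, kept = reward_filtered_sequential(
    make_toy_generator([3, 1, 5, 2]), float, "prompt", n=4, threshold=4
)
bon = best_of_n(make_toy_generator([3, 1, 5, 2]), float, "prompt", n=4)
```

In this toy run only the candidates scoring at least 4 are kept, so later generations condition on strong attempts alone.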

Title: LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Authors: Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04562
Pdf URL: https://arxiv.org/pdf/2512.04562
Copy Paste: [[2512.04562]] LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models(https://arxiv.org/abs/2512.04562)
Keywords: generative
Abstract: Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.
摘要：生成机器学习 (ML) 模型有望通过无机晶体的逆向设计加速材料发现，从而实现对化学空间的前所未有的探索。然而，标准化评估框架的缺乏使得有意义地评估、比较和进一步开发这些机器学习模型变得困难。在这项工作中，我们引入了 LeMat-GenBench，这是晶体材料生成模型的统一基准，并由一组旨在更好地为模型开发和下游应用提供信息的评估指标提供支持。我们在 Hugging Face 上发布了开源评估套件和公共排行榜，并对 12 个最新生成模型进行了基准测试。结果表明，稳定性的提高平均会导致新颖性和多样性的下降，没有一个模型在所有维度上都表现出色。总而言之，LeMat-GenBench 为公平模型比较奠定了可重复和可扩展的基础，旨在指导开发更可靠、以发现为导向的晶体材料生成模型。

Title: COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Authors: Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, Tingwen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04563
Pdf URL: https://arxiv.org/pdf/2512.04563
Copy Paste: [[2512.04563]] COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence(https://arxiv.org/abs/2512.04563)
Keywords: generation
Abstract: Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning capabilities. COOPER achieves an average 6.91% improvement in spatial reasoning while maintaining general performance. Moreover, even a variant trained only for auxiliary modality generation attains a 7.92% gain on distance and size estimation, suggesting that learning to generate auxiliary modalities helps internalize spatial knowledge and strengthen spatial understanding.
摘要：视觉空间推理对于使多模态大型语言模型 (MLLM) 理解对象属性和空间关系至关重要，但当前模型仍难以进行 3D 感知推理。现有方法通常通过使用深度和分割等辅助模式增强 RGB 输入来增强感知，或者通过空间 VQA 数据集训练和应用强化学习来增强推理，从而单独处理这两个方面。在这项工作中，我们研究了统一的 MLLM 是否可以发展增强空间感知的内在能力，并通过自适应交错推理实现更强的空间智能。我们提出 \textbf{COOPER}，一个统一的 MLLM，利用深度和分割作为辅助模态，并分两个阶段进行训练以获得辅助模态生成和自适应、交错推理能力。 COOPER 在空间推理方面实现了平均 \textbf{6.91\%} 改进，同时保持了一般性能。此外，即使是仅针对辅助模态生成进行训练的变体，在距离和尺寸估计方面也获得了 \textbf{7.92\%} 增益，这表明学习生成辅助模态有助于内化空间知识并加强空间理解。

Title: Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.04601
Pdf URL: https://arxiv.org/pdf/2512.04601
Copy Paste: [[2512.04601]] Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space(https://arxiv.org/abs/2512.04601)
Keywords: generative
Abstract: Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
摘要：大型语言模型（LLM）代理——在长期范围内与环境动态交互的LLM——已经成为一个日益重要的研究领域，可以实现涉及工具使用、网页浏览和与人对话的复杂任务的自动化。在缺乏专家演示的情况下，训练 LLM 代理依赖于策略梯度方法，该方法根据（通常稀疏的）奖励函数优化 LLM 策略。然而，在奖励稀疏的长期任务中，从轨迹级奖励中学习可能会产生噪音，导致训练不稳定且样本复杂度较高。此外，政策改进取决于通过探索发现更好的行动，当行动位于自然语言空间时，这可能会很困难。在本文中，我们提出了自然语言演员批评家（NLAC），这是一种新颖的演员批评家算法，该算法使用生成自然语言而不是标量值的生成式 LLM 批评家来训练 LLM 策略。这种方法利用了法学硕士的固有优势，提供更丰富、更可行的培训信号；特别是，在具有大型、开放式行动空间的任务中，自然语言解释为什么行动不是最优的对于法学硕士政策来说非常有用，可以推理如何改进其行动，而无需依赖随机探索。此外，我们的方法可以在没有策略梯度的情况下进行离策略训练，为现有的在策略方法提供更高效、更稳定的替代方案。我们展示了推理、网页浏览和工具使用与对话任务相结合的结果，证明 NLAC 在超越现有训练方法方面表现出希望，并为 LLM 代理提供了更具可扩展性和稳定的训练范例。

Title: Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Authors: Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04677
Pdf URL: https://arxiv.org/pdf/2512.04677
Copy Paste: [[2512.04677]] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length(https://arxiv.org/abs/2512.04677)
Keywords: generation
Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
摘要：现有的基于扩散的视频生成方法从根本上受到顺序计算和长期不一致的限制，限制了它们在实时、流媒体音频驱动的化身合成中的实际采用。我们推出了 Live Avatar，这是一种算法系统共同设计的框架，可使用 140 亿参数扩散模型实现高效、高保真和无限长度的头像生成。我们的方法引入了时间步长管道并行（TPP），这是一种分布式推理范例，可以跨多个 GPU 管道化去噪步骤，有效打破自回归瓶颈并确保稳定、低延迟的实时流。为了进一步增强时间一致性并减轻身份漂移和颜色伪影，我们提出了滚动接收器帧机制（RSFM），该机制通过使用缓存的参考图像动态重新校准外观来保持序列保真度。此外，我们利用自强迫分布匹配蒸馏来促进大型模型的因果、可流式适应，而不会牺牲视觉质量。 Live Avatar 展示了最先进的性能，在 5 个 H800 GPU 上达到 20 FPS 端到端生成速度，据我们所知，它是第一个实现这种规模的实用、实时、高保真头像生成的产品。我们的工作建立了在工业长格式视频合成应用中部署先进扩散模型的新范例。

Title: Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04678
Pdf URL: https://arxiv.org/pdf/2512.04678
Copy Paste: [[2512.04678]] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation(https://arxiv.org/abs/2512.04678)
Keywords: generation
Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
摘要：高效的流媒体视频生成对于模拟交互式和动态世界至关重要。现有方法利用滑动窗口注意力提炼出少步视频扩散模型，使用初始帧作为接收器标记来保持注意力性能并减少错误累积。然而，视频帧变得过度依赖这些静态标记，导致复制初始帧并减弱运动动态。为了解决这个问题，我们引入了奖励强制，这是一个具有两个关键设计的新颖框架。首先，我们提出 EMA-Sink，它维护从初始帧初始化的固定大小的令牌，并在退出滑动窗口时通过指数移动平均融合被逐出的令牌来不断更新。无需额外的计算成本，EMA-Sink 代币即可捕获长期上下文和近期动态，防止初始帧复制，同时保持长期一致性。其次，为了更好地从教师模型中提取运动动力学，我们提出了一种新颖的奖励分布匹配蒸馏（Re-DMD）。普通分布匹配平等对待每个训练样本，限制了模型优先考虑动态内容的能力。相反，Re-DMD 通过优先考虑视觉语言模型评价的具有更大动态性的样本，将模型的输出分布偏向高奖励区域。 Re-DMD 显着提高运动质量，同时保持数据保真度。我们进行了定量和定性实验，以表明奖励强制在标准基准测试中实现了最先进的性能，同时能够在单个 H100 GPU 上以 23.1 FPS 生成高质量流媒体视频。
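
Sketch (not from the paper): the EMA-Sink idea in the abstract, a fixed-size sink initialized from the initial frame and updated by fusing tokens as they leave the sliding window, can be illustrated as follows. The class name, the single-vector sink, and the decay value are simplifying assumptions; the paper's actual token layout and update schedule may differ.

```python
from collections import deque

class EmaSinkCache:
    """Sliding-window cache whose sink absorbs evicted tokens via an EMA."""

    def __init__(self, first_token, window_size, decay=0.9):
        self.sink = list(first_token)  # initialized from the initial frame
        self.window = deque()          # most recent tokens, oldest first
        self.window_size = window_size
        self.decay = decay             # weight kept on the old sink value

    def append(self, token):
        self.window.append(list(token))
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            # Fuse the evicted token into the sink instead of discarding it.
            self.sink = [
                self.decay * s + (1.0 - self.decay) * e
                for s, e in zip(self.sink, evicted)
            ]

    def context(self):
        """Tokens visible to attention: the sink followed by the window."""
        return [self.sink] + list(self.window)

# Toy run with 1-dimensional tokens.
cache = EmaSinkCache([0.0], window_size=2, decay=0.5)
for value in (2.0, 4.0, 6.0, 8.0):
    cache.append([value])
```

Because the sink keeps moving toward recently evicted content, the context size stays constant while the sink is no longer a static copy of the first frame.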

Title: TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation

Authors: Baris Yilmaz, Bevan Deniz Cilgin, Erdem Akagündüz, Salih Tileylioglu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04694
Pdf URL: https://arxiv.org/pdf/2512.04694
Copy Paste: [[2512.04694]] TimesNet-Gen: Deep Learning-based Site Specific Strong Motion Generation(https://arxiv.org/abs/2512.04694)
Keywords: generation
Abstract: Effective earthquake risk reduction relies on accurate site-specific evaluations. This requires models that can represent the influence of local site conditions on ground motion characteristics. In this context, data-driven approaches that learn site-controlled signatures from recorded ground motions offer a promising direction. We address strong ground motion generation from time-domain accelerometer records and introduce TimesNet-Gen, a time-domain conditional generator. The approach uses a station-specific latent bottleneck. We evaluate generation by comparing HVSR curves and fundamental site-frequency $f_0$ distributions between real and generated records per station, and summarize station specificity with a score based on the $f_0$ distribution confusion matrices. TimesNet-Gen achieves strong station-wise alignment and compares favorably with a spectrogram-based conditional VAE baseline for site-specific strong motion synthesis. Our code is available via this https URL.
摘要：有效降低地震风险依赖于准确的特定地点评估。这需要能够代表当地场地条件对地震动特征影响的模型。在这种情况下，从记录的地面运动中学习站点控制特征的数据驱动方法提供了一个有希望的方向。我们解决了从时域加速度计记录生成强地面运动的问题，并介绍了 TimesNet-Gen，一种时域条件生成器。该方法使用特定于站的潜在瓶颈。我们通过比较 HVSR 曲线和每个站点的真实记录和生成记录之间的基本站点频率 $f_0$ 分布来评估生成，并根据 $f_0$ 分布混淆矩阵总结站点特异性。 TimesNet-Gen 实现了强大的逐站对齐，并且与用于特定位置强运动合成的基于频谱图的条件 VAE 基线相比具有优势。我们的代码可通过此 https URL 获取。

Title: OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution

Authors: Xinning Chai, Zhengxue Cheng, Yuhong Zhang, Hengsheng Zhang, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04699
Pdf URL: https://arxiv.org/pdf/2512.04699
Copy Paste: [[2512.04699]] OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution(https://arxiv.org/abs/2512.04699)
Keywords: super-resolution, generation
Abstract: Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at this https URL.
摘要：任意尺度超分辨率 (ASSR) 克服了传统超分辨率 (SR) 方法仅在固定尺度（例如 4 倍）下运行的限制，使单个模型能够处理任意放大倍数。大多数现有的 ASSR 方法依赖于隐式神经表示（INR），但其回归驱动的特征提取和聚合本质上限制了合成精细细节的能力，导致真实感较低。最近基于扩散的真实图像超分辨率 (Real-ISR) 模型利用强大的预训练扩散先验，并在 4 倍设置下显示出令人印象深刻的结果。我们观察到他们也可以实现 ASSR，因为扩散先验通过鼓励高真实感生成来隐式适应规模。然而，如果没有明确的尺度控制，扩散过程就无法针对不同的放大倍数进行适当调整，从而导致过度的幻觉或模糊的输出，尤其是在超高尺度下。为了解决这些问题，我们提出了 OmniScaleSR，一种基于扩散的现实任意尺度 SR 框架，旨在实现高保真度和高真实感。我们引入了显式的、扩散原生的尺度控制机制，该机制与隐式尺度适应协同工作，从而实现了扩散过程的尺度感知和内容感知调制。此外，我们还结合了多域保真度增强设计，以进一步提高重建精度。对双三次退化基准和真实世界数据集的大量实验表明，OmniScaleSR 在保真度和感知真实感方面都超越了最先进的方法，并且在大放大倍数下具有特别强大的性能。代码将在此 https URL 发布。

Title: Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Authors: Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04728
Pdf URL: https://arxiv.org/pdf/2512.04728
Copy Paste: [[2512.04728]] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild(https://arxiv.org/abs/2512.04728)
Keywords: generative
Abstract: Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code is publicly available.
摘要：野外对话的生成心理分析面临两个基本挑战：（1）现有的视觉语言模型（VLM）无法解决发音情感歧义，即语音的视觉模式模仿情感表达；（2）由于缺乏能够评估视觉基础和推理深度的可验证的评估指标，进展受到阻碍。我们提出了一个完整的生态系统来应对这些双重挑战。首先，我们引入了用于解缠结的多级洞察网络（MIND），这是一种新颖的分层视觉编码器，它引入了状态判断模块，以根据时间特征方差通过算法抑制模糊的嘴唇特征，从而实现显式的视觉解缠结。其次，我们构建了 ConvoInsight-DB，这是一个新的大规模数据集，具有针对微表情和深度心理推理的专家注释。第三，第三，我们设计了心理推理洞察力评级指标（PRISM），这是一个自动化维度框架，使用专家指导的法学硕士来衡量大型心理视觉模型的多维度表现。在我们的 PRISM 基准测试中，MIND 显着优于所有基准，在微表情检测方面比之前的 SOTA 提高了 86.95%。消融研究证实，我们的状态判断解开模块是实现这一性能飞跃的最关键组件。我们的代码已经打开了。

Title: RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting

Authors: Siqi Wang, Hailong Yang, Junjie Zhu, Xuezhu Wang, Yufan Xu, Depei Qian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04752
Pdf URL: https://arxiv.org/pdf/2512.04752
Copy Paste: [[2512.04752]] RLHFSpec: Breaking the Efficiency Bottleneck in RLHF Training via Adaptive Drafting(https://arxiv.org/abs/2512.04752)
Keywords: generation
Abstract: Reinforcement Learning from Human Feedback (RLHF) is an important fine-tuning technique for large language models (LLMs) and comprises three stages: generation, inference, and training. The generation stage generates samples that are then used to infer learnable experiences for training. We observe that the generation stage is the bottleneck of the entire execution process and consider it a key point for optimization. Specifically, we make the first attempt to integrate speculative decoding into the RLHF generation stage and propose RLHFSpec, an RLHF system that accelerates generation execution with adaptive speculative decoding and sample reallocation. To fully exploit the performance potential provided by speculative decoding, especially when dealing with the dynamic workload of the generation stage, RLHFSpec proposes a workload-aware drafting strategy selection mechanism, which selects the near-optimal strategy by jointly considering the verification cost and the number of accepted tokens. Moreover, RLHFSpec proposes sample reallocation to fully utilize GPU resources and optimizes it with an efficient sample migration mechanism. Experimental results show that RLHFSpec achieves higher throughput in the generation stage than state-of-the-art systems. Moreover, by effectively alleviating the generation bottleneck, RLHFSpec also delivers a significant speedup across the entire RLHF execution.
摘要：人类反馈强化学习 (RLHF) 是大型语言模型 (LLM) 的一项重要微调技术，包括三个阶段：生成、推理和训练。生成阶段生成样本，然后用于推断可学习的训练经验。我们观察到生成阶段是整个执行过程的瓶颈，并将其视为优化的关键点。具体来说，我们实现了将推测解码集成到 RLHF 生成阶段的首次尝试，并提出了 RLHFSpec，这是一种通过自适应推测解码和样本重新分配来加速生成执行的 RLHF 系统。为了充分利用推测解码提供的性能潜力，特别是处理生成阶段的动态工作负载，RLHFSpec 提出了一种工作负载感知的起草策略选择机制，该机制通过联合考虑验证成本和接受的令牌数量来选择接近最优的策略。此外，RLHFSpec还提出样本重新分配以充分利用GPU资源，并通过高效的样本迁移机制进行优化。实验结果表明，与最先进的作品相比，RLHFSpec 在生成阶段可以实现更高的吞吐量。此外，由于生成瓶颈的有效缓解，RLHFSpec在整个RLHF执行过程中也表现出了显着的性能加速。
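The trade-off between verification cost and accepted tokens described in the abstract above can be illustrated with a toy cost model. This is a minimal sketch, not RLHFSpec's actual mechanism: the `DraftStrategy` fields, the cost constants, the independent-acceptance assumption, and the tokens-per-cost scoring rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DraftStrategy:
    # Hypothetical description of one drafting configuration.
    name: str
    draft_len: int            # tokens proposed per speculative step
    accept_prob: float        # assumed per-token acceptance probability
    draft_cost: float         # assumed cost of drafting one token
    verify_base: float        # assumed fixed cost of one verification pass
    verify_per_token: float   # assumed verification cost per token per sample

def expected_tokens(s):
    # Expected tokens emitted per step if each draft token is accepted
    # independently with probability p and the first rejection ends the run.
    # The leading 1.0 is the token the target model yields on verification.
    return 1.0 + sum(s.accept_prob ** k for k in range(1, s.draft_len + 1))

def step_cost(s, batch_size):
    # Verification scores draft_len + 1 positions for every sample in the batch.
    verify = s.verify_base + s.verify_per_token * batch_size * (s.draft_len + 1)
    return s.draft_len * s.draft_cost + verify

def select_strategy(strategies, batch_size):
    # Pick the strategy with the best expected tokens per unit cost.
    return max(strategies, key=lambda s: expected_tokens(s) / step_cost(s, batch_size))

strategies = [
    DraftStrategy("short", 2, 0.8, 0.1, 1.0, 0.01),
    DraftStrategy("long", 6, 0.8, 0.1, 1.0, 0.01),
]
small_batch_choice = select_strategy(strategies, batch_size=1).name
large_batch_choice = select_strategy(strategies, batch_size=64).name
```

With these made-up numbers the longer draft wins at a small batch and the shorter draft wins at a large one, which is the kind of workload dependence the abstract refers to.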

Title: Order Matters: 3D Shape Generation from Sequential VR Sketches

Authors: Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04761
Pdf URL: https://arxiv.org/pdf/2512.04761
Copy Paste: [[2512.04761]] Order Matters: 3D Shape Generation from Sequential VR Sketches(https://arxiv.org/abs/2512.04761)
Keywords: generation
Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at this https URL.
摘要：VR 草图允许用户直接在 3D 中探索和迭代想法，为传统 CAD 工具提供更快、更直观的替代方案。然而，现有的草图到形状模型忽略了笔画的时间顺序，丢弃了有关结构和设计意图的关键线索。我们介绍 VRSketch2Shape，这是第一个用于从连续 VR 草图生成 3D 形状的框架和多类别数据集。我们的贡献有三个方面：(i) 一个从任意形状生成连续 VR 草图的自动化管道，(ii) 四个类别的超过 20k 合成和 900 个手绘草图形状对的数据集，以及 (iii) 一个与基于扩散的 3D 生成器相结合的顺序感知草图编码器。我们的方法比以前的工作产生了更高的几何保真度，在最少的监督下有效地从合成草图推广到真实草图，甚至在部分草图上也表现良好。所有数据和模型都将在此 https URL 开源发布。

Title: MemLoRA: Distilling Expert Adapters for On-Device Memory Systems

Authors: Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.04763
Pdf URL: https://arxiv.org/pdf/2512.04763
Copy Paste: [[2512.04763]] MemLoRA: Distilling Expert Adapters for On-Device Memory Systems(https://arxiv.org/abs/2512.04763)
Keywords: generation
Abstract: Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations instead, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On this, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
摘要：记忆增强大语言模型（LLM）通过存储相关记忆并将其合并为上下文，在长时间对话中表现出了显着的一致性。这种基于记忆的个性化也是设备端场景的关键，它允许用户保持对话和数据的私密性。然而，记忆增强系统通常依赖于本地设备部署成本太高的 LLM。尽管小语言模型 (SLM) 比 LLM 更适合设备端推理，但它们无法实现足够的性能。此外，这些基于 LLM 的系统缺乏原生视觉功能，限制了它们在多模态环境中的适用性。在本文中，我们介绍了（i）MemLoRA，一种新颖的记忆系统，通过为 SLM 配备专用记忆适配器来实现本地部署，以及（ii）其视觉扩展 MemLoRA-V，它将小型视觉语言模型（SVLM）集成到记忆系统中，从而实现原生视觉理解。遵循知识蒸馏原则，每个适配器针对特定的记忆操作（知识提取、记忆更新和记忆增强生成）分别进行训练。小型模型配备记忆适配器后，可实现准确的设备端记忆操作，而无需依赖云。在纯文本操作中，MemLoRA 的性能优于规模大 10 倍的基准模型（例如 Gemma2-27B），并在 LoCoMo 基准上实现了与规模大 60 倍的模型（例如 GPT-OSS-120B）相当的性能。为了评估视觉理解操作，我们通过需要直接视觉推理的具有挑战性的视觉问答任务来扩展 LoCoMo。在这一点上，我们的 VLM 集成 MemLoRA-V 比基于字幕的方法显示出巨大的改进（81.3 比 23.7 准确率），同时在基于文本的任务中保持强劲的性能，证明了我们的方法在多模态环境中的有效性。

Title: PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04784
Pdf URL: https://arxiv.org/pdf/2512.04784
Copy Paste: [[2512.04784]] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling(https://arxiv.org/abs/2512.04784)
Keywords: generation, generative
Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at this https URL.
摘要：一致的图像生成需要忠实地保留多个图像之间的身份、风格和逻辑一致性，这对于讲故事和角色设计等应用程序至关重要。由于缺乏捕获视觉一致性的大规模数据集以及人类感知偏好建模的复杂性，监督训练方法难以完成这项任务。在本文中，我们认为强化学习（RL）提供了一种有前途的替代方案，使模型能够以无数据的方式学习复杂且主观的视觉标准。为了实现这一目标，我们引入了 PaCo-RL，这是一个综合框架，它将专门的一致性奖励模型与高效的 RL 算法相结合。第一个组件 PaCo-Reward 是一个成对一致性评估器，在通过自动子图配对构建的大规模数据集上进行训练。它通过任务感知指令和 CoT 原因增强的生成式自回归评分机制来评估一致性。第二个组件 PaCo-GRPO 利用新颖的分辨率解耦优化策略来大幅降低 RL 成本，同时采用日志驯服的多奖励聚合机制来确保平衡和稳定的奖励优化。跨两个代表性子任务的大量实验表明，PaCo-Reward 显着提高了与人类视觉一致性感知的一致性，而 PaCo-GRPO 通过提高训练效率和稳定性实现了最先进的一致性性能。总之，这些结果凸显了 PaCo-RL 作为一致图像生成的实用且可扩展的解决方案的前景。项目页面可通过此 https URL 获取。

Title: LaFiTe: A Generative Latent Field for 3D Native Texturing

Authors: Chia-Hao Chen, Zi-Xin Zou, Yan-Pei Cao, Ze Yuan, Guan Luo, Xiaojuan Qi, Ding Liang, Song-Hai Zhang, Yuan-Chen Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04786
Pdf URL: https://arxiv.org/pdf/2512.04786
Copy Paste: [[2512.04786]] LaFiTe: A Generative Latent Field for 3D Native Texturing(https://arxiv.org/abs/2512.04786)
Keywords: super-resolution, generation, generative
Abstract: Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.
摘要：直接在 3D 表面上生成高保真、无缝纹理（我们称之为 3D 原生纹理）仍然是一个基本的开放挑战，有可能克服基于 UV 和多视图投影方法的长期限制。然而，现有的本机方法由于缺乏强大且通用的潜在表示而受到限制，这严重限制了其生成的纹理的保真度和通用性。我们认为这种代表性差距是进一步进步的主要障碍。我们引入了 LaFiTe，这是一个通过学习生成纹理作为 3D 生成稀疏潜在色域来解决这一挑战的框架。 LaFiTe 的核心采用变分自动编码器 (VAE) 将复杂的表面外观编码到稀疏、结构化的潜在空间中，随后将其解码为连续的色域。通过有效地从网格拓扑和 UV 参数化中分离纹理外观，这种表示实现了前所未有的保真度，在重建中超过了最先进的方法 >10 dB PSNR。基于这种强大的表征，条件整流流模型合成了跨不同风格和几何形状的高质量、连贯的纹理。大量实验表明，LaFiTe 不仅为 3D 原生纹理树立了新基准，而且还实现了材料合成和纹理超分辨率等灵活的下游应用，为下一代 3D 内容创建工作流程铺平了道路。

Title: EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Authors: Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04810
Pdf URL: https://arxiv.org/pdf/2512.04810
Copy Paste: [[2512.04810]] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture(https://arxiv.org/abs/2512.04810)
Keywords: generation
Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation between visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
摘要：我们提出了 EMMA，一种用于多模态理解、生成和编辑的高效且统一的架构。具体来说，EMMA 主要包括 1) 具有 32 倍压缩比的高效自动编码器，可显着减少生成所需的令牌数量。通过对图像应用相同的压缩比，这还确保了理解和生成任务之间的训练平衡。 2）视觉理解和生成标记之间按通道级联而不是按标记级联，这进一步减少了统一架构中的视觉标记。 3）共享和解耦的网络，可以实现跨任务的相互改进，同时满足特定于任务的建模要求。 4）视觉理解编码器采用专家混合机制，通过少量参数的增加大幅提高感知能力。大量实验表明，EMMA-4B 在效率和性能方面都可以显着优于最先进的统一多模态方法（例如 BAGEL-7B），同时与最近的多模态理解和生成专家（例如 Qwen3-VL 和 Qwen-Image）相比也取得了有竞争力的结果。我们相信EMMA为统一多模式架构的未来发展奠定了坚实的基础。

Title: LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

Authors: Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04821
Pdf URL: https://arxiv.org/pdf/2512.04821
Copy Paste: [[2512.04821]] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation(https://arxiv.org/abs/2512.04821)
Keywords: generative
Abstract: Generative models have achieved remarkable progress with the emergence of flow matching (FM). FM has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.
摘要：随着流匹配（FM）的出现，生成模型取得了显着的进步。它展示了强大的生成能力，并作为一种能够学习精确数据密度的无模拟、基于流的框架而引起了广泛关注。受这些进步的推动，我们提出了 LatentFM，这是一种在潜在空间中运行的基于流的模型，用于医学图像分割。为了对数据分布进行建模，我们首先设计了两个变分自动编码器（VAE），将医学图像及其相应的掩模编码到低维潜在空间中。然后，我们根据输入图像估计引导流动的条件速度场。通过对多个潜在表示进行采样，我们的方法合成了不同的分割输出，其像素级方差可靠地捕获了底层数据分布，从而实现了高度准确和不确定性感知的预测。此外，我们生成量化模型确定性的置信图，为临床医生提供更丰富的信息以进行更深入的分析。我们在 ISIC-2018 和 CVC-Clinic 这两个数据集上进行实验，并将我们的方法与之前的几个基线（包括确定性和生成方法模型）进行比较。通过综合评估，定性和定量结果表明，我们的方法实现了卓越的分割精度，同时在潜在空间中保持高效。

Title: FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Authors: Shijie Chen, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04830
Pdf URL: https://arxiv.org/pdf/2512.04830
Copy Paste: [[2512.04830]] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis(https://arxiv.org/abs/2512.04830)
Keywords: generation, generative
Abstract: Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.
摘要：自动驾驶的闭环模拟和可扩展预训练需要合成自由视点驾驶场景。然而，现有的数据集和生成管道很少提供一致的偏离轨迹观察，限制了大规模评估和训练。虽然最近的生成模型表现出很强的视觉真实感，但它们很难在没有每个场景优化的情况下共同实现插值一致性和外推真实感。为了解决这个问题，我们提出了 FreeGen，一种用于自由视点驾驶场景合成的前馈重建生成协同训练框架。重建模型提供稳定的几何表示以确保插值一致性，而生成模型执行几何感知增强以提高看不见的视点的真实感。通过协同训练，生成先验被提炼到重建模型中以改善偏离轨迹渲染，而细化的几何形状反过来为生成提供了更强的结构指导。实验表明，FreeGen 在自由视点驾驶场景合成方面实现了最先进的性能。

Title: Tokenizing Buildings: A Transformer for Layout Synthesis

Authors: Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04832
Pdf URL: https://arxiv.org/pdf/2512.04832
Copy Paste: [[2512.04832]] Tokenizing Buildings: A Transformer for Layout Synthesis(https://arxiv.org/abs/2512.04832)
Keywords: generative
Abstract: We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
摘要：我们介绍小型建筑模型 (SBM)，这是一种基于 Transformer 的架构，用于建筑信息模型 (BIM) 场景中的布局合成。我们解决了如何通过将建筑元素的异构特征集统一到序列中来标记建筑物的问题，同时保留组合结构。此类特征集表示为捕获房间属性的稀疏属性特征矩阵。然后，我们设计一个统一的嵌入模块，用于学习分类和可能相关的连续特征组的联合表示。最后，我们以两种模式训练单个 Transformer 主干：产生高保真房间嵌入的仅编码器路径，以及用于房间实体自回归预测的编码器-解码器管道，称为数据驱动实体预测（DDEP）。检索和生成布局综合的实验表明，SBM 学习紧凑的房间嵌入，这些嵌入可以按类型和拓扑可靠地聚类，从而实现强大的语义检索。在 DDEP 模式下，SBM 生成功能健全的布局，减少碰撞和边界违规，并提高导航性。

Title: Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens

Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04857
Pdf URL: https://arxiv.org/pdf/2512.04857
Copy Paste: [[2512.04857]] Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens(https://arxiv.org/abs/2512.04857)
Keywords: generation
Abstract: Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient autoregressive (AR) image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
摘要：由于其可扩展性和通用性，自回归（AR）视觉生成已成为图像和多模态合成的强大范例。然而，由于在解码过程中需要缓存所有先前生成的视觉标记，现有的 AR 图像生成面临严重的内存瓶颈，导致存储需求高和吞吐量低。在本文中，我们介绍了 LineAR，这是一种新颖的、免训练的渐进式键值 (KV) 缓存压缩管道，用于自回归图像生成。通过充分利用视觉注意力的内在特征，LineAR 使用 2D 视图在行级别管理缓存，保留视觉依赖区域，同时在行间注意力的指导下逐步驱逐对后续行生成无害的信息较少的标记。 LineAR 仅利用几行缓存即可实现高效的自回归 (AR) 图像生成，从而节省内存并提高吞吐量，同时保持甚至提高生成质量。六种自回归图像生成模型（包括类条件和文本到图像生成）的广泛实验验证了其有效性和通用性。 LineAR 在 LlamaGen-XL 和 Janus-Pro-1B 上将 ImageNet FID 从 2.77 提高到 2.68，将 COCO FID 从 23.85 提高到 22.86，同时仅保留 1/6 KV 缓存。它还改进了 Lumina-mGPT-768 上的 DPG，仅具有 1/8 KV 缓存。此外，LineAR 还实现了显着的内存和吞吐量增益，包括在 LlamaGen-XL 上内存减少高达 67.61%，加速提高 7.57 倍，在 Janus-Pro-7B 上内存减少 39.66%，加速提高 5.62 倍。
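To make the idea of a line-level cache budget in the abstract above concrete, here is a simplified sketch. It is not the LineAR algorithm: it evicts whole lines rather than individual tokens, and the function name, the `keep_recent` parameter, and the per-line attention scores are hypothetical stand-ins for the inter-line attention guidance the abstract mentions.

```python
def evict_lines(cached_lines, line_scores, budget, keep_recent=1):
    """Return the line indices that stay in the KV cache.

    cached_lines: line indices currently cached, oldest first.
    line_scores:  hypothetical attention mass each cached line receives
                  from the most recently generated line.
    budget:       maximum number of lines allowed in the cache.
    keep_recent:  newest lines that are always preserved.
    """
    if not 1 <= keep_recent <= budget:
        raise ValueError("keep_recent must be between 1 and budget")
    if len(cached_lines) <= budget:
        return list(cached_lines)
    recent = cached_lines[-keep_recent:]
    older = cached_lines[:-keep_recent]
    # Keep the highest-scoring older lines that still fit in the budget.
    ranked = sorted(older, key=lambda i: line_scores[i], reverse=True)
    return sorted(ranked[: budget - keep_recent] + recent)

# Hypothetical state after generating line 4 of an image.
scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2, 4: 0.0}
kept_lines = evict_lines([0, 1, 2, 3, 4], scores, budget=3)
```

Calling this after each generated line would give a progressive scheme in the spirit of the abstract, with the cache never exceeding `budget` lines.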

Title: Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

Authors: Maria-Paola Forte, Nikos Athanasiou, Giulia Ballardini, Jan Ulrich Bartels, Katherine J. Kuchenbecker, Michael J. Black
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04862
Pdf URL: https://arxiv.org/pdf/2512.04862
Copy Paste: [[2512.04862]] Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing(https://arxiv.org/abs/2512.04862)
Keywords: generation
Abstract: Capturing accurate 3D human pose in the wild would provide valuable data for training pose estimation and motion generation methods. While video-based estimation approaches have become increasingly accurate, they often fail in common scenarios involving self-contact, such as a hand touching the face. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel framework that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization during measured self-contact: reprojection error and deviations from the input estimate are minimized while enforcing vertex proximity constraints. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture. Testing with three input pose estimators, we demonstrate an average of 11.7% improvement in reconstruction accuracy. We also present a miniature wearable bioimpedance sensor that enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation using BioTUCH. Code and data are available at this http URL
摘要：在野外捕获准确的 3D 人体姿势将为训练姿势估计和运动生成方法提供有价值的数据。虽然基于视频的估计方法变得越来越准确，但它们在涉及自我接触的常见场景中经常失败，例如手触摸脸部。相比之下，可穿戴生物阻抗传感可以廉价且不引人注目地测量真实的皮肤与皮肤接触。因此，我们提出了一种新颖的框架，将视觉姿势估计器与生物阻抗传感相结合，通过考虑自接触来捕获人的 3D 姿势。我们的方法 BioTUCH 使用现成的估计器初始化姿势，并在测量的自接触过程中引入接触感知姿势优化：在强制顶点邻近约束的同时，最小化重投影误差和与输入估计的偏差。我们使用同步 RGB 视频、生物阻抗测量和 3D 动作捕捉的新数据集来验证我们的方法。使用三个输入姿态估计器进行测试，我们证明重建精度平均提高了 11.7%。我们还推出了一种微型可穿戴生物阻抗传感器，能够有效地大规模收集接触感知训练数据，从而使用 BioTUCH 改进姿势估计和生成。代码和数据可在此 http URL 获取

Title: ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

Authors: Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Shao-Lun Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04904
Pdf URL: https://arxiv.org/pdf/2512.04904
Copy Paste: [[2512.04904]] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching(https://arxiv.org/abs/2512.04904)
Keywords: generation
Abstract: Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.
摘要：尽管最近取得了巨大进展，但由于训练和推理的差异，流匹配方法仍然存在暴露偏差。本文研究了流匹配中暴露偏差的根本原因，包括：（1）模型在训练过程中缺乏对有偏差输入的泛化，（2）早期去噪过程中捕获的低频内容不足，导致累积偏差。基于这些见解，我们提出了 ReflexFlow，这是对 Flow Matching 学习目标的简单而有效的反射性改进，可以动态纠正暴露偏差。 ReflexFlow 由两个组件组成：（1）抗漂移校正（ADR），它利用训练时间计划采样下重新设计的损失，反射性地调整有偏差输入的预测目标； (2)频率补偿(FC)，反映丢失的低频分量，并通过使用曝光偏差重新加权损失来补偿它们。 ReflexFlow 与模型无关，与所有流匹配框架兼容，并提高了跨数据集的生成质量。 CIFAR-10、CelebA-64 和 ImageNet-256 上的实验表明，ReflexFlow 在减轻曝光偏差方面优于先前的方法，在 CelebA-64 上实现了 35.65% 的 FID 降低。

Title: Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04926
Pdf URL: https://arxiv.org/pdf/2512.04926
Copy Paste: [[2512.04926]] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion(https://arxiv.org/abs/2512.04926)
Keywords: generation
Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: this https URL.
摘要：潜在扩散模型 (LDM) 本质上遵循从粗到细的生成过程，其中高级语义结构的生成稍早于细粒度纹理。这表明前面的语义通过提供语义锚点可能有利于纹理生成。最近的进展集成了预训练视觉编码器的语义先验，以进一步增强 LDM，但它们仍然同步对语义和 VAE 编码纹理进行去噪，忽略了这种排序。观察这些，我们提出语义优先扩散（SFD），这是一种明确优先考虑语义形成的潜在扩散范式。 SFD 首先通过将紧凑的语义潜在信息与纹理潜在信息相结合来构建复合潜在信息，该语义潜在信息是通过专用语义 VAE 从预训练的视觉编码器中提取的。 SFD 的核心是使用单独的噪声调度对语义和纹理潜伏进行异步降噪：语义通过时间偏移先于纹理，为纹理细化提供更清晰的高级指导，并实现自然的从粗到细的生成。在有指导的 ImageNet 256x256 上，SFD 实现了 FID 1.06 (LightningDiT-XL) 和 FID 1.04 (1.0B LightningDiT-XXL)，同时实现比原始 DiT 快 100 倍的收敛速度。 SFD 还改进了 ReDi 和 VA-VAE 等现有方法，展示了异步、语义主导建模的有效性。项目页面和代码：此 https URL。
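The "semantics precede textures by a temporal offset" idea in the abstract above can be sketched as two progress curves driven by one shared clock. This is an illustrative assumption, not SFD's actual schedule: the function names, the linear clock, and the convention that progress 0 means pure noise and 1 means fully denoised are all hypothetical.

```python
def async_progress(clock, offset):
    """Denoising progress (0 = pure noise, 1 = clean) for each latent.

    The semantic latent follows the shared clock directly, while the
    texture latent lags behind it by `offset`.
    """
    semantic = min(max(clock, 0.0), 1.0)
    texture = min(max(clock - offset, 0.0), 1.0)
    return semantic, texture

def async_schedule(num_steps, offset):
    """Evenly spaced clock values covering both latents' full trajectories."""
    total = 1.0 + offset  # the clock must run longer so texture can finish
    return [async_progress(total * k / num_steps, offset) for k in range(num_steps + 1)]

schedule = async_schedule(num_steps=4, offset=0.25)
```

In this toy schedule the semantic latent is always at least as clean as the texture latent, so each texture update is conditioned on semantics that are further along.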

Title: Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis

Authors: Jasmaine Khale, Ravi Prakash Srivastava
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04967
Pdf URL: https://arxiv.org/pdf/2512.04967
Copy Paste: [[2512.04967]] Balanced Few-Shot Episodic Learning for Accurate Retinal Disease Diagnosis(https://arxiv.org/abs/2512.04967)
Keywords: generation
Abstract: Automated retinal disease diagnosis is vital given the rising prevalence of conditions such as diabetic retinopathy and macular degeneration. Conventional deep learning approaches require large annotated datasets, which are costly and often imbalanced across disease categories, limiting their reliability in practice. Few-shot learning (FSL) addresses this challenge by enabling models to generalize from only a few labeled samples per class. In this study, we propose a balanced few-shot episodic learning framework tailored to the Retinal Fundus Multi-Disease Image Dataset (RFMiD). Focusing on the ten most represented classes, which still show substantial imbalance between majority diseases (e.g., Diabetic Retinopathy, Macular Hole) and minority ones (e.g., Optic Disc Edema, Branch Retinal Vein Occlusion), our method integrates three key components: (i) balanced episodic sampling, ensuring equal participation of all classes in each 5-way 5-shot episode; (ii) targeted augmentation, including Contrast Limited Adaptive Histogram Equalization (CLAHE) and color/geometry transformations, to improve minority-class diversity; and (iii) a ResNet-50 encoder pretrained on ImageNet, selected for its superior ability to capture fine-grained retinal features. Prototypes are computed in the embedding space and classification is performed with cosine similarity for improved stability. Trained on 100 episodes and evaluated on 1,000 test episodes, our framework achieves substantial accuracy gains and reduces bias toward majority classes, with notable improvements for underrepresented diseases. These results demonstrate that dataset-aware few-shot pipelines, combined with balanced sampling and CLAHE-enhanced preprocessing, can deliver more robust and clinically fair retinal disease diagnosis under data-constrained conditions.
摘要：鉴于糖尿病视网膜病变和黄斑变性等疾病的患病率不断上升，自动化视网膜疾病诊断至关重要。传统的深度学习方法需要大量带注释的数据集，这些数据集成本高昂，而且在疾病类别之间往往不平衡，限制了其在实践中的可靠性。少样本学习 (FSL) 通过使模型能够从每个类别的少数标记样本中进行泛化来解决这一挑战。在这项研究中，我们提出了一个针对视网膜眼底多疾病图像数据集（RFMiD）量身定制的平衡的少镜头情景学习框架。关注最具代表性的十个类别，这些类别在大多数疾病（例如，糖尿病视网膜病变、黄斑裂孔）和少数疾病（例如，视盘水肿、分支视网膜静脉阻塞）之间仍然表现出严重不平衡，我们的方法整合了三个关键组成部分：（i）平衡的情景抽样，确保所有类别平等参与每个 5 路 5 镜头事件；（ii）有针对性的增强，包括对比度有限自适应直方图均衡（CLAHE）和颜色/几何变换，以提高少数群体的多样性； (iii) 在 ImageNet 上预训练的 ResNet-50 编码器，因其捕获细粒度视网膜特征的卓越能力而被选中。在嵌入空间中计算原型，并使用余弦相似度进行分类以提高稳定性。经过 100 个集的训练和 1,000 个测试集的评估，我们的框架实现了显着的准确性提升并减少了对大多数类别的偏见，对代表性不足的疾病有了显着的改善。这些结果表明，数据集感知的少样本管道与平衡采样和 CLAHE 增强预处理相结合，可以在数据受限的条件下提供更稳健和临床公平的视网膜疾病诊断。
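
Note: the abstract above describes balanced 5-way 5-shot episodes and cosine-similarity classification against class prototypes. The following is a minimal sketch of those two ideas, not the authors' code. It assumes embeddings are already available as plain lists of floats (the ResNet-50 encoder and CLAHE augmentation are left out), that every class has at least k_shot + n_query items, and all function names and the n_query parameter are hypothetical.

```python
import math
import random


def sample_balanced_episode(data_by_class, n_way=5, k_shot=5, n_query=5, rng=None):
    """Pick n_way classes uniformly, then the same number of items from each.

    Drawing classes uniformly (rather than drawing images uniformly) is what
    keeps minority classes from being under-sampled.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        items = rng.sample(data_by_class[c], k_shot + n_query)
        support[c] = items[:k_shot]
        query[c] = items[k_shot:]
    return support, query


def mean_vector(vectors):
    """Class prototype: element-wise mean of the support embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(embedding, prototypes):
    """Return the class whose prototype is most cosine-similar to the embedding."""
    return max(prototypes, key=lambda c: cosine(embedding, prototypes[c]))
```

A typical use would build prototypes with {c: mean_vector(v) for c, v in support.items()} and then call classify on each query embedding.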

Title: Rethinking the Use of Vision Transformers for AI-Generated Image Detection

Authors: NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04969
Pdf URL: https://arxiv.org/pdf/2512.04969
Copy Paste: [[2512.04969]] Rethinking the Use of Vision Transformers for AI-Generated Image Detection(https://arxiv.org/abs/2512.04969)
Keywords: generative
Abstract: Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
摘要：来自 CLIP-ViT 的丰富特征表示已广泛应用于 AI 生成的图像检测。虽然大多数现有方法主要利用最后一层的特征，但我们系统地分析了分层特征对此任务的贡献。我们的研究表明，较早的层提供了更多的本地化和泛化特征，通常超过了检测任务中最终层特征的性能。此外，我们发现不同的层捕获数据的不同方面，每个方面对人工智能生成的图像检测都有独特的贡献。受这些发现的启发，我们引入了一种新颖的自适应方法，称为 MoLD，它使用基于门控的机制动态集成多个 ViT 层的特征。对 GAN 和扩散生成图像的大量实验表明，MoLD 显着提高了检测性能，增强了不同生成模型的泛化能力，并在现实场景中表现出鲁棒性。最后，我们通过成功地将其应用于其他预训练的 ViT（例如 DINOv2）来说明该方法的可扩展性和多功能性。
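
Note: the abstract above says MoLD integrates features from multiple ViT layers through a gating-based mechanism but does not specify the gate. The sketch below shows one generic form such a gate could take (a softmax over per-layer scores, followed by a weighted sum). It is an illustration under that assumption, not the MoLD architecture; gate_weights stands in for hypothetical learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(layer_features, gate_weights):
    """Fuse per-layer feature vectors with input-dependent softmax gates.

    layer_features: list of L vectors, each of dimension D (one per ViT layer).
    gate_weights:   list of L vectors of dimension D (hypothetical learned
                    parameters; one scoring vector per layer).
    Returns (fused_vector, gates).
    """
    scores = [
        sum(w * x for w, x in zip(gw, feat))
        for gw, feat in zip(gate_weights, layer_features)
    ]
    gates = softmax(scores)
    dim = len(layer_features[0])
    fused = [
        sum(g * feat[d] for g, feat in zip(gates, layer_features))
        for d in range(dim)
    ]
    return fused, gates
```

Because the scores depend on the features themselves, the mixture over layers can change from image to image, which is the sense of "dynamically" assumed here.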

Title: Efficient Generative Transformer Operators For Million-Point PDEs

Authors: Armand Kassaï Koupaï, Lise Le Boudec, Patrick Gallinari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04974
Pdf URL: https://arxiv.org/pdf/2512.04974
Copy Paste: [[2512.04974]] Efficient Generative Transformer Operators For Million-Point PDEs(https://arxiv.org/abs/2512.04974)
Keywords: generation, generative
Abstract: We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100 $\times$ spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.
摘要：我们介绍 ECHO，一个用于生成百万点 PDE 轨迹的变换算子框架。虽然现有的神经算子 (NO) 在求解偏微分方程方面表现出了良好的前景，但由于密集网格上的可扩展性差、动态展开过程中的误差累积以及特定于任务的设计，它们在实践中仍然受到限制。 ECHO 通过三项关键创新来应对这些挑战。 (i) 它采用分层卷积编码-解码架构，实现 100 $\times$ 时空压缩，同时保持网格点的保真度。 (ii) 它结合了训练和适应策略，可以从稀疏输入网格生成高分辨率 PDE 解决方案。（iii）它采用生成建模范式，可以学习完整的轨迹段，从而减轻长范围误差漂移。训练策略将表示学习与下游任务监督分离，使模型能够处理多个任务，例如轨迹生成、正向和逆向问题以及插值。生成模型进一步支持条件生成和无条件生成。我们在具有复杂几何形状、高频动态和长期视野的各种偏微分方程系统中展示了最先进的百万点模拟性能。

Title: Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04981
Pdf URL: https://arxiv.org/pdf/2512.04981
Copy Paste: [[2512.04981]] Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models(https://arxiv.org/abs/2512.04981)
Keywords: generation
Abstract: Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
摘要：基于大型视觉语言模型（LVLM）的文本到图像（T2I）系统已成为图像生成的主导范例，但它们是否会放大社会偏见仍不清楚。在本文中，我们表明基于 LVLM 的模型比非基于 LVLM 的模型产生的社会偏见图像明显更多。我们引入了涵盖四个语言复杂性级别的 1,024 个提示基准，并以系统的方式评估多个属性的人口统计偏差。我们的分析确定系统提示（指导 LVLM 的预定义指令）是偏见行为的主要驱动因素。通过解码的中间表示、标记概率诊断和嵌入关联分析，我们揭示了系统提示如何对传播到图像合成中的人口统计先验进行编码。为此，我们提出了 FairPro，这是一种免培训的元提示框架，使 LVLM 能够在测试时进行自我审核并构建公平感知系统提示。对两个基于 LVLM 的 T2I 模型 SANA 和 Qwen-Image 的实验表明，FairPro 在保持文本图像对齐的同时大大减少了人口统计偏差。我们相信我们的研究结果可以更深入地了解系统提示在偏见传播中的核心作用，并为构建更具社会责任感的 T2I 系统提供实用、可部署的方法。

Title: Reflection Removal through Efficient Adaptation of Diffusion Transformers

Authors: Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.05000
Pdf URL: https://arxiv.org/pdf/2512.05000
Copy Paste: [[2512.05000]] Reflection Removal through Efficient Adaptation of Diffusion Transformers(https://arxiv.org/abs/2512.05000)
Keywords: restoration
Abstract: We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: this https URL
摘要：我们引入了用于单图像反射去除的扩散变换器（DiT）框架，该框架利用了恢复设置中基础扩散模型的泛化优势。我们不依赖特定于任务的架构，而是通过在反射污染的输入上进行调节并引导其走向干净的传输层来重新调整基于 DiT 的预训练基础模型的用途。我们系统地分析现有的反射去除数据源的多样性、可扩展性和真实感。为了解决合适数据的短缺问题，我们在 Blender 中构建了一个基于物理的渲染 (PBR) 管道，围绕 Principled BSDF 构建，以合成逼真的玻璃材质和反射效果。基于 LoRA 的基础模型的高效适应，结合所提出的合成数据，在域内和零样本基准测试中实现了最先进的性能。这些结果表明，预训练的扩散变压器与物理接地数据合成和高效适应相结合，可以为反射消除提供可扩展的高保真解决方案。项目页面：此 https URL

Title: Generative Neural Video Compression via Video Diffusion Prior

Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05016
Pdf URL: https://arxiv.org/pdf/2512.05016
Copy Paste: [[2512.05016]] Generative Neural Video Compression via Video Diffusion Prior(https://arxiv.org/abs/2512.05016)
Keywords: generation, generative
Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
摘要：我们提出了 GNVC-VD，这是第一个基于 DiT 的生成神经视频压缩框架，建立在高级视频生成基础模型的基础上，其中时空潜在压缩和序列级生成细化在单个编解码器中统一。现有的感知编解码器主要依靠预先训练的图像生成先验来恢复高频细节，但其逐帧性质缺乏时间建模，不可避免地导致感知闪烁。为了解决这个问题，GNVC-VD 引入了一个统一的流匹配潜在细化模块，该模块利用视频扩散转换器通过序列级去噪来联合增强帧内和帧间潜在，从而确保一致的时空细节。 GNVC-VD 不是像视频生成中那样从纯高斯噪声中进行去噪，而是从解码的时空潜伏中进行初始化细化，并学习一个校正项，该校正项可以在压缩引起的退化之前适应扩散。调节适配器进一步将压缩感知线索注入中间 DiT 层，从而实现有效去除伪影，同时在极端比特率限制下保持时间一致性。大量实验表明，GNVC-VD 在感知质量方面超越了传统和学习编解码器，并显着减少了先前生成方法中持续存在的闪烁伪影，甚至低于 0.01 bpp，凸显了将视频原生生成先验集成到神经编解码器中以实现下一代感知视频压缩的前景。

Title: Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Authors: Abhigyan Bhattacharya, Hiranmoy Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05039
Pdf URL: https://arxiv.org/pdf/2512.05039
Copy Paste: [[2512.05039]] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding(https://arxiv.org/abs/2512.05039)
Keywords: restoration, generative
Abstract: Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task specifically aimed at photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning, then refines the texture. This process gives clear insight into the facial structure before detailed images are created. In the first stage, we blend two techniques: CNNs, which focus on local features, and Vision Transformers, which capture global features. This yields clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring the result looks cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.
摘要：面部图像修复的目的是恢复面部图像中丢失或损坏的区域，同时保持身份、结构一致性和逼真的图像质量，这是专门为照片修复创建的任务。尽管深度生成模型最近取得了很多进展，但现有方法面临着大型不规则掩模的问题，由于直接像素级合成方法和面部先验的有限利用，通常会在掩模区域的边缘产生模糊的纹理、语义不一致或不令人信服的面部结构。在本文中，我们提出了一种新颖的架构，通过语义引导的层次综合来解决上述挑战。我们的方法首先是根据含义组织和合成信息，然后细化纹理。在我们继续创建详细图像之前，这个过程可以清晰地了解面部结构。在第一阶段，我们融合了两种技术：一种是使用 CNN 关注局部特征，另一种是使用 Vision Transformer 关注全局特征。这帮助我们创建清晰且详细的语义布局。在第二阶段，我们使用多模态纹理生成器通过从不同尺度提取信息来细化这些布局，确保一切看起来都有凝聚力和一致。该架构通过动态注意力自然地处理任意掩模配置，无需特定于掩模的训练。在两个数据集 CelebA-HQ 和 FFHQ 上进行的实验表明，我们的模型优于其他最先进的方法，在 LPIPS、PSNR 和 SSIM 等指标方面显示出改进。在具有挑战性的大面积修复情况下，它可以产生视觉上引人注目的结果，并具有更好的语义保留。

Title: Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Authors: Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05044
Pdf URL: https://arxiv.org/pdf/2512.05044
Copy Paste: [[2512.05044]] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image(https://arxiv.org/abs/2512.05044)
Keywords: generation
Abstract: Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: this https URL.
摘要：从单个静态图像生成交互式动态 4D 场景仍然是一个核心挑战。大多数现有的“生成然后重建”和“重建然后生成”方法将几何图形与运动解耦，导致时空不一致和泛化能力差。为了解决这些问题，我们扩展了重建然后生成框架来联合执行 4D 合成的运动生成和几何重建 (MoRe4D)。我们首先介绍 TrajScene-60K，这是一个包含 60,000 个视频样本的大规模数据集，具有密集的点轨迹，解决了高质量 4D 场景数据的稀缺问题。基于此，我们提出了一种基于扩散的 4D 场景轨迹生成器 (4D-STraG)，以联合生成几何一致且运动合理的 4D 点轨迹。为了利用单视图先验，我们设计了深度引导运动标准化策略和运动感知模块，以实现有效的几何和动力学集成。然后，我们提出了一个 4D 视图合成模块 (4D-ViSM)，用于根据 4D 点轨迹表示来渲染具有任意相机轨迹的视频。实验表明，MoRe4D 可以从单个图像生成具有多视图一致性和丰富动态细节的高质量 4D 场景。代码：此 https URL。

Title: BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05076
Pdf URL: https://arxiv.org/pdf/2512.05076
Copy Paste: [[2512.05076]] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation(https://arxiv.org/abs/2512.05076)
Keywords: generation
Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: this https URL
摘要：新兴的视频扩散模型实现了高视觉保真度，但从根本上将场景动态与摄像机运动耦合在一起，限制了它们提供精确的空间和时间控制的能力。我们引入了 4D 可控视频扩散框架，该框架明确地将场景动态与摄像机姿态解耦，从而能够对场景动态和摄像机视点进行细粒度操作。我们的框架采用连续的世界时间序列和摄像机轨迹作为条件输入，通过注意力层中的 4D 位置编码和特征调制的自适应归一化将它们注入视频扩散模型。为了训练这个模型，我们创建了一个独特的数据集，其中时间和相机变化是独立参数化的；该数据集将被公开。实验表明，我们的模型在不同的时序模式和相机轨迹上实现了强大的现实世界 4D 控制，同时保持了高生成质量并在可控性方面优于先前的工作。请参阅我们的网站了解视频结果：此 https URL

Title: Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Authors: Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.05079
Pdf URL: https://arxiv.org/pdf/2512.05079
Copy Paste: [[2512.05079]] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints(https://arxiv.org/abs/2512.05079)
Keywords: generation, generative
Abstract: Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
摘要：物体几何形状是机器人操纵的关键信息。然而，对象重建是一项具有挑战性的任务，因为相机只能捕获对象的部分观察结果，尤其是在发生遮挡时。在本文中，我们利用两个额外的信息源来减少视觉信号的模糊性。首先，生成模型学习常见物体形状的先验，使我们能够对几何中不可见的部分做出合理的猜测。其次，可以从视频和物理交互中获得的接触信息为几何边界提供了稀疏约束。我们通过接触引导的 3D 生成将两种信息源结合起来。指导制定的灵感来自于生成模型中基于拖动的编辑。对合成数据和真实世界数据的实验表明，与纯 3D 生成和基于接触的优化相比，我们的方法改进了重建。

Title: OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design

Authors: Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah, Ramith Hettiarachchi, David R. Koes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.05080
Pdf URL: https://arxiv.org/pdf/2512.05080
Copy Paste: [[2512.05080]] OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design(https://arxiv.org/abs/2512.05080)
Keywords: generative
Abstract: Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state of the art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at this https URL
摘要：基于结构的药物设计（SBDD）专注于设计与特定蛋白质口袋结合的小分子配体。计算方法是现代 SBDD 工作流程中不可或缺的一部分，并且通常通过对接或药效团搜索来利用虚拟筛选方法。现代生成建模方法侧重于通过从头设计来改进新型配体的发现。在这项工作中，我们认识到这些任务共享一个共同的结构，因此可以表示为一致的生成建模框架的不同实例。我们在 OMTRA 中提出了一种统一的方法，这是一种多模式流匹配模型，可以灵活地执行与 SBDD 相关的许多任务，包括一些传统工作流程中没有的类似任务。此外，我们还整理了 500M 3D 分子构象异构体的数据集，补充了蛋白质配体数据并扩展了可用于训练的化学多样性。 OMTRA 在袖珍条件从头设计和对接方面获得了最先进的性能；然而，大规模预训练和多任务训练的效果有限。用于重现这项工作的所有代码、经过训练的模型和数据集均可在此 https URL 中找到

Title: Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05081
Pdf URL: https://arxiv.org/pdf/2512.05081
Copy Paste: [[2512.05081]] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression(https://arxiv.org/abs/2512.05081)
Keywords: generation
Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
摘要：自回归视频扩散的最新进展已经实现了实时帧流，但现有的解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现，天真地将 StreamingLLM 式的注意力集中应用于视频扩散会导致保真度下降和运动停滞。为了克服这个问题，我们引入了深度强制，它由两种免训练机制组成，无需任何微调即可解决此问题。具体来说，1) Deep Sink 将一半的滑动窗口专用于持久性接收器令牌，并将其时间 RoPE 阶段重新与当前时间线对齐，从而在长时间部署期间稳定全局上下文。 2) 参与压缩执行重要性感知的 KV 缓存修剪，仅保留积极参与最近关注的令牌，同时安全地丢弃冗余和降级的历史记录，最大限度地减少分布长度生成下的错误累积。这些组件共同实现了超过 12 倍的外推（例如，5 秒训练到 60 秒以上的生成），具有比 LongLive 更好的成像质量、比 RollingForcing 更好的美学质量、几乎保持整体一致性，并在动态程度方面大幅提高，同时保持实时生成。我们的结果表明，免训练的 KV 缓存管理可以匹配或超过基于训练的自回归流式长视频生成方法。
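
Note: the abstract above states that Deep Sink dedicates half of the sliding window to persistent sink tokens. The sketch below illustrates only that index bookkeeping. It assumes the sinks are the earliest cached frames (as in StreamingLLM-style attention sinks) and does not model the temporal RoPE re-alignment or Participative Compression; the function name is hypothetical.

```python
def deep_sink_window(num_cached, window_size):
    """Return the indices of cached frames that remain visible to attention.

    Half of the window is reserved for the earliest frames (the sinks) and
    the remainder for the most recent frames. While the cache still fits in
    the window, everything is kept.
    """
    if num_cached <= window_size:
        return list(range(num_cached))
    n_sink = window_size // 2
    n_recent = window_size - n_sink
    sinks = list(range(n_sink))
    recent = list(range(num_cached - n_recent, num_cached))
    return sinks + recent


# With 20 cached frames and a window of 8, frames 0-3 act as sinks and
# frames 16-19 form the recent half.
print(deep_sink_window(20, 8))  # [0, 1, 2, 3, 16, 17, 18, 19]
```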

Title: SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards

Authors: Yuan Gao, Jin Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.05098
Pdf URL: https://arxiv.org/pdf/2512.05098
Copy Paste: [[2512.05098]] SA-IQA: Redefining Image Quality Assessment for Spatial Aesthetics with Multi-Dimensional Rewards(https://arxiv.org/abs/2512.05098)
Keywords: generation, quality assessment
Abstract: In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
摘要：近年来，针对人工智能生成图像（AIGI）的图像质量评估（IQA）发展迅速；然而，现有方法主要针对肖像和艺术图像，缺乏对室内场景的系统评估。我们引入空间美学，这是一种从四个维度评估室内图像美学质量的范式：布局、和谐、照明和扭曲。我们构建了 SA-BENCH，第一个空间美学基准，包含 18,000 张图像和 50,000 个精确注释。利用SA-BENCH，我们系统地评估了当前的IQA方法，并通过MLLM微调和多维融合方法开发了SA-IQA，作为评估空间美学的综合奖励框架。我们将 SA-IQA 应用于两个下游任务：（1）作为与 GRPO 强化学习集成的奖励信号来优化 AIGC 生成管道，以及（2）Best-of-N 选择来过滤高质量图像并提高生成质量。实验表明，SA-IQA 显着优于 SA-BENCH 上的现有方法，为空间美学评估设立了新标准。代码和数据集将开源，以推进该领域的研究和应用。
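
Note: the abstract above mentions Best-of-N selection driven by a reward over four dimensions (layout, harmony, lighting, distortion). The sketch below shows the selection step only. The fused reward here is a plain weighted average, which is an assumption made for illustration; the abstract does not describe how SA-IQA fuses the dimensions, and all names are hypothetical.

```python
DIMENSIONS = ("layout", "harmony", "lighting", "distortion")


def fused_reward(scores, weights=None):
    """Combine per-dimension scores into one scalar (assumed: weighted mean)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


def best_of_n(candidates, score_fn):
    """Return the candidate with the highest fused reward.

    candidates: any list of generated items (e.g. image identifiers).
    score_fn:   maps a candidate to a dict with one score per dimension.
    """
    return max(candidates, key=lambda c: fused_reward(score_fn(c)))
```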

Title: TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.05103
Pdf URL: https://arxiv.org/pdf/2512.05103
Copy Paste: [[2512.05103]] TV2TV: A Unified Framework for Interleaved Language and Video Generation(https://arxiv.org/abs/2512.05103)
Keywords: generation, generative
Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
摘要：视频生成模型正在迅速发展，但仍然难以应对复杂的视频输出，这些输出需要大量的语义分支或对接下来应该发生的情况进行重复的高级推理。在本文中，我们介绍了一类新型全向视频文本模型，该模型集成了最新的 LM 推理进展的思想来应对这一挑战。更具体地说，我们提出了 TV2TV，一个统一的生成建模框架，它将视频生成分解为交错的文本和视频生成过程。 TV2TV 使用 Mixture-of-Transformers (MoT) 架构联合学习语言建模（下一个标记预测）和视频流匹配（下一帧预测）。在推理时，TV2TV 决定何时交替生成文本和视频帧，从而允许模型在“以像素行动”生成帧之前“用文字思考”后续内容。这种设计减轻了决定语言建模塔旁边应该发生什么的大部分责任，从而提高了视觉质量并及时对齐生成的视频。它还实现了细粒度的可控性，允许用户在过程中的任何点通过文本干预来修改视频生成轨迹。在视频游戏数据的受控实验中，TV2TV 在视觉质量和可控性方面都表现出了显着的改进。 TV2TV 还可以扩展到自然视频，正如我们通过使用视觉语言模型 (VLM) 通过交错的自然语言动作描述来增强体育视频所展示的那样。在此语料库上训练 TV2TV 可产生强大的视觉质量和及时的对齐，展示了模型推理和生成复杂的现实世界动作序列的能力。总之，这些结果凸显了 TV2TV 是朝着具有开放式文本推理和控制的视频生成迈出了有希望的一步。

Title: EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation

Authors: Jiaqi Ma, Shengkai Hu, Jun Wan, Jiaxing Huang, Lefei Zhang, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05104
Pdf URL: https://arxiv.org/pdf/2512.05104
Copy Paste: [[2512.05104]] EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation(https://arxiv.org/abs/2512.05104)
Keywords: restoration
Abstract: All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limit generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM) that decomposes features into high- and low-frequency branches in an explicit manner and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
摘要：多合一图像恢复 (AiOIR) 任务通常涉及多种退化，需要强大且通用的策略。然而，大多数现有方法通常缺乏明确的频率建模，并依赖于固定或启发式优化计划，这限制了异构退化的泛化。为了解决这些限制，我们提出了 EvoIR，这是一种 AiOIR 特定的框架，它引入了用于动态和自适应图像恢复的进化频率调制。具体来说，EvoIR 采用调频模块 (FMM)，以显式方式将特征分解为高频和低频分支，并自适应调制它们以增强结构保真度和细粒度细节。 EvoIR 的核心是进化优化策略 (EOS)，通过基于群体的进化过程迭代调整频率感知目标，动态平衡结构准确性和感知保真度。其进化指导进一步减轻了退化过程中的梯度冲突并加速了收敛。通过协同 FMM 和 EOS，EvoIR 比单独使用任一组件产生了更大的改进，强调了它们的互补作用。对多个基准的大量实验表明，EvoIR 优于最先进的 AiOIR 方法。
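
Note: the abstract above describes decomposing features into low- and high-frequency branches and modulating each. The sketch below shows that idea on a 1D signal, using a box filter for the low-frequency branch and the residual for the high-frequency branch. This is an illustrative stand-in, not the FMM itself, which operates on learned features; the filter choice, the fixed gains, and the function names are assumptions.

```python
def split_frequencies(signal, radius=1):
    """Split a 1D signal into a smooth (low) part and a residual (high) part.

    The low branch is a moving average over a window of up to 2*radius+1
    samples (truncated at the borders); the high branch is what remains, so
    low[i] + high[i] reproduces signal[i].
    """
    n = len(signal)
    low = []
    for i in range(n):
        start, stop = max(0, i - radius), min(n, i + radius + 1)
        window = signal[start:stop]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def modulate(low, high, low_gain, high_gain):
    """Recombine the branches with separate gains (fixed here, learned in practice)."""
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]
```

Setting both gains to 1.0 returns the input unchanged, while a high_gain above 1.0 emphasizes fine detail relative to structure.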

Title: NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
Subjects: cs.CV, cs.GR, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.05106
Pdf URL: https://arxiv.org/pdf/2512.05106
Copy Paste: [[2512.05106]] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation(https://arxiv.org/abs/2512.05106)
Keywords: generation
Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion ($\phi$-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. $\phi$-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, $\phi$-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, $\phi$-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: this https URL.
摘要：标准扩散使用高斯噪声破坏数据，其傅里叶系数具有随机幅度和随机相位。虽然对于无条件或文本到图像生成有效，但破坏相位分量会破坏空间结构，使其不适合需要几何一致性的任务，例如重新渲染、模拟增强和图像到图像转换。我们引入了保相扩散（φ-PD），这是一种与模型无关的扩散过程重新表述，它在随机化幅度的同时保留输入相位，从而无需架构更改或附加参数即可实现结构对齐生成。我们进一步提出频率选择结构（FSS）噪声，它通过单个频率截止参数提供对结构刚度的连续控制。 φ-PD 不增加推理时间成本，并且与图像或视频的任何扩散模型兼容。通过照片级真实感和风格化重新渲染，以及驾驶规划者的模拟到真实增强，φ-PD 可以产生可控的、空间对齐的结果。当应用于 CARLA 模拟器时，φ-PD 将 CARLA 到 Waymo 规划器的性能提高了 50%。该方法是对现有调节方法的补充，广泛适用于图像到图像和视频到视频的生成。视频、其他示例和代码可在我们的项目页面（此 https URL）上找到。
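
Sketch (not from the paper): the NeuralRemaster abstract describes noise that keeps the input's Fourier phase while randomizing magnitude. The following is a minimal illustration of that idea on a 1-D signal using only the Python standard library. The function names, the naive DFT, and the DDPM-style mixing with `alpha_bar` are assumptions made for illustration; the authors' implementation, and their FSS frequency-cutoff variant, may differ.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def phase_preserving_noise(signal, rng):
    """Noise with Gaussian-noise magnitudes but the Fourier phase of `signal`."""
    phases = [cmath.phase(c) for c in dft(signal)]        # structure of the input
    gaussian = [rng.gauss(0.0, 1.0) for _ in signal]
    magnitudes = [abs(c) for c in dft(gaussian)]          # random magnitudes
    mixed = [cmath.rect(m, p) for m, p in zip(magnitudes, phases)]
    # Real inputs give a conjugate-symmetric spectrum, so the imaginary
    # part of the inverse transform is only rounding error.
    return [c.real for c in idft(mixed)]

def forward_diffuse(x0, alpha_bar, rng):
    """Assumed mixing rule: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = phase_preserving_noise(x0, rng)
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]
```

For images, the same construction would use a 2-D FFT over each channel; the 1-D version is kept here only so the example stays dependency-free.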

Title: ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.05110
Pdf URL: https://arxiv.org/pdf/2512.05110
Copy Paste: [[2512.05110]] ShadowDraw: From Any Object to Shadow-Drawing Compositional Art(https://arxiv.org/abs/2512.05110)
Keywords: generation, generative
Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page this https URL for more results and an end-to-end real-world demonstration of our pipeline!
摘要：我们介绍 ShadowDraw，一个将普通 3D 对象转换为阴影绘制构图艺术的框架。给定一个 3D 对象，我们的系统会预测场景参数，包括对象姿势和光照，以及部分线条绘制，以便投射阴影将绘图完成为可识别的图像。为此，我们优化场景配置以揭示有意义的阴影，使用阴影笔划来指导线条图生成，并采用自动评估来增强阴影绘制的连贯性和视觉质量。实验表明，ShadowDraw 在不同的输入（从现实世界的扫描和精选数据集到生成资产）中产生令人信服的结果，并自然地扩展到多对象场景、动画和物理部署。我们的工作为创建阴影绘画艺术提供了一条实用的管道，并拓宽了计算视觉艺术的设计空间，弥合了算法设计和艺术叙事之间的差距。查看我们的项目页面（此 https URL）以获取更多结果以及我们管道的端到端真实演示！

Title: ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05111
Pdf URL: https://arxiv.org/pdf/2512.05111
Copy Paste: [[2512.05111]] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning(https://arxiv.org/abs/2512.05111)
Keywords: generative
Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
摘要：奖励模型对于使视觉语言系统与人类偏好保持一致至关重要，但当前的方法存在幻觉、视觉基础薄弱以及无法使用工具进行验证的问题，限制了它们在复杂的多模态推理任务上的可靠性。我们提出了 ARM-Thinker，一种自主的多模式奖励模型，它自动调用外部工具（例如图像裁剪、文档页面检索）来根据可验证的证据进行判断，取代静态的、非交互式的奖励评分。这使得模型能够验证细粒度的视觉细节、交叉引用多页证据并验证推理主张，这些都是现有奖励模型所缺乏的功能。我们通过多阶段强化学习来训练 ARM-Thinker，共同优化工具调用决策和判断准确性。为了评估代理奖励模型，我们引入了 ARMBench-VL，它包含三个基准测试，分别评估细粒度视觉基础（图像级工具）、多页文档理解（检索工具）和指令遵循（文本级验证）。 ARM-Thinker 在奖励建模基准上实现了 +16.2% 的平均改进，在工具使用任务上实现了 +9.6% 的平均改进，并且在多模态数学和逻辑推理基准上优于基准。我们的结果表明，代理能力显着提高了奖励模型的准确性和可解释性。

Title: DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.05112
Pdf URL: https://arxiv.org/pdf/2512.05112
Copy Paste: [[2512.05112]] DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation(https://arxiv.org/abs/2512.05112)
Keywords: super-resolution, generation
Abstract: Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
摘要：最近的统一多模态大语言模型 (MLLM) 显示了令人印象深刻的功能，结合了思想链 (CoT) 推理来增强文本到图像的生成。然而，现有的方法仍然有限，要么仅将模型视为独立的生成器，要么依赖于抽象的文本规划。为此，我们提出了 Draft-as-CoT (DraCo)，这是一种新颖的交错推理范式，充分利用 CoT 中的文本和视觉内容来更好地规划和验证。我们的方法首先生成一个低分辨率草稿图像作为预览，提供更具体和结构性的视觉规划和指导。然后，我们利用模型固有的理解能力来验证草稿和输入提示之间潜在的语义不一致，并通过超分辨率的选择性校正来进行细化。通过这种方式，我们的方法解决了两个基本挑战：文本规划的粗粒度性质和生成稀有属性组合的难度。为了支持培训，我们策划了 DraCo-240K，旨在增强涵盖一般校正、实例操作和布局重组的三种原子功能。在 DraCo-CFG（一种用于交错推理的专用无分类器引导 (CFG) 策略）的支持下，DraCo 在 GenEval (+8%)、Imagine-Bench (+0.91) 和 GenEval++ (+3%) 上实现了巨大的提升，显着优于直接生成和 CoT 授权的其他生成方法。
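
Sketch (not from the paper): the DraCo abstract mentions a specialized classifier-free guidance (CFG) strategy but does not define it. For orientation only, the standard single-condition CFG rule that such strategies build on can be written as below. This is the generic formula, not DraCo-CFG; the function name and the plain-list representation of predictions (noise estimates or token logits) are assumptions.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance, applied elementwise.

    uncond: model prediction without the condition
    cond:   model prediction with the condition
    scale:  guidance scale (0 gives uncond, 1 gives cond, >1 extrapolates past cond)
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

With `scale=2.0`, `cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0)` pushes each value past the conditional prediction by the same gap that separates it from the unconditional one.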

Title: Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05115
Pdf URL: https://arxiv.org/pdf/2512.05115
Copy Paste: [[2512.05115]] Light-X: Generative 4D Video Rendering with Camera and Illumination Control(https://arxiv.org/abs/2512.05115)
Keywords: generation, generative
Abstract: Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
摘要：照明控制的最新进展将基于图像的方法扩展到视频，但仍然面临照明保真度和时间一致性之间的权衡。除了重新照明之外，现实世界场景生成建模的关键一步是相机轨迹和照明的联合控制，因为视觉动态本质上是由几何和照明共同塑造的。为此，我们推出了 Light-X，这是一种视频生成框架，可以通过视点和照明控制实现单目视频的可控渲染。 1）我们提出了一种解耦几何体和照明信号的解耦设计：几何体和运动是通过沿着用户定义的相机轨迹投影的动态点云捕获的，而照明线索是由一致投影到相同几何体的重照明框架提供的。这些明确的、细粒度的线索能够有效地解开并引导高质量的照明。 2）为了解决缺乏配对多视图和多照明视频的问题，我们引入了 Light-Syn，这是一种基于退化的管道，具有逆映射功能，可以从野外单目镜头中合成训练对。该策略产生一个涵盖静态、动态和人工智能生成场景的数据集，确保稳健的训练。大量实验表明，Light-X 在联合摄像机照明控制方面优于基线方法，并且在文本和背景条件设置下均优于先前的视频重新照明方法。
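
Sketch (not from the paper): the Light-X abstract says geometry is conveyed by point clouds projected along user-defined camera trajectories. A minimal illustration of that projection step follows, assuming a pinhole camera with a single focal length and a static point cloud. The camera model, function names, and pose format are assumptions; the abstract specifies none of them, and the paper's point clouds are dynamic (one per frame).

```python
def project_points(points_world, rotation, translation, focal, cx, cy):
    """Project 3-D world points to pixel coordinates with a pinhole camera.

    rotation:    3x3 world-to-camera matrix as a list of rows
    translation: length-3 world-to-camera offset
    Returns one (u, v) tuple per point, or None for points behind the camera.
    """
    pixels = []
    for p in points_world:
        cam = [sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
               for r in range(3)]
        if cam[2] <= 0:  # not in front of the camera
            pixels.append(None)
            continue
        pixels.append((focal * cam[0] / cam[2] + cx,
                       focal * cam[1] / cam[2] + cy))
    return pixels

def render_trajectory(points_world, poses, focal, cx, cy):
    """Project the same point cloud under each (rotation, translation) pose."""
    return [project_points(points_world, R, t, focal, cx, cy) for R, t in poses]
```

Each entry of the returned list corresponds to one camera pose, so a user-defined trajectory becomes a sequence of 2-D projections of the same geometry.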

Title: Value Gradient Guidance for Flow Matching Alignment

Authors: Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.05116
Pdf URL: https://arxiv.org/pdf/2512.05116
Copy Paste: [[2512.05116]] Value Gradient Guidance for Flow Matching Alignment(https://arxiv.org/abs/2512.05116)
Keywords: generative
Abstract: While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
摘要：虽然存在将流匹配模型（一类流行且有效的生成模型）与人类偏好对齐的方法，但现有方法无法实现适应效率和概率上合理的先验保留。在这项工作中，我们利用最优控制理论，提出了 VGG-Flow，一种基于梯度匹配的方法，用于微调预训练的流匹配模型。该算法背后的关键思想是，微调速度场和预训练速度场之间的最佳差异应与值函数的梯度场相匹配。该方法不仅结合了来自奖励模型的一阶信息，而且受益于价值函数的启发式初始化以实现快速适应。根据经验，我们在流行的文本到图像流匹配模型（Stable Diffusion 3）上证明，我们的方法可以在有限的计算预算下微调流匹配模型，同时实现有效且保留先验的对齐。
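
Sketch (not from the paper): the VGG-Flow abstract states that the difference between the finetuned and pretrained velocity fields should match the gradient field of a value function. One way to read this as a training objective is a squared-error matching loss, shown below on plain Python lists. The loss form, the `scale` factor (which stands in for any sign or temperature constant the paper may use), and the function name are assumptions, not the authors' formulation.

```python
def gradient_matching_loss(v_finetuned, v_pretrained, value_grad, scale=1.0):
    """Mean squared mismatch between the velocity residual and the value gradient.

    Each argument is a list of equal-length vectors, one per sample:
      v_finetuned[i]  velocity from the model being finetuned
      v_pretrained[i] velocity from the frozen pretrained model
      value_grad[i]   gradient of the value function at the same point
    """
    total = 0.0
    for vf, vp, g in zip(v_finetuned, v_pretrained, value_grad):
        total += sum(((a - b) - scale * c) ** 2 for a, b, c in zip(vf, vp, g))
    return total / len(v_finetuned)
```

The loss is zero exactly when the finetuned velocity equals the pretrained velocity plus the scaled value gradient, which is the matching condition described in the abstract.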