2025-12-03

Title: Pharmacophore-based design by learning on voxel grids

Authors: Omar Mahmood, Pedro O. Pinheiro, Richard Bonneau, Saeed Saremi, Vishnu Sresht
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2512.02031
Pdf URL: https://arxiv.org/pdf/2512.02031
Copy Paste: [[2512.02031]] Pharmacophore-based design by learning on voxel grids(https://arxiv.org/abs/2512.02031)
Keywords: generation, generative
Abstract: Ligand-based drug discovery (LBDD) relies on known binders to a protein target to find structurally diverse molecules that are similarly likely to bind. This process typically involves a brute-force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore-shape profile of the known binder onto 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library size and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel-based generative modelling, we propose a pharmacophore-based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore-based virtual screening. We introduce VoxCap, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore-based generation: de-novo design, in which we aim to generate new molecules with high pharmacophore-shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse de-novo hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.
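Illustrative sketch (not from the paper): the overlap scoring used in conventional shape-based screening can be pictured as a Tanimoto coefficient on occupancy grids. The atom radius, grid spacing, grid extent, and the function names `voxelize` and `shape_tanimoto` are all assumptions made for this toy example. It does not reproduce VoxCap or any specific screening tool, and it ignores pharmacophore feature types and conformer alignment.

```python
import itertools

def voxelize(atoms, radius=1.5, spacing=1.0, half_extent=4):
    """Return the set of grid indices whose centre lies within `radius` of any atom.

    `atoms` is a list of (x, y, z) coordinates; the grid spans
    [-half_extent, half_extent] voxels on each axis (illustrative values).
    """
    occupied = set()
    axis = range(-half_extent, half_extent + 1)
    for i, j, k in itertools.product(axis, axis, axis):
        centre = (i * spacing, j * spacing, k * spacing)
        for atom in atoms:
            dist_sq = sum((c - a) ** 2 for c, a in zip(centre, atom))
            if dist_sq <= radius ** 2:
                occupied.add((i, j, k))
                break
    return occupied

def shape_tanimoto(grid_a, grid_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| on occupied voxels."""
    union = grid_a | grid_b
    if not union:
        return 0.0
    return len(grid_a & grid_b) / len(union)

# Two hypothetical two-atom "molecules" that share one atom position.
query = voxelize([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
candidate = voxelize([(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)])
score = shape_tanimoto(query, candidate)
```

A brute-force screen would repeat this comparison for every enumerated conformer of every library molecule, which is the scaling cost the abstract refers to.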

Title: Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering

Authors: Haoxiang Zhang, Ruihao Yuan, Lihui Zhang, Yushi Luo, Qiang Zhang, Pan Ding, Xiaodong Ren, Weijie Xing, Niu Gao, Jishan Chen, Chubo Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02057
Pdf URL: https://arxiv.org/pdf/2512.02057
Copy Paste: [[2512.02057]] Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering(https://arxiv.org/abs/2512.02057)
Keywords: generation
Abstract: The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models, which is particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking, thereby advancing engineers' understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.
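Illustrative sketch (not from the paper): one way to read the three-stage augmentation protocol is per-parameter noise, then clipping to hard limits, then recomputing any derived quantity from its perturbed parents. The parameter names (`current`, `voltage`, `speed`, `heat_input`), the noise scales, the bounds, and the use of clipping rather than rejection are all assumptions chosen for illustration. The abstract does not state which parameters or constraints the authors use.

```python
import random

def augment(samples, noise_scale, bounds, n_copies=10, seed=0):
    """Generate synthetic rows from a small set of measured samples.

    samples     : list of dicts mapping parameter name -> float
    noise_scale : per-parameter relative noise (stage 1, differentiated noise)
    bounds      : per-parameter (low, high) hard limits (stage 2)
    The derived field 'heat_input' is recomputed from its parents so that the
    inter-parameter relationship survives the perturbation (stage 3).
    """
    rng = random.Random(seed)
    synthetic = []
    for row in samples:
        for _ in range(n_copies):
            new = {}
            for name in ("current", "voltage", "speed"):
                # Stage 1: each parameter gets its own noise level.
                noisy = row[name] * (1.0 + rng.gauss(0.0, noise_scale[name]))
                # Stage 2: clip to the hard physical limits.
                low, high = bounds[name]
                new[name] = min(max(noisy, low), high)
            # Stage 3: keep the derived quantity consistent with its parents.
            new["heat_input"] = new["current"] * new["voltage"] / new["speed"]
            synthetic.append(new)
    return synthetic

# Hypothetical single measurement and settings.
measured = [{"current": 100.0, "voltage": 12.0, "speed": 2.0}]
noise = {"current": 0.05, "voltage": 0.02, "speed": 0.10}
limits = {"current": (80.0, 120.0), "voltage": (10.0, 14.0), "speed": (1.0, 3.0)}
rows = augment(measured, noise, limits, n_copies=5)
```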

Title: FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Authors: Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02161
Pdf URL: https://arxiv.org/pdf/2512.02161
Copy Paste: [[2512.02161]] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges(https://arxiv.org/abs/2512.02161)
Keywords: generation, generative
Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

Title: SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

Authors: Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02172
Pdf URL: https://arxiv.org/pdf/2512.02172
Copy Paste: [[2512.02172]] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting(https://arxiv.org/abs/2512.02172)
Keywords: super-resolution
Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

Title: WhAM: Towards A Translative Model of Sperm Whale Vocalization

Authors: Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2512.02206
Pdf URL: https://arxiv.org/pdf/2512.02206
Copy Paste: [[2512.02206]] WhAM: Towards A Translative Model of Sperm Whale Vocalization(https://arxiv.org/abs/2512.02206)
Keywords: generation
Abstract: Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at this https URL

Title: InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Authors: Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02213
Pdf URL: https://arxiv.org/pdf/2512.02213
Copy Paste: [[2512.02213]] InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages(https://arxiv.org/abs/2512.02213)
Keywords: generation
Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer that uses retrieval-augmented generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.

Title: Uncertainty Reasoning with Photonic Bayesian Machines

Authors: F. Brückerhoff-Plückelmann, H. Borras, S. U. Hulyal, L. Meyer, X. Ji, J. Hu, J. Sun, B. Klein, F. Ebert, J. Dijkstra, L. McRae, P. Schmidt, T. J. Kippenberg, H. Fröning, W. Pernice
Subjects: cs.LG, physics.app-ph
Abstract URL: https://arxiv.org/abs/2512.02217
Pdf URL: https://arxiv.org/pdf/2512.02217
Copy Paste: [[2512.02217]] Uncertainty Reasoning with Photonic Bayesian Machines(https://arxiv.org/abs/2512.02217)
Keywords: generation
Abstract: Artificial intelligence (AI) systems increasingly influence safety-critical aspects of society, from medical diagnosis to autonomous mobility, making uncertainty awareness a central requirement for trustworthy AI. We present a photonic Bayesian machine that leverages the inherent randomness of chaotic light sources to enable uncertainty reasoning within the framework of Bayesian Neural Networks. The analog processor features a 1.28 Tbit/s digital interface compatible with PyTorch, enabling probabilistic convolution processing within 37.5 ps per convolution. We use the system for simultaneous classification and out-of-domain detection of blood cell microscope images and demonstrate reasoning between aleatoric and epistemic uncertainties. The photonic Bayesian machine removes the bottleneck of pseudo-random number generation in digital systems, minimizes the cost of sampling for probabilistic models, and thus enables high-speed trustworthy AI systems.
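Illustrative sketch (not from the paper): a common way to separate aleatoric from epistemic uncertainty in a Bayesian neural network is the entropy-based decomposition over repeated stochastic forward passes. Whether the authors use exactly this decomposition is an assumption. The function names and the toy probability vectors below are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty from stochastic forward passes.

    sampled_probs: list of class-probability vectors, one per sampled pass.
    total     = entropy of the mean prediction
    aleatoric = mean entropy of the individual predictions
    epistemic = total - aleatoric (the mutual information)
    """
    n = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / n for c in range(n_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in sampled_probs) / n
    return total, aleatoric, total - aleatoric

# Passes agree on an ambiguous answer: uncertainty is purely aleatoric.
ambiguous = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Passes are individually confident but disagree: uncertainty is purely epistemic.
conflicting = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

High epistemic uncertainty of the second kind is the usual signal for out-of-domain inputs, while the first kind flags inputs that are inherently ambiguous.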

Title: Towards Unified Video Quality Assessment

Authors: Chen Feng, Tianhao Peng, Fan Zhang, David Bull
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02224
Pdf URL: https://arxiv.org/pdf/2512.02224
Copy Paste: [[2512.02224]] Towards Unified Video Quality Assessment(https://arxiv.org/abs/2512.02224)
Keywords: quality assessment
Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrate a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work. With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.
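Illustrative sketch (not from the paper): a "ranking-inspired loss" guided by a proxy metric can be pictured as a pairwise hinge loss that only asks the expert to order videos the same way the proxy does. The exact loss used in Unified-VQA is not given in the abstract, so the pairwise hinge form, the `margin` parameter, and the function name are assumptions.

```python
def pairwise_ranking_loss(predicted, proxy, margin=0.0):
    """Mean hinge loss over all pairs ordered by the proxy metric.

    For every pair where proxy[i] > proxy[j], the prediction is penalised
    unless predicted[i] exceeds predicted[j] by at least `margin`.
    Pairs with equal proxy scores are ignored.
    """
    losses = []
    for i in range(len(proxy)):
        for j in range(len(proxy)):
            if proxy[i] > proxy[j]:
                losses.append(max(0.0, margin - (predicted[i] - predicted[j])))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical proxy scores for three videos, best first.
proxy_scores = [0.9, 0.5, 0.1]
loss_same_order = pairwise_ranking_loss([3.0, 2.0, 1.0], proxy_scores)
loss_reversed = pairwise_ranking_loss([1.0, 2.0, 3.0], proxy_scores)
```

Because only the ordering matters, the expert's output scale does not have to match the proxy metric's scale.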

Title: Spatiotemporal Pyramid Flow Matching for Climate Emulation

Authors: Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao, Nomin-Erdene Bayarsaikhan, Daniele Visioni, Andrew Y. Ng, Duncan Watson-Parris
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2512.02268
Pdf URL: https://arxiv.org/pdf/2512.02268
Copy Paste: [[2512.02268]] Spatiotemporal Pyramid Flow Matching for Climate Emulation(https://arxiv.org/abs/2512.02268)
Keywords: generative
Abstract: Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at this https URL.
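Illustrative sketch (not from the paper): plain flow matching, the building block SPF extends, trains a network to regress the velocity of a path between a noise sample and a data sample, then samples by integrating that velocity field. The straight-line path, the Euler integrator, and the function names are assumptions for this example. It does not implement the spatiotemporal pyramid or the conditioning on forcings.

```python
def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the regression target velocity.

    Uses the straight-line path x_t = (1 - t) * x0 + t * x1, whose time
    derivative is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for n in range(steps):
        v = velocity_fn(x, n * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 0.0]   # hypothetical source sample
data = [2.0, 4.0]    # hypothetical target sample
x_quarter, target_v = flow_matching_pair(noise, data, t=0.25)

# With a perfectly learned (here: hard-coded) velocity field, sampling recovers the data.
recovered = euler_sample(noise, lambda x, t: target_v)
```

In training, a model would be fitted so that its output at `(x_quarter, 0.25)` matches `target_v`; the hard-coded lambda stands in for that trained model.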

Title: Progressive Image Restoration via Text-Conditioned Video Generation

Authors: Peng Kang, Xijun Wang, Yu Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02273
Pdf URL: https://arxiv.org/pdf/2512.02273
Copy Paste: [[2512.02273]] Progressive Image Restoration via Text-Conditioned Video Generation(https://arxiv.org/abs/2512.02273)
Keywords: restoration, super-resolution, generation
Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
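Illustrative sketch (not from the paper): the claim that restoration quality improves across frames can be checked by scoring every generated frame against the clean reference. The example below uses PSNR only, with its standard definition. The tiny pixel lists and the function names are made up for illustration.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def psnr_per_frame(reference, frames):
    """Score each frame of a restoration trajectory against the clean image."""
    return [psnr(reference, frame) for frame in frames]

# Hypothetical 4-pixel "images": each frame moves closer to the clean target.
clean = [100, 150, 200, 250]
trajectory = [
    [90, 140, 190, 240],    # heavily degraded
    [95, 145, 195, 245],    # partially restored
    [100, 150, 200, 250],   # fully restored
]
scores = psnr_per_frame(clean, trajectory)
```

A monotonically increasing `scores` list is what a well-behaved restoration trajectory would look like under this metric.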

Title: Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation

Authors: Andre Juarez, Luis Salsavilca, Frida Coaquira, Celso Gonzales
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02290
Pdf URL: https://arxiv.org/pdf/2512.02290
Copy Paste: [[2512.02290]] Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation(https://arxiv.org/abs/2512.02290)
Keywords: generation, generative
Abstract: Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that is particularly severe along the Peruvian coast where labeled Sentinel-1 data remain scarce. To address this problem, we propose MORP-Synth, a two-stage synthetic augmentation framework designed to improve transfer from Mediterranean to Peruvian conditions. Stage A applies Morphological Region Perturbation, a curvature-guided label-space method that generates realistic geometric variations of oil and look-alike regions. Stage B renders SAR-like textures from the edited masks using a conditional generative INADE model. We compile a Peruvian dataset of 2112 labeled 512×512 patches from 40 Sentinel-1 scenes (2014-2024), harmonized with the Mediterranean CleanSeaNet benchmark, and evaluate seven segmentation architectures. Models pretrained on Mediterranean data degrade from 67.8% to 51.8% mIoU on the Peruvian domain; MORP-Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).
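Illustrative sketch (not from the paper): the mIoU and per-class IoU figures quoted above follow the standard intersection-over-union definition, shown here on flat label lists. The class coding (0 = sea, 1 = oil, 2 = look-alike) and the toy labels are assumptions for the example.

```python
def mean_iou(prediction, target, n_classes):
    """Per-class IoU and their mean for flat lists of integer class labels.

    Classes absent from both prediction and target are skipped in the mean.
    """
    ious = {}
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(prediction, target) if p == c and t == c)
        union = sum(1 for p, t in zip(prediction, target) if p == c or t == c)
        if union:
            ious[c] = inter / union
    mean = sum(ious.values()) / len(ious) if ious else 0.0
    return ious, mean

# Hypothetical 6-pixel patch: one oil pixel is mistaken for a look-alike.
target = [0, 0, 1, 1, 2, 2]
prediction = [0, 0, 1, 2, 2, 2]
per_class, miou = mean_iou(prediction, target, n_classes=3)
```

A single misclassified pixel lowers the IoU of both minority classes at once, which is why the minority-class IoU is reported separately from the mean.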

Title: Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation

Authors: Kentaro Kubo, Hayato Goto
Subjects: cs.LG, quant-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2512.02323
Pdf URL: https://arxiv.org/pdf/2512.02323
Copy Paste: [[2512.02323]] Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation(https://arxiv.org/abs/2512.02323)
Keywords: generative
Abstract: Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by simulated bifurcation (SB), a quantum-inspired combinatorial optimization algorithm. This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also develop an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.

Title: SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02337
Pdf URL: https://arxiv.org/pdf/2512.02337
Copy Paste: [[2512.02337]] SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification(https://arxiv.org/abs/2512.02337)
Keywords: generation
Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.
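Illustrative sketch (not from the paper): the draft-verify loop of speculative decoding, reduced to greedy decoding with toy next-token functions. The function names and the toy "models" are hypothetical. Real systems verify all drafted tokens in one parallel pass of the target model, whereas this sketch checks them one by one for clarity, and it does not model SpecPV's partial key-value verification.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, draft_len=4):
    """Greedy draft-verify decoding with toy next-token functions.

    target_next / draft_next map a token list to the next token. The draft
    proposes `draft_len` tokens; the target keeps the longest agreeing prefix
    and supplies its own token at the first disagreement, so the output
    matches plain greedy decoding with the target model.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # Draft phase: the cheap model extends the sequence speculatively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # Verify phase: accept drafted tokens until the target disagrees.
        for tok in proposal:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != tok or len(tokens) == goal:
                break
    return tokens

def toy_target(tokens):
    """Hypothetical target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10

def toy_draft(tokens):
    """Hypothetical draft model: agrees with the target except right after a 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

output = speculative_decode(toy_target, toy_draft, prompt=[0], n_new=8)
```

The speedup in practice comes from accepting several drafted tokens per target pass; the abstract's observation is that this verification pass itself becomes the bottleneck as the context grows.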

Title: Understanding and Harnessing Sparsity in Unified Multimodal Models

Authors: Shwai He, Chaorui Deng, Ang Li, Shen Yan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02351
Pdf URL: https://arxiv.org/pdf/2512.02351
Copy Paste: [[2512.02351]] Understanding and Harnessing Sparsity in Unified Multimodal Models(https://arxiv.org/abs/2512.02351)
Keywords: generation
Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at this https URL.
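Illustrative sketch (not from the paper): sparse activation in a Mixture-of-Experts layer is commonly implemented with top-k routing, where only the k highest-scoring experts run for a given input. The routing rule, the value of `k`, and the toy scalar "experts" below are assumptions. The abstract does not specify how the adapted BAGEL model routes between experts.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest router logits; all other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exp = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exp.values())
    return {i: v / norm for i, v in exp.items()}

def sparse_moe(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())

# Four hypothetical experts; only two are evaluated for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
router_logits = [0.0, 1.0, 1.0, -2.0]
gate = top_k_gate(router_logits, k=2)
y = sparse_moe(2.0, experts, router_logits, k=2)
```

With `k=2` out of four experts, half of the expert parameters are skipped for each input, which is the kind of saving sparse activation is meant to deliver.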

Title: On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning

Authors: Liyuan Lou, Wanyun Li, Wentian Gan, Yifei Yu, Tengfei Wang, Xin Wang, Zongqian Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02375
Pdf URL: https://arxiv.org/pdf/2512.02375
Copy Paste: [[2512.02375]] On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning(https://arxiv.org/abs/2512.02375)
Keywords: generation, quality assessment
Abstract: Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin maintenance. However, most existing methods focus on processing captured images or sequential frames in real time, without explicitly evaluating the quality of the on-the-go 3D reconstruction or providing guided feedback to enhance image acquisition in the target area. This work presents On-the-fly Feedback SfM, an explore-and-exploit framework for real-time UAV photogrammetry, enabling iterative exploration of unseen regions and exploitation of already observed and reconstructed areas in near real time. Built upon SfM on-the-fly, the proposed method integrates three modules: (1) online incremental coarse-mesh generation for a dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. Comprehensive experiments demonstrate that our method achieves in-situ reconstruction and evaluation in near real time while providing actionable feedback that markedly reduces coverage gaps and re-flight costs. By integrating data collection, processing, 3D reconstruction and assessment, and online feedback, On-the-fly Feedback SfM could serve as an alternative for the transition from the traditional passive working mode to a more intelligent and adaptive exploration workflow. Code is now available at this https URL.

Title: ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

Authors: Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2512.02403
Pdf URL: https://arxiv.org/pdf/2512.02403
Copy Paste: [[2512.02403]] ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity(https://arxiv.org/abs/2512.02403)
Keywords: generation
Abstract: Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.

Title: Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles

Authors: Yizhou Zhang, Lun Du
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02409
Pdf URL: https://arxiv.org/pdf/2512.02409
Copy Paste: [[2512.02409]] Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles(https://arxiv.org/abs/2512.02409)
Keywords: generation
Abstract: Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains -- most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes time-dependent data curation, showing that an ideal oracle capable of tracking spectral residuals and continuously re-normalizing the tail can provably accelerate learning -- although practical systems can only approximate this behavior.

Title: MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

Authors: Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02413
Pdf URL: https://arxiv.org/pdf/2512.02413
Copy Paste: [[2512.02413]] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture(https://arxiv.org/abs/2512.02413)
Keywords: generation
Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
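
The Tversky loss mentioned in the MitUNet abstract can be illustrated with a minimal sketch. This is the generic textbook form, 1 - TP / (TP + alpha*FP + beta*FN), written for flat lists of soft mask values. The function name `tversky_loss` and the default weights `alpha=0.7`, `beta=0.3` are illustrative assumptions, not the hyperparameters used by the authors.

```python
def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-7):
    """Generic Tversky loss for flat soft masks with values in [0, 1].

    alpha weights false positives and beta weights false negatives.
    The defaults are hypothetical; alpha > beta penalizes false
    positives more strongly. alpha = beta = 0.5 recovers the Dice loss.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(p * t for p, t in zip(pred, target))          # soft true positives
    fp = sum(p * (1.0 - t) for p, t in zip(pred, target))  # soft false positives
    fn = sum((1.0 - p) * t for p, t in zip(pred, target))  # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


if __name__ == "__main__":
    wall = [1.0, 1.0, 0.0, 0.0]
    one_false_positive = [1.0, 1.0, 1.0, 0.0]
    one_false_negative = [1.0, 0.0, 0.0, 0.0]
    print(tversky_loss(one_false_positive, wall))
    print(tversky_loss(one_false_negative, wall))
```

With alpha larger than beta, a spurious wall pixel costs more than a missed one, which matches the abstract's stated aim of suppressing false positive noise along wall boundaries.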

Title: LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework

Authors: Daeyoung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02437
Pdf URL: https://arxiv.org/pdf/2512.02437
Copy Paste: [[2512.02437]] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework(https://arxiv.org/abs/2512.02437)
Keywords: generative
Abstract: As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on the human visual field. Mainly characterized by dimmed and blurred vision, or peripheral vision loss, glaucoma is well known to occur due to damage to the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma-related work and clinical diagnosis have focused on detecting this damage to the optic nerve using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advances in computer vision models such as VGG16 and Vision Transformers (ViT), AI-automated glaucoma detection and optic cup segmentation based on retinal fundus images or OCT have shown strong performance in aiding conventional diagnosis. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduces a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93-99% fewer weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.
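
For readers unfamiliar with HSIC (the Hilbert-Schmidt Independence Criterion), the following is a minimal sketch of the standard biased empirical estimator, tr(K H L H) / (n - 1)^2, for scalar samples with Gaussian kernels. It only illustrates the dependence measure itself. How LightHCG applies HSIC to its latent space is not described in the abstract, and the helper names `gaussian_gram`, `center`, and `hsic` and the bandwidth `sigma` are assumptions made for this example.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 * sigma^2)) for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) * ones."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two equally long lists of scalars."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    kc = center(gaussian_gram(x, sigma))
    lc = center(gaussian_gram(y, sigma))
    # tr(K H L H) equals the elementwise inner product of the centered matrices.
    trace = sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    print(hsic(x, x))          # strongly dependent: positive value
    print(hsic(x, [5.0] * 4))  # constant second variable: zero
```

A disentanglement objective would typically push such a value toward zero between pairs of latent factors, but that usage is an assumption here rather than a detail taken from the paper.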

Title: Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors

Authors: Fan Luo, Zeyu Gao, Xinhao Luo, Kai Zhao, Yanfeng Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02447
Pdf URL: https://arxiv.org/pdf/2512.02447
Copy Paste: [[2512.02447]] Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors(https://arxiv.org/abs/2512.02447)
Keywords: generation
Abstract: Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model's expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs' capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.

Title: ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation

Authors: Kerui Chen, Jianrong Zhang, Ming Li, Zhonglong Zheng, Hehe Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02453
Pdf URL: https://arxiv.org/pdf/2512.02453
Copy Paste: [[2512.02453]] ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation(https://arxiv.org/abs/2512.02453)
Keywords: generation
Abstract: Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.

Title: Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Authors: Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02457
Pdf URL: https://arxiv.org/pdf/2512.02457
Copy Paste: [[2512.02457]] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation(https://arxiv.org/abs/2512.02457)
Keywords: generation, generative
Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

Title: WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

Authors: Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02473
Pdf URL: https://arxiv.org/pdf/2512.02473
Copy Paste: [[2512.02473]] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling(https://arxiv.org/abs/2512.02473)
Keywords: generation
Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.

Title: YingVideo-MV: Music-Driven Multi-Stage Video Generation

Authors: Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02492
Pdf URL: https://arxiv.org/pdf/2512.02492
Copy Paste: [[2512.02492]] YingVideo-MV: Music-Driven Multi-Stage Video Generation(https://arxiv.org/abs/2512.02492)
Keywords: generation
Abstract: While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: this https URL.

Title: dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Authors: Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02498
Pdf URL: https://arxiv.org/pdf/2512.02498
Copy Paste: [[2512.02498]] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model(https://arxiv.org/abs/2512.02498)
Keywords: generation
Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

Title: GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02505
Pdf URL: https://arxiv.org/pdf/2512.02505
Copy Paste: [[2512.02505]] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding(https://arxiv.org/abs/2512.02505)
Keywords: generation, generative
Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

Title: Water Quality Estimation Through Machine Learning Multivariate Analysis

Authors: Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02508
Pdf URL: https://arxiv.org/pdf/2512.02508
Copy Paste: [[2512.02508]] Water Quality Estimation Through Machine Learning Multivariate Analysis(https://arxiv.org/abs/2512.02508)
Keywords: quality assessment
Abstract: The quality of water is key to the quality of the agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of water quality is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning for water quality assessment, aiming to ensure water safety and compliance with water regulations. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.

Title: Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Authors: Aditya Chaudhary, Prachet Dev Singh, Ankit Jha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02512
Pdf URL: https://arxiv.org/pdf/2512.02512
Copy Paste: [[2512.02512]] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling(https://arxiv.org/abs/2512.02512)
Keywords: restoration, super-resolution
Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
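
The two ideas in the ViT-SR abstract that lend themselves to a worked example are the residual formulation (output = bicubic upsampling + predicted high-frequency residual) and the PSNR metric. The sketch below uses flat lists of pixel intensities in [0, 1]. The function names `add_residual`, `psnr`, and `mse_from_psnr` are illustrative assumptions, and the residual here is a plain input rather than a network prediction.

```python
import math


def add_residual(bicubic, residual):
    """Combine an upsampled base image with a residual, clipped to [0, 1]."""
    if len(bicubic) != len(residual):
        raise ValueError("bicubic and residual must have the same length")
    return [min(1.0, max(0.0, b + r)) for b, r in zip(bicubic, residual)]


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))


if __name__ == "__main__":
    target = [0.2, 0.4, 0.6, 0.8]
    bicubic = [0.25, 0.35, 0.65, 0.75]
    residual = [-0.05, 0.05, -0.05, 0.05]  # hypothetical perfect residual
    print(psnr(target, bicubic))
    print(psnr(target, add_residual(bicubic, residual)))
    print(mse_from_psnr(22.90))
```

Under the definition above, the reported 22.90 dB corresponds to a mean squared error of roughly 0.005 on a [0, 1] intensity scale. The abstract does not state which intensity range or colour channel the authors used.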

Title: OmniPerson: Unified Identity-Preserving Pedestrian Generation

Authors: Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02554
Pdf URL: https://arxiv.org/pdf/2512.02554
Copy Paste: [[2512.02554]] OmniPerson: Unified Identity-Preserving Pedestrian Generation(https://arxiv.org/abs/2512.02554)
Keywords: super-resolution, generation
Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR image and video generation conditioned on any number of reference images, two kinds of person poses, and text, and also provides RGB-to-IR transfer and image super-resolution. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, allowing OmniPerson to distill a unified identity from a set of multi-view reference images and generate high-fidelity pedestrians. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

Title: Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Authors: Yafei Song, Peng Zhang, Bang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02576
Pdf URL: https://arxiv.org/pdf/2512.02576
Copy Paste: [[2512.02576]] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval(https://arxiv.org/abs/2512.02576)
Keywords: generation
Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

Title: GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02581
Pdf URL: https://arxiv.org/pdf/2512.02581
Copy Paste: [[2512.02581]] GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies(https://arxiv.org/abs/2512.02581)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.

Title: RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence

Authors: Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, Boxi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02622
Pdf URL: https://arxiv.org/pdf/2512.02622
Copy Paste: [[2512.02622]] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence(https://arxiv.org/abs/2512.02622)
Keywords: generation
Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms: text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.

Title: PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Authors: Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02624
Pdf URL: https://arxiv.org/pdf/2512.02624
Copy Paste: [[2512.02624]] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding(https://arxiv.org/abs/2512.02624)
Keywords: generation
Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

Title: Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.02636
Pdf URL: https://arxiv.org/pdf/2512.02636
Copy Paste: [[2512.02636]] Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models(https://arxiv.org/abs/2512.02636)
Keywords: generative
Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024-step teacher model with only a single additional backward NFE.
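Note: for reference, the coupled ODEs mentioned in the abstract can be written in the standard continuous normalizing flow form. The notation below (velocity field v_theta, time running from t = 0 at the base distribution to t = 1 at the data) is an assumption for illustration and is not taken from the paper.

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t),
\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\log p_t(x_t) = -\nabla \cdot v_\theta(x_t, t),
\qquad\text{so}\qquad
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\,\mathrm{d}t .
```

Both equations depend on the same field v_theta, which is why the sample path and the accumulated divergence can in principle be predicted by one model.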

Title: Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening

Authors: Yongchuan Cui, Peng Liu, Yi Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02643
Pdf URL: https://arxiv.org/pdf/2512.02643
Copy Paste: [[2512.02643]] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening(https://arxiv.org/abs/2512.02643)
Keywords: generation
Abstract: Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.

Title: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Authors: Junwon Lee, Juhan Nam, Jiyoung Lee
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2512.02650
Pdf URL: https://arxiv.org/pdf/2512.02650
Copy Paste: [[2512.02650]] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation(https://arxiv.org/abs/2512.02650)
Keywords: generation
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at this https URL.

Title: Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models

Authors: Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02657
Pdf URL: https://arxiv.org/pdf/2512.02657
Copy Paste: [[2512.02657]] Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models(https://arxiv.org/abs/2512.02657)
Keywords: generative
Abstract: The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's "Right to be Forgotten." This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative-distillation-based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.

Title: Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

Authors: Agathoklis Georgiou
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2512.02660
Pdf URL: https://arxiv.org/pdf/2512.02660
Copy Paste: [[2512.02660]] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation(https://arxiv.org/abs/2512.02660)
Keywords: generation
Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
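Note: a minimal sketch of the patch-to-region idea described in the abstract, not the Snappy implementation. It assumes a square patch grid laid over the page, one similarity score per patch, and OCR boxes given as (x0, y0, x1, y1) in page pixels. The function names and the area-weighted averaging rule are hypothetical choices made for illustration.

```python
def patch_box(row, col, grid, page_w, page_h):
    """Pixel rectangle (x0, y0, x1, y1) covered by patch (row, col) in a grid x grid layout."""
    pw, ph = page_w / grid, page_h / grid
    return (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)


def intersection_area(a, b):
    """Overlap area of two axis-aligned rectangles; 0.0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def region_relevance(scores, ocr_box, page_w, page_h):
    """Area-weighted mean of the patch scores over all patches that overlap ocr_box."""
    grid = len(scores)
    total, weight = 0.0, 0.0
    for r in range(grid):
        for c in range(grid):
            area = intersection_area(patch_box(r, c, grid, page_w, page_h), ocr_box)
            total += scores[r][c] * area
            weight += area
    return total / weight if weight > 0 else 0.0
```

OCR regions can then be ranked or filtered by region_relevance; a region that straddles several patches inherits a blend of their scores in proportion to the overlap.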

Title: Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

Authors: Haozhuo Zheng, Cheng Wang, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02667
Pdf URL: https://arxiv.org/pdf/2512.02667
Copy Paste: [[2512.02667]] Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents(https://arxiv.org/abs/2512.02667)
Keywords: generation, generative
Abstract: The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
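Note: the discrete latents mentioned in the abstract come from vector quantization. The sketch below shows only the generic nearest-codebook lookup used by VQ-VAE-style models, under the assumption of plain squared Euclidean distance; it is not the GVT architecture, and the function names are hypothetical.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared Euclidean distance)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(x, codebook[k]))
        for x in vectors
    ]


def dequantize(indices, codebook):
    """Recover the quantized vectors from their codebook indices."""
    return [codebook[k] for k in indices]
```

The resulting index list is the kind of token sequence an autoregressive Transformer could then be trained on.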

Title: PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution

Authors: Zhongbao Yang, Jiangxin Dong, Yazhou Yao, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02681
Pdf URL: https://arxiv.org/pdf/2512.02681
Copy Paste: [[2512.02681]] PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution(https://arxiv.org/abs/2512.02681)
Keywords: restoration, super-resolution
Abstract: Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while reserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at this https URL.

Title: ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Authors: Yuxing Liu, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02686
Pdf URL: https://arxiv.org/pdf/2512.02686
Copy Paste: [[2512.02686]] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data(https://arxiv.org/abs/2512.02686)
Keywords: generation
Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
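Note: FPR95 is the false positive rate measured at the score threshold where 95% of true anomalies are still detected, so lower is better. The sketch below is one common way to compute it, assuming higher scores mean "more anomalous"; the function name is hypothetical and this is not the paper's evaluation code. It returns a fraction, whereas the abstract reports percentages.

```python
import math


def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """False positive rate at the highest threshold that still recovers at least `tpr` of positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))   # number of positives that must be detected
    threshold = ranked[k - 1]          # score of the k-th highest-scoring positive
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return false_pos / len(neg_scores)
```

In anomaly segmentation the two score lists would hold per-pixel anomaly scores for OoD pixels and in-distribution pixels respectively.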

Title: LumiX: Structured and Coherent Text-to-Intrinsic Generation

Authors: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02781
Pdf URL: https://arxiv.org/pdf/2512.02781
Copy Paste: [[2512.02781]] LumiX: Structured and Coherent Text-to-Intrinsic Generation(https://arxiv.org/abs/2512.02781)
Keywords: generation
Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.

Title: IC-World: In-Context Generation for Shared World Modeling

Authors: Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02793
Pdf URL: https://arxiv.org/pdf/2512.02793
Copy Paste: [[2512.02793]] IC-World: In-Context Generation for Shared World Modeling(https://arxiv.org/abs/2512.02793)
Keywords: generation
Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world in different camera poses. We propose IC-World, a novel generation framework, enabling parallel generation for all input images via activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
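Group Relative Policy Optimization scores each sample relative to the other samples generated for the same input, rather than against a learned value baseline. A minimal sketch of that normalization step follows; the reward values are made up, and how IC-World combines its geometry and motion rewards is not stated in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input.

    Each advantage is the reward's deviation from the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical scalar rewards for four generations from the same input set.
rewards = [0.2, 0.5, 0.9, 0.4]
adv = group_relative_advantages(rewards)
print(adv)
```

Samples above the group mean get positive advantages and are reinforced; samples below it are penalized.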

Title: PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Authors: Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02794
Pdf URL: https://arxiv.org/pdf/2512.02794
Copy Paste: [[2512.02794]] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation(https://arxiv.org/abs/2512.02794)
Keywords: generation
Abstract: Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts, such as styles and shapes, to control the generation process. However, few efforts address the realistic yet challenging customization of physical concepts. The core limitation of current methods is that physical knowledge is not explicitly introduced during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses that activate diffusion models to perform physical customization. Specifically, the proposed isometric loss aims to activate diffusion models to learn physical concepts, while the decouple loss helps eliminate the mixed learning of independent concepts. Experiments are conducted on a diverse dataset, and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in physical customization, both quantitatively and qualitatively.

Title: From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

Authors: Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02826
Pdf URL: https://arxiv.org/pdf/2512.02826
Copy Paste: [[2512.02826]] From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity(https://arxiv.org/abs/2512.02826)
Keywords: generative
Abstract: Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
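The closed-form oracle velocity can be sketched for a finite dataset. The example below assumes the common linear path x_t = (1 - t) x_0 + t x_1 with a standard Gaussian source x_0 and data x_1 drawn uniformly from the training set; the paper's exact convention may differ, and the function name and toy data are illustrative only. Under these assumptions the marginal velocity is a softmax-weighted average of per-sample velocities (x_i - x) / (1 - t).

```python
import numpy as np

def oracle_velocity(x, t, data):
    """Marginal flow-matching velocity at point x and time t in [0, 1).

    Assumes x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, I), and x1 uniform
    over the rows of `data`. Returns the velocity and posterior weights.
    """
    diff = x[None, :] - t * data                      # (n, d)
    logits = -np.sum(diff ** 2, axis=1) / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max()                            # numerical stability
    w = np.exp(logits)
    w /= w.sum()                                      # p(x1 = x_i | x_t = x)
    cond_vel = (data - x[None, :]) / (1.0 - t)        # per-sample velocities
    return w @ cond_vel, w

data = np.array([[-2.0, 0.0], [2.0, 0.0]])            # toy two-point dataset
x = np.array([0.5, 0.1])
v_early, w_early = oracle_velocity(x, 0.01, data)
v_late, w_late = oracle_velocity(x, 0.95, data)
print(w_early, w_late)
```

Early in the trajectory the weights are close to uniform, so the target points toward a mixture of data modes; late in the trajectory the weights collapse onto a single sample, which matches the navigation and refinement stages described in the abstract.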

Title: Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Authors: Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, Changhu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02870
Pdf URL: https://arxiv.org/pdf/2512.02870
Copy Paste: [[2512.02870]] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward(https://arxiv.org/abs/2512.02870)
Keywords: generation
Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
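The segment-level reward can be sketched as follows, assuming camera poses are already available as 4x4 camera-to-world matrices (the abstract does not name the trajectory estimator). The segment length, the error terms, and the exponential scoring are assumptions made for illustration, not the authors' reward definition.

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in the coordinate frame of frame a."""
    return np.linalg.inv(T_a) @ T_b

def rotation_angle(R):
    """Geodesic angle (radians) of a 3x3 rotation matrix."""
    c = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(c))

def segment_rewards(gen_poses, ref_poses, seg_len=4):
    """One reward per segment, comparing generated and reference relative poses."""
    rewards = []
    for s in range(0, len(gen_poses) - seg_len, seg_len):
        rel_g = relative_pose(gen_poses[s], gen_poses[s + seg_len])
        rel_r = relative_pose(ref_poses[s], ref_poses[s + seg_len])
        rot_err = rotation_angle(rel_g[:3, :3].T @ rel_r[:3, :3])
        trans_err = np.linalg.norm(rel_g[:3, 3] - rel_r[:3, 3])
        # Hypothetical scoring: 1.0 for a perfect match, decaying with error.
        rewards.append(float(np.exp(-(rot_err + trans_err))))
    return rewards

def straight_line(n, step):
    """Toy trajectory: n poses translating along x with no rotation."""
    poses = []
    for i in range(n):
        T = np.eye(4)
        T[0, 3] = i * step
        poses.append(T)
    return poses

ref = straight_line(9, 0.1)
same = segment_rewards(ref, ref)
drift = segment_rewards(straight_line(9, 0.2), ref)
print(same, drift)
```

Because every segment receives its own score, a trajectory that deviates only in its second half is penalized only there, which is the dense feedback the abstract contrasts with a single sparse reward.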

Title: MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Authors: Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02895
Pdf URL: https://arxiv.org/pdf/2512.02895
Copy Paste: [[2512.02895]] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm(https://arxiv.org/abs/2512.02895)
Keywords: generation
Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities and generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization, to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.

Title: Glance: Accelerating Diffusion Models with 1 Sample

Authors: Zhuobai Dong, Rui Zhao, Songjie Wu, Junchao Yi, Linjie Li, Zhengyuan Yang, Lijuan Wang, Alex Jinpeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02899
Pdf URL: https://arxiv.org/pdf/2512.02899
Copy Paste: [[2512.02899]] Glance: Accelerating Diffusion Models with 1 Sample(https://arxiv.org/abs/2512.02899)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.

Title: MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Authors: Fan Yang, Kaihao Zhang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2512.02906
Pdf URL: https://arxiv.org/pdf/2512.02906
Copy Paste: [[2512.02906]] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding(https://arxiv.org/abs/2512.02906)
Keywords: generation
Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.

Title: DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation

Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Chenyang Si
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02931
Pdf URL: https://arxiv.org/pdf/2512.02931
Copy Paste: [[2512.02931]] DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation(https://arxiv.org/abs/2512.02931)
Keywords: generation, generative
Abstract: In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.

Title: Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

Authors: Lanxiang Hu, Abhilash Shankarampeta, Yixin Huang, Zilin Dai, Haoyang Yu, Yujie Zhao, Haoqiang Kang, Daniel Zhao, Tajana Rosing, Hao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02942
Pdf URL: https://arxiv.org/pdf/2512.02942
Copy Paste: [[2512.02942]] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench(https://arxiv.org/abs/2512.02942)
Keywords: generation
Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at this https URL.

Title: Layout Anything: One Transformer for Universal Room Layout Estimation

Authors: Md Sohag Mia, Muhammad Abdullah Adnan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02952
Pdf URL: https://arxiv.org/pdf/2512.02952
Copy Paste: [[2512.02952]] Layout Anything: One Transformer for Universal Room Layout Estimation(https://arxiv.org/abs/2512.02952)
Keywords: generation
Abstract: We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on LSUN, PE of 7.04% (CE 5.17%) on Hedau, and PE of 4.03% (CE 3.15%) on Matterport3D-Layout. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.

Title: Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis

Authors: Jennifer Zvonek, Andrew Gillette
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2512.02967
Pdf URL: https://arxiv.org/pdf/2512.02967
Copy Paste: [[2512.02967]] Pruning AMR: Efficient Visualization of Implicit Neural Representations via Weight Matrix Analysis(https://arxiv.org/abs/2512.02967)
Keywords: generation
Abstract: An implicit neural representation (INR) is a neural network that approximates a spatiotemporal function. Many memory-intensive visualization tasks, including modern 4D CT scanning methods, represent data natively as INRs. While INRs are prized for being more memory-efficient than traditional data stored on a lattice, many visualization tasks still require discretization to a regular grid. We present PruningAMR, an algorithm that builds a mesh with resolution adapted to geometric features encoded by the INR. To identify these geometric features, we use an interpolative decomposition pruning method on the weight matrices of the INR. The resulting pruned network is used to guide adaptive mesh refinement, enabling automatic mesh generation tailored to the underlying resolution of the function. Starting from a pre-trained INR--without access to its training data--we produce a variable resolution visualization with substantial memory savings.

Title: U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Authors: Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.02982
Pdf URL: https://arxiv.org/pdf/2512.02982
Copy Paste: [[2512.02982]] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences(https://arxiv.org/abs/2512.02982)
Keywords: generation, generative
Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

Title: TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Authors: Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02993
Pdf URL: https://arxiv.org/pdf/2512.02993
Copy Paste: [[2512.02993]] TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond(https://arxiv.org/abs/2512.02993)
Keywords: generation
Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

Title: AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry

Authors: Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Yilin Liu, Durvesh Malpure, Pete Meltzer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03018
Pdf URL: https://arxiv.org/pdf/2512.03018
Copy Paste: [[2512.03018]] AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry(https://arxiv.org/abs/2512.03018)
Keywords: generation
Abstract: The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at this https URL.
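The breadth-first ordering of faces mentioned in the abstract can be illustrated on a toy face adjacency graph. This is a minimal sketch: the choice of starting face and the tie-breaking rule (ascending face index) are assumptions, and the graph below is hypothetical rather than taken from a real CAD model.

```python
from collections import deque

def bfs_face_order(adjacency, start):
    """Return face indices in breadth-first order from `start`.

    `adjacency` maps each face index to the faces sharing an edge with it.
    Neighbors are visited in ascending index order (an assumed tie-break).
    """
    order, seen, queue = [], {start}, deque([start])
    while queue:
        face = queue.popleft()
        order.append(face)
        for nb in sorted(adjacency[face]):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

# Hypothetical face adjacency graph with five faces.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
order = bfs_face_order(adjacency, 0)
print(order)
```

With this ordering, each newly emitted face is adjacent to a face already in the sequence, which is what lets neighboring faces and edges be generated progressively.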

Title: Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Authors: Kehan Qi, Saumya Gupta, Qingqiao Hu, Weimin Lyu, Chao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03020
Pdf URL: https://arxiv.org/pdf/2512.03020
Copy Paste: [[2512.03020]] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction(https://arxiv.org/abs/2512.03020)
Keywords: generative
Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.

Title: MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Authors: Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03034
Pdf URL: https://arxiv.org/pdf/2512.03034
Copy Paste: [[2512.03034]] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation(https://arxiv.org/abs/2512.03034)
Keywords: generation
Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human [...]. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary [...]. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these [...]. [...], to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video [...]. [...], we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content [...]. [...] experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.

Title: ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Authors: Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03036
Pdf URL: https://arxiv.org/pdf/2512.03036
Copy Paste: [[2512.03036]] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation(https://arxiv.org/abs/2512.03036)
Keywords: generation
Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: this https URL.
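For readers unfamiliar with conditional flow matching, a minimal sketch of the training target follows. It assumes the common straight-line probability path between a noise sample and a data sample; the abstract does not specify ViSAudio's path, its dual-branch architecture, or its conditioning, so the function names and the toy latent below are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Time derivative of the straight path; this is the regression target."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t, condition):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, condition)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

random.seed(0)
x1 = [0.5, -1.0, 2.0]                        # stand-in for an audio latent
x0 = [random.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample

# A perfect predictor (for illustration only) returns the true velocity.
oracle = lambda x_t, t, cond: target_velocity(x0, x1)
loss = flow_matching_loss(oracle, x0, x1, 0.3, condition=None)
print(loss)
```

In a real model `predict_velocity` would be a neural network that also consumes the conditioning signal (here, video features), and `t` would be sampled uniformly during training.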

Title: Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Authors: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03040
Pdf URL: https://arxiv.org/pdf/2512.03040
Copy Paste: [[2512.03040]] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation(https://arxiv.org/abs/2512.03040)
Keywords: generation, generative
Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

Title: MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Authors: Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03041
Pdf URL: https://arxiv.org/pdf/2512.03041
Copy Paste: [[2512.03041]] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework(https://arxiv.org/abs/2512.03041)
Keywords: generation
Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
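One plausible reading of "explicit phase shift at shot transitions" is an extra offset added to the temporal position index at every shot boundary before applying standard rotary embeddings. The sketch below illustrates that reading only; the offset scheme, the `shift` value, and the function names are assumptions, not the paper's formulation.

```python
import math

def shot_positions(shot_lengths, shift):
    """Temporal position per frame, with an extra offset of `shift`
    inserted at every shot boundary (hypothetical scheme)."""
    positions = []
    for shot_index, length in enumerate(shot_lengths):
        start = sum(shot_lengths[:shot_index]) + shot_index * shift
        positions.extend(start + t for t in range(length))
    return positions

def rope_rotate(vec, position, base=10000.0):
    """Standard rotary embedding: rotate consecutive pairs of `vec`
    by angles proportional to `position`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

# Two shots of 3 and 2 frames; frames keep their order, but the second
# shot is pushed 10 positions further away from the first.
positions = shot_positions([3, 2], shift=10)
print(positions)  # [0, 1, 2, 13, 14]
query = rope_rotate([1.0, 2.0, 3.0, 4.0], positions[3])
```

Because rotary attention scores depend only on position differences, the offset makes frames across a cut look further apart than adjacent frames within a shot while the overall ordering is preserved.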

Title: PPTArena: A Benchmark for Agentic PowerPoint Editing

Authors: Michael Ofengenden, Yunze Man, Ziqi Pang, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03042
Pdf URL: https://arxiv.org/pdf/2512.03042
Copy Paste: [[2512.03042]] PPTArena: A Benchmark for Agentic PowerPoint Editing(https://arxiv.org/abs/2512.03042)
Keywords: generation
Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.

Title: CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Authors: Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03045
Pdf URL: https://arxiv.org/pdf/2512.03045
Copy Paste: [[2512.03045]] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models(https://arxiv.org/abs/2512.03045)
Keywords: generation
Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.

Title: MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03046
Pdf URL: https://arxiv.org/pdf/2512.03046
Copy Paste: [[2512.03046]] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues(https://arxiv.org/abs/2512.03046)
Keywords: generation, generative
Abstract: We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.