2025-12-12

Title: Diffusion Is Your Friend in Show, Suggest and Tell

Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.10038
Pdf URL: https://arxiv.org/pdf/2512.10038
Copy Paste: [[2512.10038]] Diffusion Is Your Friend in Show, Suggest and Tell(https://arxiv.org/abs/2512.10038)
Keywords: generation, generative
Abstract: Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: this https URL\_suggest\_tell.

Title: MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Authors: Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd, Lisa L. Barnes, Walter A. Kukull, Lori L. Beason-Held, Susan M. Resnick, Timothy J. Hohman, Warren D. Taylor, Bennett A. Landman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10041
Pdf URL: https://arxiv.org/pdf/2512.10041
Copy Paste: [[2512.10041]] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata(https://arxiv.org/abs/2512.10041)
Keywords: generation, generative
Abstract: Modern deep learning methods have achieved impressive results across tasks from disease classification, estimating continuous biomarkers, to generating realistic medical images. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

Title: Local LLM Ensembles for Zero-shot Portuguese Named Entity Recognition

Authors: João Lucas Luz Lima Sarcinelli, Diego Furtado Silva
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.10043
Pdf URL: https://arxiv.org/pdf/2512.10043
Copy Paste: [[2512.10043]] Local LLM Ensembles for Zero-shot Portuguese Named Entity Recognition(https://arxiv.org/abs/2512.10043)
Keywords: generation
Abstract: Large Language Models (LLMs) excel in many Natural Language Processing (NLP) tasks through in-context learning but often under-perform in Named Entity Recognition (NER), especially for lower-resource languages like Portuguese. While open-weight LLMs enable local deployment, no single model dominates all tasks, motivating ensemble approaches. However, existing LLM ensembles focus on text generation or classification, leaving NER under-explored. In this context, this work proposes a novel three-step ensemble pipeline for zero-shot NER using similarly capable, locally run LLMs. Our method outperforms individual LLMs in four out of five Portuguese NER datasets by leveraging a heuristic to select optimal model combinations with minimal annotated data. Moreover, we show that ensembles obtained on different source datasets generally outperform individual LLMs in cross-dataset configurations, potentially eliminating the need for annotated data for the current task. Our work advances scalable, low-resource, and zero-shot NER by effectively combining multiple small LLMs without fine-tuning. Code is available at this https URL.
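The abstract above does not spell out the three steps of the ensemble pipeline, so the following is only a minimal sketch of one common way to combine zero-shot NER outputs from several models: majority voting over predicted entities. The function name `vote_entities`, the `(text, label)` entity representation, and the default threshold are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def vote_entities(predictions, min_votes=None):
    """Keep entities predicted by at least `min_votes` models.

    predictions: list with one set per model; each set holds
                 hypothetical (entity_text, label) tuples.
    min_votes:   votes required to keep an entity; defaults to a
                 strict majority of the models.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    # Count each entity once per model, even if a model repeats it.
    counts = Counter(ent for model_pred in predictions for ent in set(model_pred))
    return sorted(ent for ent, votes in counts.items() if votes >= min_votes)


# Toy outputs from three hypothetical local LLMs for one sentence.
model_outputs = [
    {("Lisboa", "LOC"), ("Maria", "PER")},
    {("Lisboa", "LOC")},
    {("Lisboa", "LOC"), ("Maria", "PER"), ("Porto", "ORG")},
]
consensus = vote_entities(model_outputs)
```

With three models, the default threshold is two votes, so an entity proposed by a single model is dropped. Choosing which models to include in the vote is the part the abstract attributes to a heuristic and is not sketched here.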

Title: Detailed balance in large language model-driven agents

Authors: Zhuo-Yang Song, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, nlin.AO, physics.data-an
Abstract URL: https://arxiv.org/abs/2512.10047
Pdf URL: https://arxiv.org/pdf/2512.10047
Copy Paste: [[2512.10047]] Detailed balance in large language model-driven agents(https://arxiv.org/abs/2512.10047)
Keywords: generation, generative
Abstract: Large language model (LLM)-driven agents are emerging as a powerful new paradigm for solving complex problems. Despite the empirical success of these practices, a theoretical framework to understand and unify their macroscopic dynamics remains lacking. This Letter proposes a method based on the least action principle to estimate the underlying generative directionality of LLMs embedded within agents. By experimentally measuring the transition probabilities between LLM-generated states, we statistically discover a detailed balance in LLM-generated transitions, indicating that LLM generation may not be achieved by generally learning rule sets and strategies, but rather by implicitly learning a class of underlying potential functions that may transcend different LLM architectures and prompt templates. To our knowledge, this is the first discovery of a macroscopic physical law in LLM generative dynamics that does not depend on specific model details. This work is an attempt to establish a macroscopic dynamics theory of complex AI systems, aiming to elevate the study of AI agents from a collection of engineering practices to a science built on effective measurements that are predictable and quantifiable.
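For readers unfamiliar with the term, detailed balance means that a stationary distribution pi and transition probabilities P satisfy pi_i * P[i][j] = pi_j * P[j][i] for every pair of states. The sketch below checks that condition numerically on a toy chain. The function names, the power-iteration estimate of pi, and the Metropolis-style toy chain built from a potential V are illustrative assumptions; the abstract does not describe the authors' estimation procedure at this level.

```python
import math


def stationary_distribution(P, iters=2000):
    """Estimate the stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def detailed_balance_violation(P):
    """Largest |pi_i * P[i][j] - pi_j * P[j][i]| over all state pairs (0 means exact balance)."""
    pi = stationary_distribution(P)
    n = len(P)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(n) for j in range(n))


def metropolis_chain(V):
    """Toy reversible chain whose stationary distribution is proportional to exp(-V)."""
    n = len(V)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Propose j uniformly, accept with the Metropolis probability.
                P[i][j] = min(1.0, math.exp(V[i] - V[j])) / n
        P[i][i] = 1.0 - sum(P[i])
    return P


reversible = metropolis_chain([0.0, 1.0, 2.5])
cyclic = [[0.1, 0.8, 0.1],
          [0.1, 0.1, 0.8],
          [0.8, 0.1, 0.1]]  # probability flows around a loop

balanced_gap = detailed_balance_violation(reversible)
cyclic_gap = detailed_balance_violation(cyclic)
```

The chain derived from a potential satisfies the condition up to floating-point error, while the cyclic chain, which has a net circulating probability current, does not.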

Title: Independent Density Estimation

Authors: Jiahao Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.10067
Pdf URL: https://arxiv.org/pdf/2512.10067
Copy Paste: [[2512.10067]] Independent Density Estimation(https://arxiv.org/abs/2512.10067)
Keywords: generation
Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.

Title: Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences

Authors: Sarwan Ali, Taslim Murad
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2512.10147
Pdf URL: https://arxiv.org/pdf/2512.10147
Copy Paste: [[2512.10147]] Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences(https://arxiv.org/abs/2512.10147)
Keywords: generation
Abstract: Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4% classification accuracy while reducing embedding generation time by as much as 99.81%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
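To give a feel for how hashing can turn an unaligned sequence into a fixed-length vector, here is a minimal feature-hashing sketch over k-mers. Several things are assumptions made for illustration: the use of k-mers at all, the values of `k` and `dim`, the function name, and the hash itself. The paper's title points to MurmurHash, which is not in the Python standard library, so `zlib.crc32` is swapped in here as a stand-in.

```python
import zlib


def hashed_kmer_embedding(sequence, k=3, dim=64):
    """Map a sequence to a `dim`-dimensional vector of normalized hashed k-mer counts.

    zlib.crc32 stands in for MurmurHash (an assumption); any fast,
    deterministic hash could be used in its place.
    """
    vec = [0.0] * dim
    n_kmers = len(sequence) - k + 1
    if n_kmers <= 0:
        return vec  # sequence shorter than k: no k-mers to count
    for i in range(n_kmers):
        kmer = sequence[i:i + k]
        bucket = zlib.crc32(kmer.encode("ascii")) % dim
        vec[bucket] += 1.0
    # Normalize so sequences of different lengths are comparable.
    return [v / n_kmers for v in vec]


# Hypothetical amino-acid fragment, used only to exercise the function.
fragment = "MFVFLVLLPLVSSQCVNLT"
embedding = hashed_kmer_embedding(fragment)
```

No alignment step is needed and the output size is fixed by `dim` regardless of sequence length, which is what makes this family of methods cheap to compute. The resulting vectors can be passed to any standard classifier.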

Title: CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation

Authors: Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.10178
Pdf URL: https://arxiv.org/pdf/2512.10178
Copy Paste: [[2512.10178]] CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation(https://arxiv.org/abs/2512.10178)
Keywords: generation
Abstract: In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.

Title: RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Authors: Zhuo Wang, Xiliang Liu, Ligang Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10248
Pdf URL: https://arxiv.org/pdf/2512.10248
Copy Paste: [[2512.10248]] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection(https://arxiv.org/abs/2512.10248)
Keywords: generative
Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

Title: MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Authors: Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.10284
Pdf URL: https://arxiv.org/pdf/2512.10284
Copy Paste: [[2512.10284]] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing(https://arxiv.org/abs/2512.10284)
Keywords: generative
Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.

Title: ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Authors: Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10286
Pdf URL: https://arxiv.org/pdf/2512.10286
Copy Paste: [[2512.10286]] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions(https://arxiv.org/abs/2512.10286)
Keywords: generation
Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.

Title: Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings

Authors: Karthikeya KV, Narendra Bandaru
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10293
Pdf URL: https://arxiv.org/pdf/2512.10293
Copy Paste: [[2512.10293]] Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings(https://arxiv.org/abs/2512.10293)
Keywords: generation
Abstract: We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT intensity driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.

Title: Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset

Authors: Hyunsoo Lee, Daeum Jeon, Hyeokjae Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10321
Pdf URL: https://arxiv.org/pdf/2512.10321
Copy Paste: [[2512.10321]] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset(https://arxiv.org/abs/2512.10321)
Keywords: generative
Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.

Title: A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images

Authors: Yi Liu, Yichi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10334
Pdf URL: https://arxiv.org/pdf/2512.10334
Copy Paste: [[2512.10334]] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images(https://arxiv.org/abs/2512.10334)
Keywords: generative
Abstract: Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures is a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structure similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.

Title: Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution

Authors: Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu, Ching-Chun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10340
Pdf URL: https://arxiv.org/pdf/2512.10340
Copy Paste: [[2512.10340]] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution(https://arxiv.org/abs/2512.10340)
Keywords: restoration, super-resolution, generative
Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrated it into diffusion models via classifier-free guidance (CFG) and proposed classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
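As background for the classifier-free guidance (CFG) mentioned in the abstract above, the sketch below shows the standard CFG combination of an unconditional and a conditional noise prediction: eps = eps_uncond + w * (eps_cond - eps_uncond). The function name and the plain-list representation are illustrative assumptions. The proposed projection variant (CFPG) is not reproduced because the abstract does not define it.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG: move from the unconditional prediction toward
    (and, for scale > 1, past) the conditional prediction.

    eps_uncond, eps_cond: noise predictions as equal-length lists of floats.
    guidance_scale:       0 returns the unconditional prediction,
                          1 returns the conditional one.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" to show the extrapolation.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

In a real sampler the two predictions come from the same denoiser, called once with the conditioning embedding and once with an empty or null embedding, and the combined prediction is passed to the usual update step.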

Title: Topology-Agnostic Animal Motion Generation from Text Prompt

Authors: Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10352
Pdf URL: https://arxiv.org/pdf/2512.10352
Copy Paste: [[2512.10352]] Topology-Agnostic Animal Motion Generation from Text Prompt(https://arxiv.org/abs/2512.10352)
Keywords: generation, generative
Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods: the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.

Title: Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views

Authors: Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang, Wangbo Yu, Li Yuan, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10369
Pdf URL: https://arxiv.org/pdf/2512.10369
Copy Paste: [[2512.10369]] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views(https://arxiv.org/abs/2512.10369)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using only 3, 6, or 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at this https URL.

Title: RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds

Authors: Jingyun Fu, Zhiyu Xiang, Na Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10376
Pdf URL: https://arxiv.org/pdf/2512.10376
Copy Paste: [[2512.10376]] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds(https://arxiv.org/abs/2512.10376)
Keywords: generation
Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.

Title: Disentangled and Distilled Encoder for Out-of-Distribution Reasoning with Rademacher Guarantees

Authors: Zahra Rahiminasab, Michael Yuhas, Arvind Easwaran
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.10522
Pdf URL: https://arxiv.org/pdf/2512.10522
Copy Paste: [[2512.10522]] Disentangled and Distilled Encoder for Out-of-Distribution Reasoning with Rademacher Guarantees(https://arxiv.org/abs/2512.10522)
Keywords: generative
Abstract: Recently, the disentangled latent space of a variational autoencoder (VAE) has been used to reason about multi-label out-of-distribution (OOD) test samples that are derived from different distributions than training samples. Disentangled latent space means having one-to-many maps between latent dimensions and generative factors or important characteristics of an image. This paper proposes a disentangled distilled encoder (DDE) framework to decrease the OOD reasoner size for deployment on resource-constrained devices while preserving disentanglement. DDE formalizes student-teacher distillation for model compression as a constrained optimization problem while preserving disentanglement with disentanglement constraints. Theoretical guarantees for disentanglement during distillation based on Rademacher complexity are established. The approach is evaluated empirically by deploying the compressed model on an NVIDIA

Title: Mode-Seeking for Inverse Problems with Diffusion Models

Authors: Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.10524
Pdf URL: https://arxiv.org/pdf/2512.10524
Copy Paste: [[2512.10524]] Mode-Seeking for Inverse Problems with Diffusion Models(https://arxiv.org/abs/2512.10524)
Keywords: restoration
Abstract: A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.
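For readers unfamiliar with the MAP terminology in this abstract, here is a minimal sketch of a MAP estimate for a linear inverse problem y = A x + noise. It assumes a zero-mean isotropic Gaussian prior in place of a diffusion prior, so it is not VML or VML-MAP, only the textbook closed form that those methods generalize. The names `map_estimate`, `sigma`, and `tau` are illustrative and do not come from the paper.

```python
import numpy as np

def map_estimate(A, y, sigma, tau):
    """MAP estimate of x for y = A x + n, with n ~ N(0, sigma^2 I), x ~ N(0, tau^2 I).

    Minimizes ||y - A x||^2 / (2 sigma^2) + ||x||^2 / (2 tau^2),
    whose minimizer solves a regularized normal-equation system.
    """
    n = A.shape[1]
    lhs = A.T @ A / sigma**2 + np.eye(n) / tau**2
    rhs = A.T @ y / sigma**2
    return np.linalg.solve(lhs, rhs)

# Toy problem (assumed sizes): 20 noisy measurements of a 5-dimensional signal.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
y = A @ x_true + 0.01 * rng.normal(size=20)
x_map = map_estimate(A, y, sigma=0.01, tau=1.0)
```

With a learned diffusion prior no such closed form is available, which is the setting the abstract addresses.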

Title: Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Authors: Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10548
Pdf URL: https://arxiv.org/pdf/2512.10548
Copy Paste: [[2512.10548]] Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding(https://arxiv.org/abs/2512.10548)
Keywords: super-resolution
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
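The saliency-guided scanning step can be illustrated with a small sketch. The abstract only says that saliency is estimated from the attention map; averaging the attention each visual token receives over heads and queries, and keeping the top k tokens, is an assumption made here for illustration. `select_salient_tokens` is a hypothetical helper, not part of Blink.

```python
import numpy as np

def select_salient_tokens(attn, k):
    """Pick the k visual tokens that receive the most attention.

    attn: array of shape (heads, queries, visual_tokens) with attention weights.
    Returns (sorted indices of the k most salient tokens, per-token saliency).
    """
    saliency = attn.mean(axis=(0, 1))            # (visual_tokens,)
    top = np.argsort(saliency)[::-1][:k]         # indices of the k largest scores
    return np.sort(top), saliency

# Toy attention map: 2 heads, 3 queries, 5 visual tokens; each row sums to 1.
attn = np.full((2, 3, 5), 0.1)
attn[..., 1] = 0.4
attn[..., 3] = 0.3
idx, saliency = select_salient_tokens(attn, k=2)
```

In this toy input, tokens 1 and 3 carry the most attention mass and are the ones selected.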

Title: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

Authors: Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10571
Pdf URL: https://arxiv.org/pdf/2512.10571
Copy Paste: [[2512.10571]] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner(https://arxiv.org/abs/2512.10571)
Keywords: generation
Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: this https URL.

Title: Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration

Authors: Wenlong Jiao, Heyang Lee, Ping Wang, Pengfei Zhu, Qinghua Hu, Dongwei Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10581
Pdf URL: https://arxiv.org/pdf/2512.10581
Copy Paste: [[2512.10581]] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration(https://arxiv.org/abs/2512.10581)
Keywords: restoration
Abstract: All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at this https URL.

Title: Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces

Authors: Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10617
Pdf URL: https://arxiv.org/pdf/2512.10617
Copy Paste: [[2512.10617]] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces(https://arxiv.org/abs/2512.10617)
Keywords: generation
Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
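The ADE figures quoted in the abstract can be related to the stated percentage range with a short worked example. The sketch assumes the usual definition of average displacement error (mean Euclidean distance between predicted and reference points); the paper's exact normalization and units are not given in the abstract. Both function names are illustrative.

```python
import numpy as np

def average_displacement_error(pred, target):
    """Mean Euclidean distance between matching points; inputs have shape (T, 2)."""
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))

def relative_reduction(ours, baseline):
    """Fractional error reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline

# Toy trajectory: per-point distances are 0, 1, and 2, so the ADE is 1.0.
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
target = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
print(average_displacement_error(pred, target))

# Relative reduction implied by the rounded ADE values in the abstract.
for baseline in (18.3, 25.3):
    print(f"{relative_reduction(12.4, baseline):.1%}")
```

With the rounded values from the abstract this gives about 32% and 51%, close to the stated 33-52%. The small gap is presumably due to rounding of the reported ADE values.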

Title: DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

Authors: Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10619
Pdf URL: https://arxiv.org/pdf/2512.10619
Copy Paste: [[2512.10619]] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM(https://arxiv.org/abs/2512.10619)
Keywords: quality assessment
Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: this https URL.

Title: TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Authors: Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2512.10652
Pdf URL: https://arxiv.org/pdf/2512.10652
Copy Paste: [[2512.10652]] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection(https://arxiv.org/abs/2512.10652)
Keywords: generative
Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

Title: Learning by Analogy: A Causal Framework for Composition Generalization

Authors: Lingjing Kong, Shaoan Xie, Yang Jiao, Yetian Chen, Yanhui Guo, Simone Shao, Yan Gao, Guangyi Chen, Kun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.10669
Pdf URL: https://arxiv.org/pdf/2512.10669
Copy Paste: [[2512.10669]] Learning by Analogy: A Causal Framework for Composition Generalization(https://arxiv.org/abs/2512.10669)
Keywords: generative
Abstract: Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences. While effective, the data structures and principles that enable this crucial capability remain poorly understood. We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice. In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes. We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms. Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects. Critically, we also prove that this latent hierarchical structure is recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.

Title: CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images

Authors: Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, Enzo Ferrante
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10715
Pdf URL: https://arxiv.org/pdf/2512.10715
Copy Paste: [[2512.10715]] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images(https://arxiv.org/abs/2512.10715)
Keywords: generative
Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (this http URL), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at this http URL and the source code at this http URL.
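The predictive-uncertainty measure described in this abstract (spread across multiple stochastic predictions decoded from latent samples) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the latent is taken to be a diagonal Gaussian, the decoder is a hypothetical linear map standing in for the graph-based decoder, and the per-node score is the root of the summed x/y variance.

```python
import numpy as np

def predictive_uncertainty(mu, log_var, decode, n_samples=100, seed=0):
    """Per-node spread of landmark predictions decoded from latent samples.

    mu, log_var: (d,) parameters of a diagonal Gaussian latent distribution.
    decode: function mapping a (d,) latent vector to (n_nodes, 2) landmarks.
    Returns (mean landmarks of shape (n_nodes, 2), per-node std of shape (n_nodes,)).
    """
    rng = np.random.default_rng(seed)
    std = np.exp(0.5 * log_var)
    samples = np.stack([decode(mu + std * rng.normal(size=mu.shape))
                        for _ in range(n_samples)])       # (S, n_nodes, 2)
    mean = samples.mean(axis=0)
    # Per-node uncertainty: root of the summed x and y variance across samples.
    node_std = np.sqrt(samples.var(axis=0).sum(axis=-1))
    return mean, node_std

# Hypothetical linear decoder with 4 landmark nodes and a 3-dimensional latent.
W = np.random.default_rng(1).normal(size=(4 * 2, 3))
decode = lambda z: (W @ z).reshape(4, 2)

mu = np.zeros(3)
_, low = predictive_uncertainty(mu, np.full(3, -6.0), decode)   # narrow latent
_, high = predictive_uncertainty(mu, np.full(3, 0.0), decode)   # wide latent
```

A wider latent distribution yields larger per-node spread. The latent uncertainty mentioned in the abstract would instead be read directly from the distribution parameters, for example from `log_var`.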

Title: Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality

Authors: Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.10720
Pdf URL: https://arxiv.org/pdf/2512.10720
Copy Paste: [[2512.10720]] Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality(https://arxiv.org/abs/2512.10720)
Keywords: generation, generative
Abstract: Deep generative models, while revolutionizing fields like image and text generation, largely operate as opaque black boxes, hindering human understanding, control, and alignment. While methods like sparse autoencoders (SAEs) show remarkable empirical success, they often lack theoretical guarantees, risking subjective insights. Our primary objective is to establish a principled foundation for interpretable generative models. We demonstrate that the principle of causal minimality -- favoring the simplest causal explanation -- can endow the latent representations of diffusion vision and autoregressive language models with clear causal interpretation and robust, component-wise identifiable control. We introduce a novel theoretical framework for hierarchical selection models, where higher-level concepts emerge from the constrained composition of lower-level variables, better capturing the complex dependencies in data generation. Under theoretically derived minimality conditions (manifesting as sparsity or compression constraints), we show that learned representations can be equivalent to the true latent variables of the data-generating process. Empirically, applying these constraints to leading generative models allows us to extract their innate hierarchical concept graphs, offering fresh insights into their internal knowledge organization. Furthermore, these causally grounded concepts serve as levers for fine-grained model steering, paving the way for transparent, reliable systems.

Title: IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation

Authors: Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng, Jian-Fang Hu, Xihan Wei, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10730
Pdf URL: https://arxiv.org/pdf/2512.10730
Copy Paste: [[2512.10730]] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation(https://arxiv.org/abs/2512.10730)
Keywords: generation
Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: this https URL.
摘要：运动感知大型语言模型的最新进展在统一运动理解和生成任务方面显示出了巨大的前景。然而，这些模型通常将理解和生成分开处理，限制了任务之间的交互反馈可能产生的互惠互利。在这项工作中，我们揭示了运动评估和细化任务是实现理解和生成之间双向知识流动的关键桥梁。利用这一见解，我们提出了运动生成交错推理（IRMoGen），这是一种新颖的范式，通过迭代文本运动对话将运动生成与评估和细化紧密结合起来。为了实现这一目标，我们引入了 IRG-MotionLLM，这是第一个无缝交错运动生成、评估和细化以提高生成性能的模型。 IRG-MotionLLM 采用新颖的三阶段训练方案逐步开发，初始化并随后增强本地 IRMoGen 功能。为了促进这一发展，我们构建了一个自动化数据引擎来从现有的文本运动数据集中合成交错推理注释。大量实验表明：（i）评估和细化任务显着改善了文本与动作的对齐； (ii) 交错的运动生成、评估和细化步骤可在整个训练阶段产生一致的性能增益； (iii) IRG-MotionLLM 明显优于基线模型，并在标准文本到运动生成基准上实现了先进的性能。交叉评估者测试进一步验证了其有效性。代码和数据：此 https URL。

Title: LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation

Authors: Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10750
Pdf URL: https://arxiv.org/pdf/2512.10750
Copy Paste: [[2512.10750]] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation(https://arxiv.org/abs/2512.10750)
Keywords: generation
Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), while reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
摘要：结肠镜息肉诊断对于早期结直肠癌检测至关重要，但由于缺乏高质量的多模式医疗数据，传统的自动化报告存在不一致和幻觉的问题。为了弥补这一差距，我们提出了 LDP，这是一种利用多模态大语言模型 (MLLM) 来生成专业息肉诊断报告的新颖框架。具体来说，我们策划了 MMEndo，一个多模式内窥镜数据集，包含专家注释的结肠镜检查图像-文本对。我们使用参数高效微调 (LoRA) 微调 Qwen2-VL-7B 主干，并通过直接偏好优化 (DPO) 将其与临床标准保持一致。大量实验表明，我们的 LDP 在自动化指标和严格的临床专家评估方面均优于现有基线（达到 7.2/10 的医师评分），与完全微调相比，训练计算成本显着降低了 833 倍。所提出的解决方案为初级医疗保健提供了一条可扩展的、临床上可行的路径，并对 IU-XRay 数据集进行了额外验证，确认了其稳健性。
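
As a rough illustration of why LoRA-style fine-tuning is cheap, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a rank-r low-rank update. The layer shape (4096 x 4096) and rank (8) are hypothetical values chosen for illustration; they are not taken from the abstract, and this count is not meant to reproduce the 833x compute figure reported there.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, for illustration only.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
ratio = full / lora

print(f"full: {full}, lora: {lora}, reduction: {ratio:.0f}x")
```

The reduction grows as the rank shrinks relative to the layer width, which is the basic trade-off behind parameter-efficient fine-tuning.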

Title: Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography

Authors: Rene Lisasi, Michele Esposito, Chen Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10765
Pdf URL: https://arxiv.org/pdf/2512.10765
Copy Paste: [[2512.10765]] Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography(https://arxiv.org/abs/2512.10765)
Keywords: generation
Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an $R^2$ of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
摘要：基于计算流体动力学 (CFD) 的冠状动脉血流模拟为诊断冠状动脉疾病 (CAD) 提供了有价值的血流动力学标记，例如压力梯度。然而，CFD 计算成本高、耗时且难以集成到大规模临床工作流程中。这些限制限制了用于训练 AI 模型的标记血流动力学数据的可用性，并阻碍了非侵入性、基于生理学的 CAD 评估的广泛采用。为了应对这些挑战，我们开发了一种端到端管道，可以自动从冠状动脉计算机断层扫描血管造影（CCTA）中提取冠状动脉几何形状，简化模拟数据生成，并能够有效学习冠状动脉血压分布。该管道减少了与传统 CFD 工作流程相关的手动负担，同时生成一致的训练数据。我们进一步引入了一种基于扩散的回归模型，旨在直接从 CCTA 导出的特征预测冠状动脉血压，从而绕过推理过程中缓慢的 CFD 计算的需要。在模拟冠状动脉血流动力学数据集上进行评估，所提出的模型实现了最先进的性能，R2 为 64.42%，均方根误差为 0.0974，归一化 RMSE 为 0.154，优于几种基线方法。这项工作为快速、无创血压预测提供了一个可扩展且可访问的框架，以支持 CAD 诊断。
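
The three regression metrics quoted in the abstract above can be sketched as follows. This is a minimal version using the standard definitions of R^2 and RMSE; the abstract does not say how the RMSE is normalized, so dividing by the range of the targets is an assumption made here (dividing by the mean or standard deviation is also common). The sample values are made up for illustration.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (as a fraction, not a percent)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def nrmse(y_true, y_pred):
    """RMSE divided by the target range (ASSUMPTION: the paper's normalizer is not stated)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))


# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

print(r2_score(y_true, y_pred), rmse(y_true, y_pred), nrmse(y_true, y_pred))
```

An R^2 reported as 64.42% corresponds to 0.6442 on the fractional scale returned by `r2_score`.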

Title: What matters for Representation Alignment: Global Information or Spatial Structure?

Authors: Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.10794
Pdf URL: https://arxiv.org/pdf/2512.10794
Copy Paste: [[2512.10794]] What matters for Representation Alignment: Global Information or Spatial Structure?(https://arxiv.org/abs/2512.10794)
Keywords: generation, generative
Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at this https URL
摘要：表示对齐 (REPA) 通过将表示从强大的预训练视觉编码器提炼为中间扩散特征来指导生成训练。我们研究一个基本问题：目标表示的哪些方面对于生成至关重要，它的全局语义信息（例如，通过 ImageNet-1K 精度测量）或其空间结构（即补丁标记之间的成对余弦相似度）？普遍的观点认为，更强的全局语义性能可以更好地生成目标表示。为了研究这一点，我们首先对 27 种不同的视觉编码器和不同的模型尺度进行了大规模的实证分析。结果令人惊讶；空间结构而不是全局性能驱动目标表示的生成性能。为了进一步研究这一点，我们引入了两个简单的修改，它们特别强调了空间信息的传输。我们用简单的卷积层替换 REPA 中的标准 MLP 投影层，并为外部表示引入空间归一化层。令人惊讶的是，我们的简单方法（用 <4 行代码实现），称为 iREPA，在不同的视觉编码器、模型大小和训练变体（例如 REPA、REPA-E、Meanflow、JiT 等）中持续提高了 REPA 的收敛速度。我们的工作促使人们重新审视表征对齐的基本工作机制，以及如何利用它来改进生成模型的训练。代码和项目页面可在此 https URL 获取
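
The abstract above characterizes spatial structure as the pairwise cosine similarity between patch tokens. A minimal sketch of that quantity is given below, assuming each patch token is a plain list of floats; the function name, the `eps` guard for zero-norm tokens, and the toy tokens are illustrative choices, not the authors' implementation.

```python
import math


def pairwise_cosine(tokens, eps=1e-12):
    """Return the N x N matrix of cosine similarities between N patch tokens.

    `tokens` is a list of equal-length float vectors. `eps` guards against
    division by zero for all-zero tokens (an illustrative choice).
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    n = len(tokens)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            sim[i][j] = dot / max(norms[i] * norms[j], eps)
    return sim


# Toy tokens: the first and third point the same way, the second is orthogonal.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
sim = pairwise_cosine(toy_tokens)
print(sim)
```

The resulting matrix depends only on how tokens relate to one another, not on their overall scale, which is what makes it a description of structure rather than of global content.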

Title: Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Authors: Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.10805
Pdf URL: https://arxiv.org/pdf/2512.10805
Copy Paste: [[2512.10805]] Interpretable and Steerable Concept Bottleneck Sparse Autoencoders(https://arxiv.org/abs/2512.10805)
Keywords: generation
Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.
摘要：稀疏自动编码器 (SAE) 为 LLM 和 LVLM 中的机械解释、概念发现和模型引导提供了一种统一的方法。然而，要实现这种潜力，需要学习到的特征既可解释又可操纵。为此，我们引入了两种新的计算成本低廉的可解释性和可操纵性指标，并对 LVLM 进行了系统分析。我们的分析揭示了两个观察结果： (i) 大多数 SAE 神经元要么表现出低可解释性，要么表现出低可操纵性，或者两者兼而有之，导致它们对于下游使用无效； (ii) 由于 SAE 的无监督性质，学习的词典中通常不存在用户所需的概念，从而限制了它们的实际用途。为了解决这些限制，我们提出了概念瓶颈稀疏自动编码器（CB-SAE）——一种新颖的事后框架，它可以修剪低效用神经元，并通过与用户定义的概念集对齐的轻量级概念瓶颈来增强潜在空间。由此产生的 CB-SAE 将 LVLM 和图像生成任务的可解释性提高了 32.1%，可操纵性提高了 14.5%。我们将提供我们的代码和模型权重。

Title: Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants

Authors: Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, Joan Bruna
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.10857
Pdf URL: https://arxiv.org/pdf/2512.10857
Copy Paste: [[2512.10857]] Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants(https://arxiv.org/abs/2512.10857)
Keywords: generative
Abstract: Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black-box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) is highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.
摘要：基于传输的方法已成为从大型、干净的数据集构建生成模型的领先范例。然而，在许多科学和工程领域，通常无法获得干净的数据：相反，我们只能观察到因噪声、病态通道而损坏的测量结果。因此，原始数据的生成模型需要解决分布级别的逆问题。在这项工作中，我们引入了一种基于随机插值的新方法来完成此任务：我们仅使用对损坏数据集的访问以及对损坏通道的黑盒访问来迭代更新损坏数据样本和干净数据样本之间的传输映射。在适当的条件下，这个迭代过程会收敛到一个自洽的传输图，该图可以有效地反转损坏通道，从而为干净数据提供生成模型。我们将所得方法称为自洽随机插值 (SCSI)。它（i）与变分替代方案相比计算效率高，（ii）高度灵活，仅通过黑盒访问即可处理任意非线性前向模型，并且（iii）享有理论保证。我们在自然图像处理和科学重建的逆问题上展示了优越的性能，并在适当的假设下建立了该方案的收敛保证。

Title: SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Authors: Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10860
Pdf URL: https://arxiv.org/pdf/2512.10860
Copy Paste: [[2512.10860]] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation(https://arxiv.org/abs/2512.10860)
Keywords: generation
Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: this https URL
摘要：尽管 4D 内容生成取得了重大进展，但将单目视频转换为具有显式 4D 网格的高质量动画 3D 资产仍然相当具有挑战性。大规模、自然捕获的 4D 网格数据集的稀缺进一步限制了以纯粹数据驱动的方式从头开始训练可推广视频到 4D 模型的能力。与此同时，在广泛数据集的支持下，图像到 3D 生成的进步提供了可以利用的强大的先验模型。为了更好地利用这些先验，同时最大限度地减少对 4D 监督的依赖，我们引入了 SWiT-4D，这是一种用于无损、无参数时间 4D 网格生成的滑动窗口变换器。 SWiT-4D 与任何基于扩散变压器 (DiT) 的图像到 3D 生成器无缝集成，在视频帧中添加时空建模，同时保留原始的单图像前向过程，从而能够从任意长度的视频进行 4D 网格重建。为了恢复全局翻译，我们进一步引入了针对静态相机单目视频定制的基于优化的轨迹模块。 SWiT-4D 展示了强大的数据效率：仅用单个短（<10s）视频进行微调，即可实现高保真几何和稳定的时间一致性，表明在极其有限的 4D 监督下具有实际可部署性。对域内动物园测试集和具有挑战性的域外基准（C4D、Objaverse 和野外视频）的综合实验表明，SWiT-4D 在时间平滑度方面始终优于现有基线。项目页面：此 https URL

Title: DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10894
Pdf URL: https://arxiv.org/pdf/2512.10894
Copy Paste: [[2512.10894]] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance(https://arxiv.org/abs/2512.10894)
Keywords: generation
Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
摘要：最近基于视觉语言模型 (VLM) 的方法在 SVG 生成方面取得了令人印象深刻的成果。然而，由于它们在解码过程中仅生成文本并且缺乏视觉信号，因此它们经常难以处理复杂的语义，并且无法生成视觉上吸引人或几何上连贯的 SVG。我们引入 DuetSVG，一种统一的多模态模型，以端到端的方式联合生成图像标记和相应的 SVG 标记。 DuetSVG 在图像和 SVG 数据集上进行训练。在推理时，我们应用了一种新颖的测试时间缩放策略，该策略利用模型的本机视觉预测作为提高 SVG 解码质量的指导。大量实验表明，我们的方法优于现有方法，可以在各种应用程序中生成视觉上忠实、语义一致且语法干净的 SVG。

Title: Stronger Normalization-Free Transformers

Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.10938
Pdf URL: https://arxiv.org/pdf/2512.10938
Copy Paste: [[2512.10938]] Stronger Normalization-Free Transformers(https://arxiv.org/abs/2512.10938)
Keywords: generation
Abstract: Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
摘要：尽管归一化层长期以来一直被视为深度学习架构不可或缺的组成部分，但最近推出的 Dynamic Tanh (DyT) 证明了替代方案是可能的。逐点函数 DyT 约束极值以实现稳定收敛并达到归一化级别的性能；这项工作进一步寻求可以超越它的功能设计。我们首先研究逐点函数的内在属性如何影响训练和性能。基于这些发现，我们对更有效的功能设计进行了大规模的搜索。通过这一探索，我们引入了 $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$，其中 $\mathrm{erf}(x)$ 是重新缩放的高斯累积分布函数，并将其确定为性能最佳的设计。 Derf 在多个领域都优于 LayerNorm、RMSNorm 和 DyT，包括视觉（图像识别和生成）、语音表示和 DNA 序列建模。我们的研究结果表明，Derf 的性能提升很大程度上源于其泛化能力的提高，而不是更强的拟合能力。其简单性和更强的性能使 Derf 成为无归一化 Transformer 架构的实用选择。
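
The formula in the abstract above, Derf(x) = erf(alpha * x + s), is simple enough to sketch directly. The version below uses only the Python standard library and treats alpha and s as plain floats with default values chosen for illustration; how they are parameterized or trained in the paper is not stated in the abstract. The helper `erf_via_gaussian_cdf` is a hypothetical name added here to show the "rescaled Gaussian CDF" relationship, erf(x) = 2 * Phi(x * sqrt(2)) - 1.

```python
import math
from statistics import NormalDist


def derf(x: float, alpha: float = 1.0, s: float = 0.0) -> float:
    """Point-wise Derf(x) = erf(alpha * x + s). Output is bounded in [-1, 1]."""
    return math.erf(alpha * x + s)


def derf_layer(values, alpha: float = 1.0, s: float = 0.0):
    """Apply Derf element-wise to a sequence of activations."""
    return [derf(v, alpha, s) for v in values]


def erf_via_gaussian_cdf(x: float) -> float:
    """erf expressed through the standard normal CDF: 2 * Phi(x * sqrt(2)) - 1."""
    return 2.0 * NormalDist().cdf(x * math.sqrt(2.0)) - 1.0


# Extreme inputs are squashed into [-1, 1], much like tanh in DyT.
activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
squashed = derf_layer(activations, alpha=0.5, s=0.0)
print(squashed)
```

Because the function is point-wise, it needs no batch or token statistics, which is the property that lets it stand in for a normalization layer.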

Title: GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Authors: Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10939
Pdf URL: https://arxiv.org/pdf/2512.10939
Copy Paste: [[2512.10939]] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting(https://arxiv.org/abs/2512.10939)
Keywords: generation
Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
摘要：最近出现了语音驱动的说话头像，可以实现交互式化身。然而，现实世界的应用是有限的，因为当前的方法实现了高视觉保真度，但缓慢或快速但暂时不稳定。扩散方法可生成逼真的图像，但难以处理单次设置。高斯泼溅方法是实时的，但面部跟踪的不准确或高斯映射不一致会导致不稳定的输出和视频伪影，这对实际用例不利。我们通过使用 3D 可变形模型映射高斯分布来生成特定于人的化身来解决这个问题。我们直接从音频引入基于变压器的模型参数预测，以驱动时间一致性。通过单眼视频和独立音频语音输入，我们的方法可以生成实时头部说话视频，并在其中报告有竞争力的定量和定性性能。

Title: OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Authors: Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10940
Pdf URL: https://arxiv.org/pdf/2512.10940
Copy Paste: [[2512.10940]] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis(https://arxiv.org/abs/2512.10940)
Keywords: generation
Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at this https URL
摘要：先前将相机控制注入扩散模型的方法主要关注 4D 一致性任务的特定子集：新颖的视图合成、带有相机控制的文本到视频、图像到视频等。因此，这些碎片化方法是在可用 3D/4D 数据的不相交切片上进行训练的。我们引入了 OmniView，这是一个统一的框架，可概括广泛的 4D 一致性任务。我们的方法分别表示空间、时间和视图条件，从而实现这些输入的灵活组合。例如，OmniView 可以从静态、动态和多视图输入合成新颖的视图，及时向前和向后推断轨迹，并通过完全摄像头控制根据文本或图像提示创建视频。 OmniView 与跨不同基准和指标的特定任务模型具有竞争力，在多视图 NVS LLFF 数据集中，相机条件扩散模型的图像质量分数提高了 33%，在动态 NVS 神经 3D 视频基准中提高了 60%，在 RE-10K 上的静态相机控制中提高了 20%，并且在文本条件视频生成中将相机轨迹误差减少了 4 倍。 OmniView 在一种模型中具有很强的通用性，展示了通用 4D 视频模型的可行性。项目页面可在此 https URL 获取

Title: Mull-Tokens: Modality-Agnostic Latent Thinking

Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10941
Pdf URL: https://arxiv.org/pdf/2512.10941
Copy Paste: [[2512.10941]] Mull-Tokens: Modality-Agnostic Latent Thinking(https://arxiv.org/abs/2512.10941)
Keywords: generation
Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle-solving, reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple solution for thinking abstractly in multiple modalities.
摘要：推理超越语言；现实世界需要对空间、时间、可供性以及更多仅靠语言无法表达的事物进行推理。现有的探索图像推理潜力的多模态模型很脆弱并且无法扩展。他们依靠调用专业工具、昂贵的图像生成或手工推理数据来在文本和图像思想之间切换。相反，我们提供了一种更简单的替代方案——Mull-Tokens——与模态无关的潜在标记，经过预先训练，可以以图像或文本模态保存中间信息，让模型以自由形式思考正确答案。我们研究了受潜在推理框架启发的训练 Mull-Tokens 的最佳实践。我们首先使用来自交错文本图像轨迹的监督来训练 Mull-Tokens，然后仅使用最终答案在没有任何监督的情况下进行微调。在涉及解决谜题和采取不同视角等任务的四个具有挑战性的空间推理基准中，我们证明了 Mull-Tokens 利用纯文本推理或交错图像文本推理在多个基线上进行了改进，与我们最强的基线相比，平均提高了 +3%，在解谜推理重的分割上实现了 +16% 的提升。 Mull-Tokens 增加了围绕文本和视觉推理基础挑战的对话，提供了一种简单的解决方案，可以以多种模式进行抽象思考。

Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10942
Pdf URL: https://arxiv.org/pdf/2512.10942
Copy Paste: [[2512.10942]] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language(https://arxiv.org/abs/2512.10942)
Keywords: generation
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE and POPEv2), despite having only 1.6B parameters.
摘要：我们介绍 VL-JEPA，这是一种基于联合嵌入预测架构 (JEPA) 构建的视觉语言模型。 VL-JEPA 不是像经典 VLM 那样自回归生成标记，而是预测目标文本的连续嵌入。通过在抽象表示空间中学习，该模型专注于与任务相关的语义，同时抽象出表面级别的语言变异性。在与使用相同视觉编码器和训练数据的标准令牌空间 VLM 训练进行严格控制的比较中，VL-JEPA 实现了更强的性能，同时可训练参数减少了 50%。在推理时，仅在需要将 VL-JEPA 预测嵌入转换为文本时才调用轻量级文本解码器。我们表明，VL-JEPA 本身支持选择性解码，与非自适应统一解码相比，可将解码操作数量减少 2.85 倍，同时保持相似的性能。除了生成之外，VL-JEPA 的嵌入空间自然支持开放词汇分类、文本到视频检索和判别性 VQA，无需任何架构修改。在 8 个视频分类和 8 个视频检索数据集上，VL-JEPA 的平均性能超过了 CLIP、SigLIP2 和 Perception Encoder。同时，该模型在四个 VQA 数据集：GQA、TallyQA、POPE 和 POPEv2 上实现了与经典 VLM（InstructBLIP、QwenVL）相当的性能，尽管参数只有 1.6B。

Title: AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10943
Pdf URL: https://arxiv.org/pdf/2512.10943
Copy Paste: [[2512.10943]] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation(https://arxiv.org/abs/2512.10943)
Keywords: generation
Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at this https URL
摘要：使用大型扩散模型的主题驱动视频生成的最新进展使得基于用户提供的主题的个性化内容合成成为可能。然而，现有方法缺乏对主体出现和消失的细粒度时间控制，而这对于合成视频合成、故事板和可控动画等应用至关重要。我们提出了 AlcheMinT，这是一个统一的框架，为主题驱动的视频生成引入了显式时间戳调节。我们的方法引入了一种新颖的位置编码机制，该机制解锁了时间间隔的编码，在我们的例子中与主体身份相关联，同时与预训练的视频生成模型位置嵌入无缝集成。此外，我们还结合了主题描述性文本标记来加强视觉标识和视频字幕之间的绑定，从而减少生成过程中的歧义。通过 token-wise 连接，AlcheMinT 避免了任何额外的交叉注意力模块，并且产生的参数开销可以忽略不计。我们建立了一个评估多主体身份保存、视频保真度和时间依从性的基准。实验结果表明，AlcheMinT 实现了与最先进的视频个性化方法相匹配的视觉质量，同时首次实现了对视频中多主体生成的精确时间控制。项目页面位于此 https URL

Title: MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10945
Pdf URL: https://arxiv.org/pdf/2512.10945
Copy Paste: [[2512.10945]] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation(https://arxiv.org/abs/2512.10945)
Keywords: generation
Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at this https URL
摘要：本文提出了一种用于参考运动表达视频分割的大规模多模态数据集，重点是基于对象运动的语言描述来分割和跟踪视频中的目标对象。现有的参考视频分割数据集通常关注显着对象并使用富含静态属性的语言表达，可能允许在单帧中识别目标对象。此类数据集低估了运动在视频和语言中的作用。为了探索使用运动表达式和运动推理线索进行像素级视频理解的可行性，我们引入了 MeViS，这是一个包含 33,072 个人工注释的文本和音频运动表达式的数据集，涵盖了复杂场景的 2,006 个视频中的 8,171 个对象。我们对 MeViS 支持的 4 个任务中的 15 种现有方法进行了基准测试，包括 6 种参考视频对象分割（RVOS）方法、3 种音频引导视频对象分割（AVOS）方法、2 种参考多对象跟踪（RMOT）方法以及用于新引入的参考运动表达生成（RMEG）任务的 4 种视频字幕方法。结果证明了现有方法在解决运动表达引导视频理解方面的弱点和局限性。我们进一步分析了挑战，并提出了一种用于 RVOS/AVOS/RMOT 的 LMPM++ 方法，该方法取得了新的最先进的结果。我们的数据集提供了一个平台，有助于在复杂视频场景中开发运动表达引导的视频理解算法。建议的 MeViS 数据集和方法的源代码可在此 https URL 公开获取

Title: ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Authors: Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10948
Pdf URL: https://arxiv.org/pdf/2512.10948
Copy Paste: [[2512.10948]] ClusIR: Towards Cluster-Guided All-in-One Image Restoration(https://arxiv.org/abs/2512.10948)
Keywords: restoration
Abstract: All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
摘要：一体化图像恢复（AiOIR）旨在在统一框架内从各种退化中恢复高质量图像。然而，现有的方法通常无法明确地模拟退化类型，并且难以使其恢复行为适应复杂或混合的退化。为了解决这些问题，我们提出了 ClusIR，这是一个集群引导的图像恢复框架，它通过可学习的集群显式地模拟退化语义，并跨空间和频率域传播集群感知线索以进行自适应恢复。具体来说，ClusIR 包括两个关键组件：概率集群引导路由机制 (PCGRM) 和降级感知频率调制模块 (DAFMM)。所提出的 PCGRM 将退化识别与专家激活分开，从而实现有区别的退化感知和稳定的专家路由。同时，DAFMM 利用集群引导先验来执行自适应频率分解和目标调制，协同细化结构和纹理表示，以实现更高的恢复保真度。集群引导的协同作用将语义线索与频域调制无缝连接起来，使 ClusIR 能够在各种退化情况下获得显着的恢复结果。对不同基准的大量实验验证了 ClusIR 在多种场景下达到了有竞争力的性能。

Title: Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.10949
Pdf URL: https://arxiv.org/pdf/2512.10949
Copy Paste: [[2512.10949]] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation(https://arxiv.org/abs/2512.10949)
Keywords: generation
Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which is expert at every stage from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at this https URL.
摘要：强化学习 (RL) 早先被证明在大型语言和多模态模型中有效，最近已成功扩展到增强 2D 图像生成。然而，由于 3D 对象的空间复杂性较高，需要全局一致的几何形状和细粒度的局部纹理，因此将 RL 应用于 3D 生成在很大程度上仍未得到探索。这使得 3D 生成对奖励设计和 RL 算法非常敏感。为了应对这些挑战，我们对跨多个维度的文本到 3D 自回归生成的强化学习进行了首次系统研究。 (1) 奖励设计：我们评估奖励维度和模型选择，表明与人类偏好的一致性至关重要，并且通用多模态模型为 3D 属性提供了稳健的信号。 (2) RL 算法：我们研究 GRPO 变体，强调 token 级优化的有效性，并进一步研究训练数据和迭代的扩展。（3）文本到3D基准：由于现有基准无法衡量3D生成模型中的隐式推理能力，因此我们引入了MME-3DR。 (4) 高级 RL 范式：受 3D 生成的自然层次结构的启发，我们提出了 Hi-GRPO，它通过专用奖励集成来优化全局到局部的层次化 3D 生成。基于这些见解，我们开发了 AR3D-R1，这是第一个 RL 增强的文本到 3D 模型，是从粗糙形状到纹理细化的专家。我们希望这项研究能够为 3D 生成的 RL 驱动推理提供见解。代码在此 https URL 发布。
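The abstract above studies GRPO variants. As a rough illustration only, the group-relative advantage at the core of GRPO can be sketched as below: rewards for several samples drawn from the same prompt are normalized by the group's own mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this sketch; they are not taken from the paper, and this is not Hi-GRPO.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for samples generated from ONE prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every sample in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three generations of the same text prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to form the baseline.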

Title: Bidirectional Normalizing Flow: From Data to Noise and Back

Authors: Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao, Kaiming He
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.10953
Pdf URL: https://arxiv.org/pdf/2512.10953
Copy Paste: [[2512.10953]] Bidirectional Normalizing Flow: From Data to Noise and Back(https://arxiv.org/abs/2512.10953)
Keywords: generation, generative
Abstract: Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.
摘要：标准化流（NF）已被确立为生成建模的原则框架。标准 NF 由正向过程和反向过程组成：正向过程将数据映射到噪声，而反向过程通过反转数据来生成样本。典型的 NF 正向变换受到显式可逆性的约束，确保逆过程可以作为其精确的解析逆。 TARFlow 及其变体的最新发展通过结合 Transformer 和自回归流重振了 NF 方法，但也暴露了因果解码是一个主要瓶颈。在这项工作中，我们引入了双向归一化流（$\textbf{BiFlow}$），这是一个消除了精确解析逆的需要的框架。 BiFlow 学习一个反向模型，该模型近似底层噪声到数据的逆映射，从而实现更灵活的损失函数和架构。 ImageNet 上的实验表明，与对应的因果解码相比，BiFlow 提高了生成质量，同时将采样速度提高了两个数量级。 BiFlow 在基于 NF 的方法中产生了最先进的结果，在单一评估（“1-NFE”）方法中产生了具有竞争力的性能。继最近在 NF 方面取得令人鼓舞的进展之后，我们希望我们的工作能够引起人们对这一经典范式的进一步关注。
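To make the contrast in the abstract above concrete, here is a deliberately tiny one-dimensional sketch: a forward map from data to noise with an exact analytic inverse, next to a reverse model that is instead fitted from (noise, data) pairs. The affine map, its parameters `mu` and `sigma`, and the least-squares fit are all assumptions chosen for illustration; BiFlow itself uses learned neural networks, not this toy.

```python
def forward(x, mu=2.0, sigma=0.5):
    """Forward process: map a data point to 'noise' (toy affine flow)."""
    return (x - mu) / sigma


def analytic_inverse(z, mu=2.0, sigma=0.5):
    """Exact inverse of forward(), available because the map is affine."""
    return mu + sigma * z


def fit_reverse(zs, xs):
    """Fit x ~ a * z + b by least squares, without using the analytic inverse."""
    n = len(zs)
    mean_z = sum(zs) / n
    mean_x = sum(xs) / n
    cov = sum((z - mean_z) * (x - mean_x) for z, x in zip(zs, xs))
    var = sum((z - mean_z) ** 2 for z in zs)
    a = cov / var
    b = mean_x - a * mean_z
    return a, b


data = [1.0, 1.5, 2.0, 2.5, 3.0]
noise = [forward(x) for x in data]
a, b = fit_reverse(noise, data)  # learned reverse model: x ~ a * z + b
```

In this toy case the fitted reverse model recovers the analytic inverse, because the forward map is affine. The point of a learned reverse is that it no longer requires the forward map to be analytically invertible.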

Title: Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration

Authors: Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10954
Pdf URL: https://arxiv.org/pdf/2512.10954
Copy Paste: [[2512.10954]] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration(https://arxiv.org/abs/2512.10954)
Keywords: generation, generative
Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra- and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
摘要：在这项工作中，我们探索了扩散模型推理中未开发的信号。虽然之前的所有方法都在推理时独立生成图像，但我们转而询问是否可以协作生成样本。我们提出群组扩散，解锁在图像之间共享的注意力机制，而不仅仅局限于图像中的补丁。这使得图像能够在推理时联合去噪，学习图像内和图像间的对应关系。我们观察到明显的规模效应——更大的群体规模会产生更强的跨样本注意力和更好的生成质量。此外，我们引入了一种定性测量来捕获这种行为，并表明其强度与 FID 密切相关。我们的 GroupDiff 基于标准扩散变压器构建，在 ImageNet-256x256 上实现了高达 32.2% 的 FID 改进。我们的工作揭示了跨样本推理是一种有效的、先前未经探索的生成建模机制。
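The idea of sharing attention across images, described in the abstract above, can be sketched roughly as follows. The patch tokens of every image in a group are flattened into one sequence, attention runs over that sequence, and the result is split back per image. This is a simplified sketch under stated assumptions: queries, keys, and values are the raw tokens with no learned projections, and the function names are hypothetical rather than taken from GroupDiff.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out


def per_image_attention(group):
    """Baseline: each image attends only to its own patches."""
    return [attention(img, img, img) for img in group]


def group_attention(group):
    """Cross-sample variant: patches attend to patches of all images in the group."""
    flat = [tok for img in group for tok in img]
    mixed = attention(flat, flat, flat)
    out, start = [], 0
    for img in group:
        out.append(mixed[start:start + len(img)])
        start += len(img)
    return out


# Two hypothetical "images", each with two 2-dimensional patch tokens.
group = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
shared = group_attention(group)
independent = per_image_attention(group)
```

With a group of size one the two functions coincide. With larger groups each patch can also draw on patches from the other images, which is the cross-sample signal the abstract refers to.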

Title: Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10955
Pdf URL: https://arxiv.org/pdf/2512.10955
Copy Paste: [[2512.10955]] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization(https://arxiv.org/abs/2512.10955)
Keywords: generation, generative
Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
摘要：视觉概念个性化旨在仅将特定的图像属性（例如身份、表达、灯光和风格）转移到看不见的环境中。然而，现有的方法依赖于通用图像编码器的整体嵌入，这会纠缠多个视觉因素，使得很难隔离单个属性。这通常会导致信息泄漏和不连贯的合成。为了解决这个限制，我们引入了 Omni-Attribute，这是第一个开放词汇图像属性编码器，旨在学习高保真、特定于属性的表示。我们的方法联合设计数据和模型：（i）我们策划用正面和负面属性注释的语义链接图像对，以明确地教导编码器保留或抑制什么； (ii)我们采用双目标训练范式，平衡生成保真度和对比解开。由此产生的嵌入被证明对于开放词汇属性检索、个性化和组合生成是有效的，在多个基准测试中实现了最先进的性能。

Title: SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10957
Pdf URL: https://arxiv.org/pdf/2512.10957
Copy Paste: [[2512.10957]] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model(https://arxiv.org/abs/2512.10957)
Keywords: generation
Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at this https URL.
摘要：在这项工作中，我们提出了一个名为 SceneMaker 的解耦 3D 场景生成框架。由于缺乏足够的开放集去遮挡和姿态估计先验，现有方法很难在严重遮挡和开放集设置下同时生成高质量的几何图形和准确的姿态。为了解决这些问题，我们首先将去遮挡模型与 3D 对象生成分离，并通过利用图像数据集和收集的去遮挡数据集来增强它，以获得更多样化的开放集遮挡模式。然后，我们提出了一个统一的姿态估计模型，该模型集成了自注意力和交叉注意力的全局和局部机制，以提高准确性。此外，我们构建了一个开放集 3D 场景数据集，以进一步扩展姿态估计模型的泛化。综合实验证明了我们的解耦框架在室内和开放场景中的优越性。我们的代码和数据集在此 https URL 发布。

Title: WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10958
Pdf URL: https://arxiv.org/pdf/2512.10958
Copy Paste: [[2512.10958]] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World(https://arxiv.org/abs/2512.10958)
Keywords: generation, generative
Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
摘要：生成世界模型正在重塑具体的人工智能，使智能体能够合成逼真的 4D 驾驶环境，这些环境看起来令人信服，但在物理或行为上往往会失败。尽管进展迅速，该领域仍然缺乏统一的方法来评估生成的世界是否保留几何形状、服从物理或支持可靠的控制。我们推出了 WorldLens，这是一个全方位基准测试，用于评估模型在其生成的世界中构建、理解和行为的情况。它涵盖五个方面——生成、重构、行动跟踪、下游任务和人类偏好——共同涵盖视觉真实性、几何一致性、物理合理性和功能可靠性。在这些维度上，没有一个现有的世界模型能够普遍胜出：那些具有强纹理的世界模型常常违反物理原理，而几何稳定的世界模型则缺乏行为保真度。为了使客观指标与人类判断保持一致，我们进一步构建了 WorldLens-26K，这是一个包含数字分数和文本原理的人工注释视频的大型数据集，并开发了 WorldLens-Agent，这是一个从这些注释中提炼出来的评估模型，以实现可扩展、可解释的评分。基准、数据集和代理一起形成了一个统一的生态系统，用于衡量世界保真度——标准化未来模型的判断方式，不仅通过它们看起来有多真实，还通过它们的行为有多真实。

Title: StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Authors: Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.10959
Pdf URL: https://arxiv.org/pdf/2512.10959
Copy Paste: [[2512.10959]] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space(https://arxiv.org/abs/2512.10959)
Keywords: generation
Abstract: We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
摘要：我们引入了 StereoSpace，这是一种基于扩散的单目到立体合成框架，它纯粹通过视点调节来建模几何形状，没有明确的深度或扭曲。规范的校正空间和条件引导生成器推断对应关系并端到端地填充去遮挡。为了确保公平且无泄漏的评估，我们引入了一种端到端协议，该协议在测试时排除任何地面实况或代理几何估计。该协议强调反映下游相关性的指标：用于感知舒适度的 iSQoE 和用于几何一致性的 MEt3R。 StereoSpace 超越了扭曲和修复、潜在扭曲和扭曲调节类别中的其他方法，在分层和非朗伯场景上实现了锐利的视差和强大的鲁棒性。这将视点条件扩散建立为立体生成的可扩展、无深度解决方案。