2025-12-09

Title: Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices

Authors: Hokin Deng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.05969
Pdf URL: https://arxiv.org/pdf/2512.05969
Copy Paste: [[2512.05969]] Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices(https://arxiv.org/abs/2512.05969)
Keywords: generation
Abstract: We show that video generation models can now reason. Tested on tasks such as chess, maze, Sudoku, mental rotation, and Raven's Matrices, leading models such as Sora-2 achieve success rates of sixty percent. We establish a robust experimental paradigm centered on the "Task Pair" design. We build a code framework, with 39 models already available, that supports this paradigm and scales easily: users can add models and tasks efficiently. We show that our automated evaluation correlates strongly with human judgment, and that the paradigm is therefore highly scalable. Given the availability of this paradigm, we see an opportunity to use reinforcement learning to improve reasoning in video models. All of our raw results and our VMEvalKit codebase can be checked out via the links on the abstract page.

Title: EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

Authors: Chang Liu, Tianjiao Jing, Chengcheng Ma, Xuanqi Zhou, Zhengxuan Lian, Qin Jin, Hongliang Yuan, Shi-Sheng Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05991
Pdf URL: https://arxiv.org/pdf/2512.05991
Copy Paste: [[2512.05991]] EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head(https://arxiv.org/abs/2512.05991)
Keywords: generation
Abstract: Recent photo-realistic 3D talking heads built with 3D Gaussian Splatting still have significant shortcomings in emotional expression manipulation, especially for fine-grained and expansive dynamic emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for a fine-grained facial animator, together with an accurate text-to-AU emotion controller that provides accurate and expansive dynamic emotional editing from text input. Experiments on the public EmoTalk3D and RenderMe-360 datasets demonstrate the superior emotional subtlety, lip-sync fidelity, and controllability of EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To the best of our knowledge, EmoDiffTalk is one of the first 3D Gaussian Splatting talking-head generation frameworks, and in particular one that supports continuous, multimodal emotional editing within the AU-based expression space.

Title: Domain-Specific Foundation Model Improves AI-Based Analysis of Neuropathology

Authors: Ruchika Verma, Shrishtee Kandoi, Robina Afzal, Shengjia Chen, Jannes Jegminat, Michael W. Karlovich, Melissa Umphlett, Timothy E. Richardson, Kevin Clare, Quazi Hossain, Jorge Samanamud, Phyllis L. Faust, Elan D. Louis, Ann C. McKee, Thor D. Stein, Jonathan D. Cherry, Jesse Mez, Anya C. McGoldrick, Dalilah D. Quintana Mora, Melissa J. Nirenberg, Ruth H. Walker, Yolfrankcis Mendez, Susan Morgello, Dennis W. Dickson, Melissa E. Murray, Carlos Cordon-Cardo, Nadejda M. Tsankova, Jamie M. Walker, Diana K. Dangoor, Stephanie McQuillan, Emma L. Thorn, Claudia De Sanctis, Shuying Li, Thomas J. Fuchs, Kurt Farrell, John F. Crary, Gabriele Campanella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.05993
Pdf URL: https://arxiv.org/pdf/2512.05993
Copy Paste: [[2512.05993]] Domain-Specific Foundation Model Improves AI-Based Analysis of Neuropathology(https://arxiv.org/abs/2512.05993)
Keywords: generation, generative
Abstract: Foundation models have transformed computational pathology by providing generalizable representations from large-scale histology datasets. However, existing models are predominantly trained on surgical pathology data, which is enriched for non-nervous tissue and overrepresents neoplastic, inflammatory, metabolic, and other non-neurological diseases. Neuropathology represents a markedly different domain of histopathology, characterized by unique cell types (neurons, glia, etc.), distinct cytoarchitecture, and disease-specific pathological features including neurofibrillary tangles, amyloid plaques, Lewy bodies, and pattern-specific neurodegeneration. This domain mismatch may limit the ability of general-purpose foundation models to capture the morphological patterns critical for interpreting neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and cerebellar ataxias. To address this gap, we developed NeuroFM, a foundation model trained specifically on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies. NeuroFM demonstrates superior performance compared to general-purpose models across multiple neuropathology-specific downstream tasks, including mixed dementia disease classification, hippocampal region segmentation, and neurodegenerative ataxia identification encompassing cerebellar essential tremor and spinocerebellar ataxia subtypes. This work establishes that domain-specialized foundation models trained on brain tissue can better capture neuropathology-specific features than models trained on general surgical pathology datasets. By tailoring foundation models to the unique morphological landscape of neurodegenerative diseases, NeuroFM enables more accurate and reliable AI-based analysis for brain disease diagnosis and research, setting a precedent for domain-specific model development in specialized areas of digital pathology.

Title: PrunedCaps: A Case For Primary Capsules Discrimination

Authors: Ramin Sharifi, Pouya Shiri, Amirali Baniasadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06003
Pdf URL: https://arxiv.org/pdf/2512.06003
Copy Paste: [[2512.06003]] PrunedCaps: A Case For Primary Capsules Discrimination(https://arxiv.org/abs/2512.06003)
Keywords: generation
Abstract: Capsule Networks (CapsNets) are a generation of image classifiers with proven advantages over Convolutional Neural Networks (CNNs). Better robustness to affine transformations and better detection of overlapping images are some of the benefits associated with CapsNets. However, CapsNets cannot be classified as a resource-efficient deep learning architecture because of the high number of Primary Capsules (PCs). In addition, CapsNet training and testing are slow and resource hungry. This paper investigates the possibility of pruning Primary Capsules in CapsNets on the MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and SVHN datasets. We show that a pruned version of CapsNet performs up to 9.90 times faster than the conventional architecture by removing 95 percent of Capsules without a loss of accuracy. Our pruned architecture also saves more than 95.36 percent of floating-point operations in the dynamic routing stage of the architecture. Moreover, we provide insight into why some datasets benefit significantly from pruning while others fall behind.

Title: VAT: Vision Action Transformer by Unlocking Full Representation of ViT

Authors: Wenhao Li, Chengwei Ma, Weixin Mao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.06013
Pdf URL: https://arxiv.org/pdf/2512.06013
Copy Paste: [[2512.06013]] VAT: Vision Action Transformer by Unlocking Full Representation of ViT(https://arxiv.org/abs/2512.06013)
Keywords: generation
Abstract: In robot learning, Vision Transformers (ViTs) are standard for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue this provides an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that extends ViT and unlocks its full feature hierarchy. VAT processes specialized action tokens together with visual features across all transformer layers, enabling a deep and progressive fusion of perception and action generation. On a suite of simulated manipulation tasks, VAT achieves a 98.15% average success rate across four LIBERO benchmarks, establishing a new state of the art by outperforming prior methods such as OpenVLA-OFT. Our work not only presents a powerful model for imitation learning but also demonstrates the critical importance of leveraging the complete "representation trajectory" of vision models to advance robotic policy. The project code is available on GitHub via the link on the abstract page.

Title: PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Ligong Han, Ying Ba, Dimitris N. Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06020
Pdf URL: https://arxiv.org/pdf/2512.06020
Copy Paste: [[2512.06020]] PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation(https://arxiv.org/abs/2512.06020)
Keywords: generation, generative
Abstract: Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure compatibility with diffusion text encoders, we design a maximum mean discrepancy-based alignment loss that bridges the modality gap while preserving multimodal structure. The resulting embeddings are used to condition the generator, enabling faithful adherence to both prompts and user preferences. Extensive experiments demonstrate that our method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
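
The abstract above mentions an alignment loss based on maximum mean discrepancy (MMD). As a rough illustration of the quantity involved, the sketch below computes the standard biased estimate of squared MMD with a Gaussian kernel between two small sets of vectors. It is a generic textbook estimator, not the authors' loss: the kernel choice, the `bandwidth` parameter, and the toy embeddings are all assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy embeddings (not from the paper).
user_embeddings = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0]]
text_embeddings_near = [[0.05, 0.05], [0.15, -0.05], [0.1, 0.1]]
text_embeddings_far = [[2.0, 2.1], [2.2, 1.9], [2.1, 2.0]]

mmd_near = mmd_squared(user_embeddings, text_embeddings_near)
mmd_far = mmd_squared(user_embeddings, text_embeddings_far)
print(mmd_near, mmd_far)  # the far set yields the larger discrepancy
```

Minimizing a quantity of this kind pulls the two embedding distributions together as whole distributions rather than forcing a point-by-point match, which fits the abstract's stated goal of bridging the modality gap while preserving multimodal structure.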

Title: The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06032
Pdf URL: https://arxiv.org/pdf/2512.06032
Copy Paste: [[2512.06032]] The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation(https://arxiv.org/abs/2512.06032)
Keywords: generation
Abstract: This paper investigates the fundamental discontinuity between the latest two Segment Anything Models: SAM2 and SAM3. We explain why the expertise in prompt-based segmentation of SAM2 does not transfer to the multimodal concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting spatial prompt semantics of SAM2 with multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing pure vision-temporal design of SAM2 versus integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity-handling via Mixture-of-Experts in SAM3; (3) Dataset and Annotation Differences, contrasting SA-V video masks with multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.

Title: Deep learning recognition and analysis of Volatile Organic Compounds based on experimental and synthetic infrared absorption spectra

Authors: Andrea Della Valle, Annalisa D'Arco, Tiziana Mancini, Rosanna Mosetti, Maria Chiara Paolozzi, Stefano Lupi, Sebastiano Pilati, Andrea Perali
Subjects: cs.LG, physics.app-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2512.06059
Pdf URL: https://arxiv.org/pdf/2512.06059
Copy Paste: [[2512.06059]] Deep learning recognition and analysis of Volatile Organic Compounds based on experimental and synthetic infrared absorption spectra(https://arxiv.org/abs/2512.06059)
Keywords: generative
Abstract: Volatile Organic Compounds (VOCs) are organic molecules that have low boiling points and therefore easily evaporate into the air. They pose significant risks to human health, making their accurate detection the crux of efforts to monitor and minimize exposure. Infrared (IR) spectroscopy enables ultrasensitive detection of VOCs at low concentrations in the atmosphere by measuring their IR absorption spectra. However, the complexity of the IR spectra limits the possibility of implementing VOC recognition and quantification in real time. While deep neural networks (NNs) are increasingly used for the recognition of complex data structures, they typically require massive datasets for the training phase. Here, we create an experimental VOC dataset for nine different classes of compounds at various concentrations, using their IR absorption spectra. To further increase the number of spectra and their diversity in terms of VOC concentration, we augment the experimental dataset with synthetic spectra created via conditional generative NNs. This allows us to train robust discriminative NNs that reliably identify the nine VOCs and precisely predict their concentrations. The trained NN is suitable for incorporation into sensing devices for VOC recognition and analysis.

Title: When Privacy Isn't Synthetic: Hidden Data Leakage in Generative AI Models

Authors: S.M. Mustaqim, Anantaa Kotal, Paul H. Yi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06062
Pdf URL: https://arxiv.org/pdf/2512.06062
Copy Paste: [[2512.06062]] When Privacy Isn't Synthetic: Hidden Data Leakage in Generative AI Models(https://arxiv.org/abs/2512.06062)
Keywords: generation, generative
Abstract: Generative models are increasingly used to produce privacy-preserving synthetic data as a safe alternative to sharing sensitive training datasets. However, we demonstrate that such synthetic releases can still leak information about the underlying training samples through structural overlap in the data manifold. We propose a black-box membership inference attack that exploits this vulnerability without requiring access to model internals or real data. The attacker repeatedly queries the generative model to obtain large numbers of synthetic samples, performs unsupervised clustering to identify dense regions of the synthetic distribution, and then analyzes cluster medoids and neighborhoods that correspond to high-density regions in the original training data. These neighborhoods act as proxies for training samples, enabling the adversary to infer membership or reconstruct approximate records. Our experiments across healthcare, finance, and other sensitive domains show that cluster overlap between real and synthetic data leads to measurable membership leakage, even when the generator is trained with differential privacy or other noise mechanisms. The results highlight an under-explored attack surface in synthetic data generation pipelines and call for stronger privacy guarantees that account for distributional neighborhood inference rather than sample-level memorization alone, underscoring its role in privacy-preserving data publishing. Implementation and evaluation code are publicly available via the link on the abstract page.
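
The attack pipeline described above (sample, cluster, inspect medoids and their neighborhoods) can be sketched in a few lines. The code below is a deliberately simplified illustration under stated assumptions: it uses a tiny k-means with farthest-point initialization as the clustering step, and it scores a candidate record by its negative distance to the nearest cluster medoid. The abstract does not specify the clustering algorithm or the scoring rule, so `kmeans`, `membership_score`, and the toy data are all hypothetical stand-ins.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    """Deterministic initialization: start at points[0], then add farthest points."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations; returns the list of clusters (lists of points)."""
    centers = farthest_point_init(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


def medoid(cluster):
    """The cluster member with the smallest total distance to all other members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))


def membership_score(candidate, medoids):
    """Higher (closer to zero) means the candidate sits nearer a dense synthetic region."""
    return -min(dist(candidate, m) for m in medoids)


# Hypothetical synthetic samples obtained by querying a generator.
synthetic = [
    [0.0, 0.1], [0.1, -0.1], [-0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 4.9], [4.9, 5.0], [5.05, 5.05],
]
clusters = kmeans(synthetic, k=2)
medoids = [medoid(c) for c in clusters if c]

print(membership_score([0.0, 0.0], medoids))  # near a dense region
print(membership_score([2.5, 2.5], medoids))  # far from every medoid
```

An attacker following this outline would threshold the score to guess membership; how that threshold is chosen, and how the attack behaves against differentially private generators, is what the paper's experiments address.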

Title: Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation

Authors: Junwen Zheng, Xinran Xu, Li Rong Wang, Chang Cai, Lucinda Siyun Tan, Dingyuan Wang, Hong Liang Tey, Xiuyi Fan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06105
Pdf URL: https://arxiv.org/pdf/2512.06105
Copy Paste: [[2512.06105]] Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation(https://arxiv.org/abs/2512.06105)
Keywords: generation
Abstract: Deep learning has demonstrated expert-level performance in melanoma classification, positioning it as a powerful tool in clinical dermatology. However, model opacity and the lack of interpretability remain critical barriers to clinical adoption, as clinicians often struggle to trust the decision-making processes of black-box models. To address this gap, we present a Cross-modal Explainable Framework for Melanoma (CEFM) that leverages contrastive learning as the core mechanism for achieving interpretability. Specifically, CEFM maps clinical criteria for melanoma diagnosis, namely Asymmetry, Border, and Color (ABC), into the Vision Transformer embedding space using dual projection heads, thereby aligning clinical semantics with visual features. The aligned representations are subsequently translated into structured textual explanations via natural language generation, creating a transparent link between raw image data and clinical interpretation. Experiments on public datasets demonstrate 92.79% accuracy and an AUC of 0.961, along with significant improvements across multiple interpretability metrics. Qualitative analyses further show that the spatial arrangement of the learned embeddings aligns with clinicians' application of the ABC rule, effectively bridging the gap between high-performance classification and clinical trust.

Title: Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

Authors: Su Sun, Cheng Zhao, Himangi Mittal, Gaurav Mittal, Rohith Kukkala, Yingjie Victor Chen, Mei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06158
Pdf URL: https://arxiv.org/pdf/2512.06158
Copy Paste: [[2512.06158]] Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation(https://arxiv.org/abs/2512.06158)
Keywords: generation
Abstract: Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling. Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.

Title: Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction

Authors: Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06174
Pdf URL: https://arxiv.org/pdf/2512.06174
Copy Paste: [[2512.06174]] Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction(https://arxiv.org/abs/2512.06174)
Keywords: generation
Abstract: Shadow generation aims to produce photorealistic shadows that are visually consistent with object geometry and scene illumination. In the physics of shadow formation, the occluder blocks some of the light rays cast from the light source that would otherwise arrive at the surface, creating a shadow that follows the silhouette of the occluder. However, such explicit physical modeling has rarely been used in deep-learning-based shadow generation. In this paper, we propose a novel framework that embeds explicit physical modeling (geometry and illumination) into deep-learning-based shadow generation. First, given a monocular RGB image, we obtain approximate 3D geometry in the form of dense point maps and predict a single dominant light direction. These signals allow us to recover a fairly accurate shadow location and shape based on the physics of shadow formation. We then integrate this physics-based initial estimate into a diffusion framework that refines the shadow into a realistic, high-fidelity appearance while ensuring consistency with scene geometry and illumination. Trained on DESOBAV2, our model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting.
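
The physics referred to above reduces, in its simplest form, to following each light ray past the occluder until it hits the receiving surface. The sketch below shows that step for a single directional light and a flat ground plane. It is only an illustration of the geometric principle: the paper works with dense point maps rather than a flat plane, and the function name `project_shadow` and the `ground_z` parameter are assumptions introduced here.

```python
def project_shadow(point, light_dir, ground_z=0.0):
    """Project a 3D occluder point onto the plane z = ground_z.

    light_dir is the direction in which light travels (from the source
    toward the scene), so it must have a negative z component.
    """
    px, py, pz = point
    dx, dy, dz = light_dir
    if dz >= 0:
        raise ValueError("light must travel downward (negative z component)")
    # Solve pz + t * dz == ground_z for the ray parameter t.
    t = (ground_z - pz) / dz
    if t < 0:
        raise ValueError("point lies below the ground plane")
    return (px + t * dx, py + t * dy, ground_z)


# A point 3 units above the ground, lit from a 45-degree angle along +x.
print(project_shadow((1.0, 2.0, 3.0), (1.0, 0.0, -1.0)))  # (4.0, 2.0, 0.0)
```

Projecting every occluder point this way gives a coarse shadow footprint of the kind the abstract describes as a physics-based initial estimate, which the diffusion stage then refines.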

Title: How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?

Authors: Tomohiro Yamashita, Daichi Amagata, Yusuke Matsui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.06200
Pdf URL: https://arxiv.org/pdf/2512.06200
Copy Paste: [[2512.06200]] How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?(https://arxiv.org/abs/2512.06200)
Keywords: generation
Abstract: Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation. Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest. However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established. This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases. Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically. The performance is assessed in terms of accuracy, query speed, and other relevant metrics. Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy.
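
The abstract does not name its three deletion categories, so the sketch below should not be read as the paper's taxonomy. It only illustrates why deletion strategy matters in a graph index, by contrasting two simple options on a toy graph: lazy "tombstone" deletion, where a deleted node stays in the graph and is traversed but never returned, and naive physical removal without edge repair. The `search` function is a hypothetical exhaustive best-first traversal written for brevity; real indexes such as HNSW bound exploration with a candidate list.

```python
import heapq
import math


def search(graph, vectors, deleted, query, entry, k=1):
    """Best-first graph search that traverses tombstoned nodes but never returns them."""
    def d(i):
        return math.dist(vectors[i], query)

    visited = {entry}
    frontier = [(d(entry), entry)]
    results = []
    while frontier:
        dist_i, i = heapq.heappop(frontier)
        if i not in deleted:
            results.append((dist_i, i))
        for j in graph[i]:
            if j not in visited:
                visited.add(j)
                heapq.heappush(frontier, (d(j), j))
    results.sort()
    return [i for _, i in results[:k]]


# Toy index: four points on a line, connected as a chain 0 - 1 - 2 - 3.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
query = (2.1, 0.0)

before = search(graph, vectors, set(), query, entry=0)
# Tombstone deletion: node 2 is hidden from results but still bridges 1 and 3.
lazy = search(graph, vectors, {2}, query, entry=0)
# Naive physical removal without edge repair: node 3 becomes unreachable.
graph_hard = {0: [1], 1: [0], 3: []}
hard = search(graph_hard, vectors, set(), query, entry=0)

print(before, lazy, hard)  # [2] [3] [1]
```

In this toy case the tombstone variant still finds the true nearest remaining neighbor (node 3), while the unrepaired physical removal returns a worse answer (node 1). Trade-offs of this kind between accuracy and cost are what an evaluation framework for deletion has to measure.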

Title: RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

Authors: Tianyi Gao, Hao Li, Han Fang, Xin Wei, Xiaodong Dong, Hongbo Sun, Ye Yuan, Zhongjiang He, Jinglin Xu, Jingmin Xin, Hao Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06276
Pdf URL: https://arxiv.org/pdf/2512.06276
Copy Paste: [[2512.06276]] RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension(https://arxiv.org/abs/2512.06276)
Keywords: generation
Abstract: Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, so they cannot reveal the grounding capability of Multi-modal Large Language Models (MLLMs) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark that decomposes referring expressions into two core dimensions, perception and reasoning, and further subdivides them into six progressively challenging tasks: attribute, position, interaction, commonsense, relation, and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, we propose Ref-R1, an RL-based learning scheme that incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions, establishing a stronger baseline for REC. Extensive experiments demonstrate that RefBench-PRO enables interpretable evaluation of MLLMs on referring expression comprehension, presenting greater challenges in both perception and reasoning.
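
Localization quality in REC is usually scored with intersection over union (IoU) between a predicted and a ground-truth box, and the abstract ties its RL scheme to an IoU-based signal. The sketch below computes IoU for axis-aligned boxes and wraps it in a thresholded reward. The reward shape and the `threshold` parameter are hypothetical: the abstract does not say how "Dynamic IoU" enters the GRPO objective, so this shows only the underlying metric plus one plausible way to turn it into a reward.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred, target, threshold):
    """Hypothetical binary reward: 1.0 when IoU reaches the threshold, else 0.0."""
    return 1.0 if box_iou(pred, target) >= threshold else 0.0


pred = (0.0, 0.0, 2.0, 2.0)
target = (1.0, 1.0, 3.0, 3.0)
print(box_iou(pred, target))          # overlap 1, union 7
print(iou_reward(pred, target, 0.5))  # 0.0, since 1/7 is below 0.5
```

A schedule that raises `threshold` over the course of training would be one way to make such a reward "dynamic", but that reading is speculative.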

Title: ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models

Authors: Jiahao Li, Yusheng Luo, Yunzhong Lou, Xiangdong Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06328
Pdf URL: https://arxiv.org/pdf/2512.06328
Copy Paste: [[2512.06328]] ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models(https://arxiv.org/abs/2512.06328)
Keywords: generation, generative
Abstract: We present ReCAD, a reinforcement learning (RL) framework that bootstraps pretrained large models (PLMs) to generate precise parametric computer-aided design (CAD) models from multimodal inputs by leveraging their inherent generative capabilities. With just access to simple functional interfaces (e.g., point coordinates), our approach enables the emergence of complex CAD operations (e.g., pattern replication and mirror). This stands in contrast to previous methods, which typically rely on knowledge injected through supervised fine-tuning (SFT), offer limited support for editability, and fail to exploit the strong generative priors of PLMs. Specifically, the ReCAD framework begins by fine-tuning vision-language models (VLMs) to equip them with basic CAD model generation capabilities, where we rewrite CAD scripts into parameterized code that is leveraged to generate accurate textual descriptions for supervision. Then, we propose a novel RL strategy that incorporates parameterized code as guidance to enhance the model's reasoning on challenging questions. Furthermore, we employ a hierarchical primitive learning process to progressively teach structured and compositional skills under a unified reward function that ensures both geometric accuracy and semantic fidelity. ReCAD sets a new state-of-the-art in both text-to-CAD and image-to-CAD tasks, significantly improving geometric accuracy across in-distribution and out-of-distribution settings. In the image-to-CAD task, for instance, it reduces the mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), outperforming existing baselines by a substantial margin.
摘要：我们提出了 ReCAD，这是一种强化学习 (RL) 框架，可引导预训练的大型模型 (PLM)，利用其固有的生成能力，从多模式输入生成精确的参数化计算机辅助设计 (CAD) 模型。只需访问简单的功能接口（例如点坐标），我们的方法就可以实现复杂的 CAD 操作（例如图案复制和镜像）。这与以前的方法形成鲜明对比，以前的方法通常依赖于通过监督微调 (SFT) 注入的知识，对可编辑性的支持有限，并且无法利用 PLM 强大的生成先验。具体来说，ReCAD 框架首先微调视觉语言模型 (VLM)，为其配备基本的 CAD 模型生成功能，其中我们将 CAD 脚本重写为参数化代码，用于生成准确的文本描述以进行监督。然后，我们提出了一种新颖的强化学习策略，该策略结合了参数化代码作为指导，以增强模型对挑战性问题的推理能力。此外，我们采用分层原始学习过程，在统一的奖励函数下逐步教授结构化和组合技能，确保几何准确性和语义保真度。 ReCAD 在文本到 CAD 和图像到 CAD 任务方面树立了新的最先进水平，显着提高了分布内和分布外设置的几何精度。例如，在图像到 CAD 任务中，它将平均倒角距离从 73.47 减少到 29.61（分布内），从 272.06 减少到 80.23（分布外），大大优于现有基线。
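
Note on the metric: the Chamfer Distance figures quoted above compare point clouds sampled from generated and reference shapes. The abstract does not state which variant or scaling is used, so the following is a minimal sketch of one common form (the sum of the two mean squared nearest-neighbour distances). The function names are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_sq_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Squared Euclidean distance from p to its closest point in cloud."""
    return min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance
    from a to b plus the same quantity from b to a (one common variant)."""
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(_nearest_sq_dist(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a
```

This brute-force version is quadratic in the number of points; practical evaluations typically use a spatial index or a GPU implementation.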

Title: Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate

Authors: Kaile Wang, Lijun He, Haisheng Fu, Haixia Bi, Fan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06344
Pdf URL: https://arxiv.org/pdf/2512.06344
Copy Paste: [[2512.06344]] Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate(https://arxiv.org/abs/2512.06344)
Keywords: generation, generative
Abstract: Generative image compression has recently shown impressive perceptual quality, but often suffers from semantic deviations caused by generative hallucinations at ultra-low bitrate (bpp < 0.05), limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of multimodal guidance, and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise but robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. The SPWs are generated by our designed Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner to drive the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. Subsequently, to facilitate the synergistic guidance of these modalities, we design a Multimodal-Guided Diffusion Decoder (MGDD) employing a dual-path cooperative guidance mechanism that combines cross-attention and ControlNet additive residuals to precisely inject these three guidance signals into the diffusion process, and leverages the diffusion model's powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.
摘要：生成图像压缩最近显示出令人印象深刻的感知质量，但经常遭受超低比特率（bpp < 0.05）下生成幻觉引起的语义偏差，限制了其在带宽受限的 6G 语义通信场景中的可靠部署。在这项工作中，我们重新评估了多模态引导的定位和作用，并提出了多模态引导任务感知生成图像压缩（MTGC）框架。具体来说，MTGC 集成了三种指导模式来增强语义一致性：用于全局语义的简洁但稳健的文本标题、保留低级视觉信息的高度压缩图像（HCI）以及用于细粒度任务相关语义的语义伪词（SPW）。 SPW是由我们设计的任务感知语义压缩模块（TASCM）生成的，该模块以面向任务的方式运行，驱动多头自注意力机制关注并提取与生成任务相关的语义，同时过滤掉冗余。随后，为了促进这些模态的协同引导，我们设计了一种多模态引导扩散解码器（MGDD），采用双路径协作引导机制，协同交叉注意力和ControlNet加性残差，将这三种引导精确地注入到扩散过程中，并利用扩散模型强大的生成先验来重建图像。大量实验表明，MTGC 持续提高了语义一致性（例如，DIV2K 数据集上的 DISTS 下降了 10.59%），同时在超低比特率下在感知质量和像素级保真度方面也取得了显着的进步。

Title: TreeQ: Pushing the Quantization Boundary of Diffusion Transformer via Tree-Structured Mixed-Precision Search

Authors: Kaicheng Yang, Kaisen Yang, Baiting Wu, Xun Zhang, Qianrui Yang, Haotong Qin, He Zhang, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06353
Pdf URL: https://arxiv.org/pdf/2512.06353
Copy Paste: [[2512.06353]] TreeQ: Pushing the Quantization Boundary of Diffusion Transformer via Tree-Structured Mixed-Precision Search(https://arxiv.org/abs/2512.06353)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have emerged as a highly scalable and effective backbone for image generation, outperforming U-Net architectures in both scalability and performance. However, their real-world deployment remains challenging due to high computational and memory demands. Mixed-Precision Quantization (MPQ), designed to push the limits of quantization, has demonstrated remarkable success in advancing U-Net quantization to sub-4-bit settings while significantly reducing computational and memory overhead. Nevertheless, its application to DiT architectures remains limited and underexplored. In this work, we propose TreeQ, a unified framework addressing key challenges in DiT quantization. First, to tackle inefficient search and proxy misalignment, we introduce Tree Structured Search (TSS). This DiT-specific approach leverages the architecture's linear properties to traverse the solution space in O(n) time while improving objective accuracy through comparison-based pruning. Second, to unify optimization objectives, we propose Environmental Noise Guidance (ENG), which aligns Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) configurations using a single hyperparameter. Third, to mitigate information bottlenecks in ultra-low-bit regimes, we design the General Monarch Branch (GMB). This structured sparse branch prevents irreversible information loss, enabling finer detail generation. Through extensive experiments, our TreeQ framework demonstrates state-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. Notably, our work is the first to achieve near-lossless 4-bit PTQ performance on DiT models. The code and models will be available at this https URL.
摘要：扩散变压器 (DiT) 已成为图像生成的高度可扩展且有效的骨干，在可扩展性和性能方面均优于 U-Net 架构。然而，由于高计算和内存需求，它们的实际部署仍然具有挑战性。混合精度量化 (MPQ) 旨在突破量化极限，在将 U-Net 量化提升到低于 4 位设置方面取得了显着成功，同时显着降低了计算和内存开销。然而，它在 DiT 架构中的应用仍然有限且尚未得到充分探索。在这项工作中，我们提出了 TreeQ，这是一个解决 DiT 量化中关键挑战的统一框架。首先，为了解决低效搜索和代理错位问题，我们引入了树结构搜索（TSS）。这种特定于 DiT 的方法利用架构的线性属性在 O(n) 时间内遍历解决方案空间，同时通过基于比较的修剪来提高目标准确性。其次，为了统一优化目标，我们提出了环境噪声指导（ENG），它使用单个超参数来调整训练后量化（PTQ）和量化感知训练（QAT）配置。第三，为了缓解超低位体制中的信息瓶颈，我们设计了通用君主分支（GMB）。这种结构化的稀疏分支可防止不可逆的信息丢失，从而实现更精细的细节生成。通过大量实验，我们的 TreeQ 框架在 W3A3 和 W4A4 PTQ/PEFT 设置下展示了 DiT-XL/2 上最先进的性能。值得注意的是，我们的工作是第一个在 DiT 模型上实现近乎无损 4 位 PTQ 性能的工作。代码和模型将在此 https URL 中提供

Title: DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction

Authors: Yifan Song, Fenglin Yu, Yihong Luo, Xingjian Tao, Siya Qiu, Kai Han, Jing Tang
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2512.06356
Pdf URL: https://arxiv.org/pdf/2512.06356
Copy Paste: [[2512.06356]] DDFI: Diverse and Distribution-aware Missing Feature Imputation via Two-step Reconstruction(https://arxiv.org/abs/2512.06356)
Keywords: generation
Abstract: Incomplete node features are ubiquitous in real-world scenarios, e.g., the attributes of web users may be partly private, which causes the performance of Graph Neural Networks (GNNs) to decline significantly. Feature propagation (FP) is a well-known method that performs well for imputation of missing node features on graphs, but it still has the following three issues: 1) it struggles with graphs that are not fully connected, 2) imputed features face the over-smoothing problem, and 3) FP is tailored for transductive tasks, overlooking the feature distribution shift in inductive tasks. To address these challenges, we introduce DDFI, a Diverse and Distribution-aware Missing Feature Imputation method that combines feature propagation with a graph-based Masked AutoEncoder (MAE) in a nontrivial manner. It first designs a simple yet effective algorithm, namely Co-Label Linking (CLL), that randomly connects nodes in the training set with the same label to enhance the performance on graphs with numerous connected components. Then we develop a novel two-step representation generation process at the inference stage. Specifically, instead of directly using FP-imputed features as input during inference, DDFI further reconstructs the features through the whole MAE to reduce feature distribution shift in the inductive tasks and enhance the diversity of node features. Meanwhile, since existing feature imputation methods for graphs are evaluated only by simulating missing features through manual masking, we collect a new dataset called Sailing from voyage records, which contains naturally missing features, to help better evaluate effectiveness. Extensive experiments conducted on six public datasets and Sailing show that DDFI outperforms the state-of-the-art methods under both transductive and inductive settings.
摘要：不完整的节点特征在现实场景中普遍存在，例如，网络用户的属性可能部分是私有的，这导致图神经网络（GNN）的性能显着下降。特征传播（FP）是一种众所周知的方法，在图上缺失节点特征的插补方面表现良好，但它仍然存在以下三个问题：1）它难以处理未完全连接的图，2）插补特征面临过度平滑问题，3）FP是为传导任务量身定制的，忽略了归纳任务中的特征分布变化。为了应对这些挑战，我们引入了 DDFI，这是一种多样化和分布感知的缺失特征插补方法，它以一种不平凡的方式将特征传播与基于图的掩码自动编码器 (MAE) 结合起来。它首先设计了一种简单而有效的算法，即联合标签链接（CLL），该算法随机连接训练集中具有相同标签的节点，以增强具有大量连接组件的图的性能。然后，我们在推理阶段开发了一种新颖的两步表示生成过程。具体来说，DDFI 不是在推理过程中直接使用 FP 估算的特征作为输入，而是通过整个 MAE 进一步重构特征，以减少归纳任务中的特征分布偏移并增强节点特征的多样性。同时，由于现有的图特征插补方法仅通过手动屏蔽特征来模拟缺失场景来进行评估，因此我们从航次记录中收集了一个名为 Sailing 的新数据集，其中包含自然缺失的特征，以帮助更好地评估有效性。对六个公共数据集和 Sailing 进行的大量实验表明，DDFI 在转导和归纳设置下均优于最先进的方法。
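
Note on feature propagation: the baseline that DDFI builds on can be pictured as repeated neighbour averaging in which observed features are reset after every step. The sketch below is a simplified illustration with scalar features and plain mean aggregation; the published FP method uses a normalized adjacency matrix, and the function name and dictionary-based graph format here are assumptions. It also shows the first issue listed in the abstract: a connected component with no observed features never receives any signal.

```python
from typing import Dict, List, Optional


def propagate_features(
    neighbors: Dict[int, List[int]],
    features: Dict[int, Optional[float]],
    num_iters: int = 40,
) -> Dict[int, float]:
    """Fill missing scalar features (None) by repeated neighbour averaging.

    Every neighbour id must also appear as a key in `features`.
    Observed features are reset to their original values after each step,
    so only the missing entries change.
    """
    known = {n: v for n, v in features.items() if v is not None}
    current = {n: (v if v is not None else 0.0) for n, v in features.items()}
    for _ in range(num_iters):
        updated = {}
        for node in current:
            nbrs = neighbors.get(node, [])
            if nbrs:
                updated[node] = sum(current[m] for m in nbrs) / len(nbrs)
            else:
                updated[node] = current[node]
        updated.update(known)  # clamp observed features
        current = updated
    return current


# Path graph 0-1-2 with node 1 missing, plus an isolated pair 3-4 with no observed features.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
feats = {0: 1.0, 1: None, 2: 3.0, 3: None, 4: None}
imputed = propagate_features(graph, feats)
```

Node 1 ends up at the average of its two observed neighbours, while nodes 3 and 4 stay at the zero initialisation because nothing observed can reach them.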

Title: Rectifying Latent Space for Generative Single-Image Reflection Removal

Authors: Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06358
Pdf URL: https://arxiv.org/pdf/2512.06358
Copy Paste: [[2512.06358]] Rectifying Latent Space for Generative Single-Image Reflection Removal(https://arxiv.org/abs/2512.06358)
Keywords: generative
Abstract: Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.
摘要：单图像反射去除是一个非常不适定的问题，现有方法很难推理损坏区域的组成，导致它们无法在野外进行恢复和泛化。这项工作重新构建了用于编辑的潜在扩散模型，以有效地感知和处理高度模糊的分层图像输入，从而产生高质量的输出。我们认为这种转换的挑战源于一个关键但被忽视的问题，即语义编码器的潜在空间缺乏将合成图像解释为其组成层的线性叠加的固有结构。我们的方法建立在三个协同组件的基础上，包括将潜在空间与反射形成的线性物理对齐的反射等变 VAE、用于绕过模糊语言的精确指导的可学习的特定于任务的文本嵌入，以及利用生成随机性获得有希望的结果的深度引导的早期分支采样策略。大量实验表明，我们的模型在多个基准上实现了新的 SOTA 性能，并且可以很好地推广到具有挑战性的现实案例。

Title: Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection

Authors: Jiabao Guo, Yadian Wang, Hui Ma, Yuhao Fu, Ju Jia, Hui Liu, Shengeng Tang, Lechao Cheng, Yunfeng Diao, Ajian Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06363
Pdf URL: https://arxiv.org/pdf/2512.06363
Copy Paste: [[2512.06363]] Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection(https://arxiv.org/abs/2512.06363)
Keywords: generation
Abstract: Real-world face recognition systems are vulnerable to both physical presentation attacks (PAs) and digital forgery attacks (DFs). We aim to achieve comprehensive protection of biometric data by implementing a unified physical-digital defense framework with advanced detection. Existing approaches primarily employ CLIP with regularization constraints to enhance model generalization across both tasks. However, these methods suffer from conflicting optimization directions between physical and digital attack detection under the same category prompt spaces. To overcome this limitation, we propose a Spoofing-aware Prompt Learning for Unified Attack Detection (SPL-UAD) framework, which decouples optimization branches for physical and digital attacks in the prompt space. Specifically, we construct a learnable parallel prompt branch enhanced with adaptive Spoofing Context Prompt Generation, enabling independent control of optimization for each attack type. Furthermore, we design a Cues-awareness Augmentation that leverages the dual-prompt mechanism to generate challenging sample mining tasks on data, significantly enhancing the model's robustness against unseen attack types. Extensive experiments on the large-scale UniAttackDataPlus dataset demonstrate that the proposed method achieves significant performance improvements in unified attack detection tasks.
摘要：现实世界中的人脸识别系统容易受到物理呈现攻击 (PA) 和数字伪造攻击 (DF) 的影响。我们的目标是通过实施具有先进检测功能的统一物理数字防御框架，实现对生物识别数据的全面保护。现有方法主要采用具有正则化约束的 CLIP 来增强跨这两个任务的模型泛化。然而，这些方法在同一类别提示空间下，物理攻击检测和数字攻击检测之间的优化方向存在冲突。为了克服这一限制，我们提出了一种用于统一攻击检测的欺骗感知提示学习（SPL-UAD）框架，该框架在提示空间中解耦物理和数字攻击的优化分支。具体来说，我们构建了一个可学习的并行提示分支，并通过自适应欺骗上下文提示生成来增强，从而能够独立控制每种攻击类型的优化。此外，我们设计了一种线索感知增强，利用双提示机制生成具有挑战性的数据样本挖掘任务，显着增强模型针对看不见的攻击类型的鲁棒性。在大规模 UniAttackDataPlus 数据集上的大量实验表明，所提出的方法在统一攻击检测任务中实现了显着的性能提升。

Title: Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework

Authors: Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06376
Pdf URL: https://arxiv.org/pdf/2512.06376
Copy Paste: [[2512.06376]] Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework(https://arxiv.org/abs/2512.06376)
Keywords: generation, quality assessment
Abstract: Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV-Bench, a driving-focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving-aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision-Language Model (VLM)-guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real-world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large-scale video generation in future AD pipelines.
摘要：最近的文本到视频模型已经能够根据自然语言提示生成高分辨率驾驶场景。这些人工智能生成的驾驶视频 (AIGV) 为自动驾驶 (AD) 的真实或模拟器数据提供了一种低成本、可扩展的替代方案。但一个关键问题仍然存在：此类视频能否可靠地支持 AD 模型的训练和评估？我们提出了一个系统研究这个问题的诊断框架。首先，我们介绍了常见 AIGV 故障模式的分类，包括视觉伪影、物理上不可信的运动和违反交通语义，并证明了它们对对象检测、跟踪和实例分割的负面影响。为了支持这一分析，我们构建了 ADGV-Bench，这是一个以驾驶为中心的基准，具有人类质量注释和用于多种感知任务的密集标签。然后，我们提出 ADGVE，一种驾驶感知评估器，它将静态语义、时间线索、车道服从信号和视觉语言模型 (VLM) 引导推理结合到每个剪辑的单个质量分数中。实验表明，盲目添加原始 AIGV 会降低感知性能，而使用 ADGVE 对其进行过滤可以持续改善一般视频质量评估指标和下游 AD 模型，并将 AIGV 变成对现实世界数据的有益补充。我们的研究强调了 AIGV 的风险和前景，并提供了在未来 AD 管道中安全利用大规模视频生成的实用工具。

Title: Rethinking Training Dynamics in Scale-wise Autoregressive Generation

Authors: Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.06421
Pdf URL: https://arxiv.org/pdf/2512.06421
Copy Paste: [[2512.06421]] Rethinking Training Dynamics in Scale-wise Autoregressive Generation(https://arxiv.org/abs/2512.06421)
Keywords: generation, generative
Abstract: Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.
摘要：自回归（AR）生成模型的最新进展已经产生了越来越强大的媒体合成系统。其中，下一尺度预测已成为一种流行的范例，其中模型以从粗到细的方式生成图像。然而，按比例缩放的 AR 模型存在曝光偏差，从而降低了生成质量。我们确定了此问题的两个主要原因：（1）训练测试不匹配，模型在推理过程中必须依赖于其自身的不完美预测；（2）规模学习难度不平衡，其中某些规模表现出不成比例的更高优化复杂性。通过对训练动态的全面分析，我们提出自回归细化（SAR）来解决这些限制。 SAR 引入了交错规模推出 (SSR) 机制，该机制执行轻量级自回归推出，使模型暴露于其自身的中间预测，从而调整训练测试模式，并引入了补充的对比学生强迫损失 (CSFL)，为自生成的上下文提供充分的监督，以确保稳定的训练。实验结果表明，将 SAR 应用于预训练的 AR 模型能够以最小的计算开销持续提高生成质量。例如，在 ImageNet 256 上训练的 FlexVAR-d16 上，SAR 在 10 个 epoch 内（在 32xA100 GPU 上为 5 小时）使 FID 减少了 5.2%。鉴于其效率、可扩展性和有效性，我们期望 SAR 能够作为视觉自回归生成的可靠的训练后方法。

Title: DragMesh: Interactive 3D Generation Made Easy

Authors: Tianshan Zhang, Zeyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06424
Pdf URL: https://arxiv.org/pdf/2512.06424
Copy Paste: [[2512.06424]] DragMesh: Interactive 3D Generation Made Easy(https://arxiv.org/abs/2512.06424)
Keywords: generation, generative
Abstract: While generative models have excelled at creating static 3D content, the pursuit of systems that understand how objects move and respond to interactions remains a fundamental challenge. Current methods for articulated motion lie at a crossroads: they are either physically consistent but too slow for real-time use, or generative but violate basic kinematic constraints. We present DragMesh, a robust framework for real-time interactive 3D articulation built around a lightweight motion generation core. Our core contribution is a novel decoupled kinematic reasoning and motion generation framework. First, we infer the latent joint parameters by decoupling semantic intent reasoning (which determines the joint type) from geometric regression (which determines the axis and origin using our Kinematics Prediction Network (KPP-Net)). Second, to leverage the compact, continuous, and singularity-free properties of dual quaternions for representing rigid body motion, we develop a novel Dual Quaternion VAE (DQ-VAE). This DQ-VAE receives these predicted priors, along with the original user drag, to generate a complete, plausible motion trajectory. To ensure strict adherence to kinematics, we inject the joint priors at every layer of the DQ-VAE's non-autoregressive Transformer decoder using FiLM (Feature-wise Linear Modulation) conditioning. This persistent, multi-scale guidance is complemented by a numerically-stable cross-product loss to guarantee axis alignment. This decoupled design allows DragMesh to achieve real-time performance and enables plausible, generative articulation on novel objects without retraining, offering a practical step toward generative 3D intelligence. Code: this https URL. Website: this https URL.
摘要：虽然生成模型在创建静态 3D 内容方面表现出色，但追求能够理解对象如何移动和响应交互的系统仍然是一个基本挑战。当前的关节运动方法正处于十字路口：它们要么是物理上一致的，但对于实时使用来说太慢，要么是生成式的，但违反了基本的运动学约束。我们推出了 DragMesh，这是一个围绕轻量级运动生成核心构建的实时交互式 3D 关节的强大框架。我们的核心贡献是一种新颖的解耦运动学推理和运动生成框架。首先，我们通过将语义意图推理（确定关节类型）与几何回归（使用我们的运动学预测网络（KPP-Net）确定轴和原点）解耦来推断潜在关节参数。其次，为了利用双四元数的紧凑、连续和无奇点特性来表示刚体运动，我们开发了一种新颖的双四元数 VAE (DQ-VAE)。该 DQ-VAE 接收这些预测先验以及原始用户拖动，以生成完整的、合理的运动轨迹。为了确保严格遵守运动学，我们使用 FiLM（特征线性调制）调节在 DQ-VAE 的非自回归 Transformer 解码器的每一层注入联合先验。这种持久的多尺度指导辅以数值稳定的叉积损失，以保证轴对齐。这种解耦设计使 DragMesh 能够实现实时性能，并无需重新训练即可对新对象进行合理的生成式接合，从而为生成 3D 智能迈出了切实可行的一步。代码：此 https URL。网站：此 https URL。
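
Note on FiLM conditioning: Feature-wise Linear Modulation scales and shifts each feature channel with values predicted from a conditioning vector, which in the abstract's setting would be the joint priors. The sketch below shows the operation for a single feature vector using plain dense layers. The weight layout, the function names, and the toy numbers are illustrative assumptions and say nothing about the actual DQ-VAE decoder.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Dense layer; `weights` has one row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Sequence[float], cond: Sequence[float],
         w_gamma: Matrix, b_gamma: Sequence[float],
         w_beta: Matrix, b_beta: Sequence[float]) -> List[float]:
    """FiLM: out[c] = gamma[c] * features[c] + beta[c], with gamma and beta
    predicted from the conditioning vector."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * h + b for g, h, b in zip(gamma, features, beta)]


# Toy example: two feature channels, two-dimensional conditioning vector.
out = film(
    features=[1.0, 2.0],
    cond=[1.0, 0.0],
    w_gamma=[[2.0, 0.0], [0.0, 3.0]], b_gamma=[0.0, 1.0],
    w_beta=[[1.0, 1.0], [0.0, 0.0]], b_beta=[0.0, 0.5],
)
```

Applying the same modulation at every decoder layer, as the abstract describes, keeps the conditioning signal present throughout the network instead of only at the input.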

Title: AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars

Authors: Ramazan Fazylov, Sergey Zagoruyko, Aleksandr Parkin, Stamatis Lefkimmiatis, Ivan Laptev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06438
Pdf URL: https://arxiv.org/pdf/2512.06438
Copy Paste: [[2512.06438]] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars(https://arxiv.org/abs/2512.06438)
Keywords: generation, generative
Abstract: The generation of high-fidelity, animatable 3D human avatars remains a core challenge in computer graphics and vision, with applications in VR, telepresence, and entertainment. Existing approaches based on implicit representations like NeRFs suffer from slow rendering and dynamic inconsistencies, while 3D Gaussian Splatting (3DGS) methods are typically limited to static head generation, lacking dynamic control. We bridge this gap by introducing AGORA, a novel framework that extends 3DGS within a generative adversarial network to produce animatable avatars. Our key contribution is a lightweight, FLAME-conditioned deformation branch that predicts per-Gaussian residuals, enabling identity-preserving, fine-grained expression control while allowing real-time inference. Expression fidelity is enforced via a dual-discriminator training scheme leveraging synthetic renderings of the parametric mesh. AGORA generates avatars that are not only visually realistic but also precisely controllable. Quantitatively, we outperform state-of-the-art NeRF-based methods on expression accuracy while rendering at 250+ FPS on a single GPU, and, notably, at ~9 FPS under CPU-only inference, which represents, to our knowledge, the first demonstration of practical CPU-only animatable 3DGS avatar synthesis. This work represents a significant step toward practical, high-performance digital humans. Project website: this https URL
摘要：高保真、可动画的 3D 人体头像的生成仍然是计算机图形和视觉领域的核心挑战，并应用于 VR、远程呈现和娱乐领域。基于 NeRF 等隐式表示的现有方法存在渲染速度慢和动态不一致的问题，而 3D 高斯泼溅 (3DGS) 方法通常仅限于静态头部生成，缺乏动态控制。我们通过引入 AGORA 来弥补这一差距，AGORA 是一种新颖的框架，可在生成对抗网络中扩展 3DGS，以生成可动画化身。我们的关键贡献是一个轻量级的 FLAME 条件变形分支，它可以预测每高斯残差，从而实现身份保留、细粒度的表达控制，同时允许实时推理。表达保真度是通过利用参数化网格的合成渲染的双鉴别器训练方案来强制执行的。 AGORA 生成的头像不仅视觉逼真，而且可精确控制。从数量上讲，我们在表达准确性方面优于最先进的基于 NeRF 的方法，同时在单个 GPU 上以 250+ FPS 进行渲染，特别是在仅 CPU 推理下以 $\sim$9 FPS 进行渲染 - 据我们所知，这代表了实用的仅 CPU 动画 3DGS 头像合成的首次演示。这项工作代表了向实用、高性能数字人类迈出的重要一步。项目网站：这个https URL

Title: Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction

Authors: Kush Revankar, Shreyas Deshpande, Araham Sayeed, Ansh Tandale, Sarika Bobde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06485
Pdf URL: https://arxiv.org/pdf/2512.06485
Copy Paste: [[2512.06485]] Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction(https://arxiv.org/abs/2512.06485)
Keywords: generation
Abstract: Communication between deaf users, visually impaired users, and the general hearing population often relies on tools that support only one direction of interaction. To address this limitation, this work presents Sanvaad, a lightweight multimodal accessibility framework designed to support real-time, two-way communication. For deaf users, Sanvaad includes an ISL recognition module built on MediaPipe landmarks. MediaPipe is chosen primarily for its efficiency and low computational load, enabling the system to run smoothly on edge devices without requiring dedicated hardware. Spoken input from a phone can also be translated into sign representations through a voice-to-sign component that maps detected speech to predefined phrases and produces corresponding GIFs or alphabet-based visualizations. For visually impaired users, the framework provides a screen-free voice interface that integrates multilingual speech recognition, text summarization, and text-to-speech generation. These components work together through a Streamlit-based interface, making the system usable in both desktop and mobile environments. Overall, Sanvaad aims to offer a practical and accessible pathway for inclusive communication by combining lightweight computer vision and speech processing tools within a unified framework.
摘要：聋哑用户、视障用户和一般听力人群之间的通信通常依赖于仅支持一种交互方向的工具。为了解决这一限制，这项工作提出了 Sanvaad，这是一个轻量级多模式可访问性框架，旨在支持实时双向通信。对于聋哑用户，Sanvaad 包括一个基于 MediaPipe 地标构建的 ISL 识别模块。选择 MediaPipe 主要是因为它的效率和低计算负载，使系统能够在边缘设备上平稳运行，而不需要专用硬件。来自电话的语音输入还可以通过语音到手语组件转换为手语表示，该组件将检测到的语音映射到预定义的短语，并生成相应的 GIF 或基于字母的可视化效果。对于视障用户，该框架提供了一个无屏幕语音界面，集成了多语言语音识别、文本摘要和文本转语音生成。这些组件通过基于 Streamlit 的界面协同工作，使系统可在桌面和移动环境中使用。总体而言，Sanvaad 的目标是通过在统一框架内结合轻量级计算机视觉和语音处理工具，为包容性沟通提供实用且易于访问的途径。

Title: Method of UAV Inspection of Photovoltaic Modules Using Thermal and RGB Data Fusion

Authors: Andrii Lysyi, Anatoliy Sachenko, Pavlo Radiuk, Mykola Lysyi, Oleksandr Melnychenko, Diana Zahorodnia
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2512.06504
Pdf URL: https://arxiv.org/pdf/2512.06504
Copy Paste: [[2512.06504]] Method of UAV Inspection of Photovoltaic Modules Using Thermal and RGB Data Fusion(https://arxiv.org/abs/2512.06504)
Keywords: generation
Abstract: The subject of this research is the development of an intelligent, integrated framework for the automated inspection of photovoltaic (PV) infrastructure that addresses the critical shortcomings of conventional methods, including thermal palette bias, data redundancy, and high communication bandwidth requirements. The goal of this study is to design, develop, and validate a comprehensive, multi-modal system that fully automates the monitoring workflow, from data acquisition to the generation of actionable, geo-located maintenance alerts, thereby enhancing plant safety and operational efficiency. The methods employed involve a synergistic architecture that begins with a palette-invariant thermal embedding, learned by enforcing representational consistency, which is fused with a contrast-normalized RGB stream via a gated mechanism. This is supplemented by a closed-loop, adaptive re-acquisition controller that uses Rodrigues-based updates for targeted confirmation of ambiguous anomalies and a geospatial deduplication module that clusters redundant alerts using DBSCAN over the haversine distance. In conclusion, this study establishes a powerful new paradigm for proactive PV inspection, with the proposed system achieving a mean Average Precision (mAP@0.5) of 0.903 on the public PVF-10 benchmark, a significant 12-15% improvement over single-modality baselines. Field validation confirmed the system's readiness, achieving 96% recall, while the de-duplication process reduced duplicate-induced false positives by 15-20%, and relevance-only telemetry cut airborne data transmission by 60-70%.
摘要：本研究的主题是开发一种用于光伏 (PV) 基础设施自动检测的智能集成框架，解决传统方法的关键缺点，包括热调色板偏差、数据冗余和高通信带宽要求。本研究的目标是设计、开发和验证一个全面的多模式系统，该系统完全自动化监控工作流程，从数据采集到生成可操作的地理定位维护警报，从而提高工厂安全性和运营效率。所采用的方法涉及协同架构，该架构从调色板不变的热嵌入开始，通过强制表示一致性来学习，并通过门控机制与对比度归一化的 RGB 流融合。闭环自适应重新采集控制器使用基于 Rodrigues 的更新来有针对性地确认模糊异常，还补充了地理空间重复数据删除模块，该模块使用 DBSCAN 在半正弦距离上对冗余警报进行集群。总之，本研究为主动 PV 检查建立了一个强大的新范例，所提出的系统在公共 PVF-10 基准上实现了 0.903 的平均精度 (mAP@0.5)，比单模态基线显着提高了 12-15%。现场验证确认了系统的就绪性，实现了 96% 的召回率，而重复数据删除过程将重复引起的误报减少了 15-20%，仅相关性遥测将机载数据传输减少了 60-70%。
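
Note on geospatial deduplication: clustering alerts with DBSCAN over the haversine distance groups detections that lie within a few metres of each other on the ground. The following is a minimal, dependency-free sketch of that idea. The brute-force neighbour search, the function names, and the example coordinates and parameters (`eps_m=5.0`, `min_pts=2`) are assumptions for illustration, not the paper's settings.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def dbscan(points: List[Tuple[float, float]], eps_m: float, min_pts: int) -> List[Optional[int]]:
    """Plain O(n^2) DBSCAN; returns one cluster id per point, -1 for noise."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means unvisited

    def region(i: int) -> List[int]:
        # All points within eps_m of point i, including i itself.
        return [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may be claimed as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


# Three alerts within about a metre of each other, and one roughly a kilometre away.
alerts = [(50.0, 30.0), (50.00001, 30.0), (50.0, 30.00001), (50.01, 30.0)]
labels = dbscan(alerts, eps_m=5.0, min_pts=2)
```

Keeping one representative alert per non-noise cluster would then collapse repeated detections of the same module into a single maintenance alert.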

Title: Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning

Authors: Ming Chen, Sheng Tang, Rong-Xi Tan, Ziniu Li, Jiacheng Chen, Ke Xue, Chao Qian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06533
Pdf URL: https://arxiv.org/pdf/2512.06533
Copy Paste: [[2512.06533]] Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning(https://arxiv.org/abs/2512.06533)
Keywords: generation
Abstract: Decoding-based regression, which reformulates regression as a sequence generation task, has emerged as a promising paradigm of applying large language models for numerical prediction. However, its progress is hindered by the misalignment between discrete token-level objectives (e.g., cross-entropy) and continuous numerical values. Existing approaches relying on token-level constraints often fail to capture the global magnitude of the target value, limiting their precision and generalization. In this paper, we propose to unlock the potential of decoding-based regression via Reinforcement Learning (RL). We formulate the generation process as a Markov Decision Process, utilizing sequence-level rewards to enforce global numerical coherence. Extensive experiments on tabular regression and code metric regression demonstrate that our method (specifically with ReMax and GRPO) consistently outperforms both state-of-the-art token-level baselines and traditional regression heads, showing the superiority of introducing sequence-level signals. Our analysis further reveals that RL significantly enhances sampling efficiency and predictive precision, establishing decoding-based regression as a robust and accurate paradigm for general-purpose numerical prediction.
摘要：基于解码的回归将回归重新表述为序列生成任务，已成为应用大型语言模型进行数值预测的有前途的范例。然而，离散代币级目标（例如交叉熵）与连续数值之间的不一致阻碍了其进展。依赖于令牌级约束的现有方法通常无法捕获目标值的全局大小，从而限制了其精度和泛化性。在本文中，我们建议通过强化学习（RL）释放基于解码的回归的潜力。我们将生成过程表述为马尔可夫决策过程，利用序列级奖励来强制全局数值一致性。对表格回归和代码度量回归的大量实验表明，我们的方法（特别是使用 ReMax 和 GRPO）始终优于最先进的标记级基线和传统回归头，显示了引入序列级信号的优越性。我们的分析进一步表明，强化学习显着提高了采样效率和预测精度，将基于解码的回归建立为通用数值预测的稳健且准确的范例。
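
Note on sequence-level rewards: the mismatch described above is that cross-entropy scores each token independently, so "9.9" and "10.0" share no tokens even though the numbers are close. The sketch below scores the fully decoded number instead and then standardises rewards within a sampled group, as GRPO-style methods do. The negative absolute error reward, the invalid-sequence penalty, and the function names are assumptions; the abstract does not specify the reward actually used.

```python
import statistics
from typing import List


def decode_number(tokens: List[str]) -> float:
    """Turn a digit-token sequence such as ['1', '2', '.', '5'] into 12.5."""
    return float("".join(tokens))


def sequence_reward(tokens: List[str], target: float) -> float:
    """Assumed reward: negative absolute error of the fully decoded number."""
    try:
        return -abs(decode_number(tokens) - target)
    except ValueError:
        return -1e6  # hypothetical penalty for sequences that are not valid numbers


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardise rewards within one sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled decodings for a target of 10.0.
samples = [["9", ".", "9"], ["1", "0", ".", "9"], ["1", "9", ".", "0"]]
rewards = [sequence_reward(s, 10.0) for s in samples]
advantages = group_relative_advantages(rewards)
```

In this example the "9.9" sample receives the highest advantage although it shares no tokens with "10.0", while "19.0", which matches the leading digit, receives the lowest.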

Title: SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities

Authors: Dung Thuy Nguyen, Quang Nguyen, Preston K. Robinette, Eli Jiang, Taylor T. Johnson, Kevin Leach
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06562
Pdf URL: https://arxiv.org/pdf/2512.06562
Copy Paste: [[2512.06562]] SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities(https://arxiv.org/abs/2512.06562)
Keywords: generative
Abstract: Recent advances in 3D-aware generative models have enabled high-fidelity image synthesis of human identities. However, this progress raises urgent questions around user consent and the ability to remove specific individuals from a model's output space. We address this by introducing SUGAR, a framework for scalable generative unlearning that enables the removal of many identities (simultaneously or sequentially) without retraining the entire model. Rather than projecting unwanted identities to unrealistic outputs or relying on static template faces, SUGAR learns a personalized surrogate latent for each identity, diverting reconstructions to visually coherent alternatives while preserving the model's quality and diversity. We further introduce a continual utility preservation objective that guards against degradation as more identities are forgotten. SUGAR achieves state-of-the-art performance in removing up to 200 identities, while delivering up to a 700% improvement in retention utility compared to existing baselines. Our code is publicly available at this https URL.

Title: A Fast and Effective Solution to the Problem of Look-ahead Bias in LLMs

Authors: Humzah Merchant, Bradford Levy
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.06607
Pdf URL: https://arxiv.org/pdf/2512.06607
Copy Paste: [[2512.06607]] A Fast and Effective Solution to the Problem of Look-ahead Bias in LLMs(https://arxiv.org/abs/2512.06607)
Keywords: generation
Abstract: Applying LLMs to predictive tasks in finance is challenging due to look-ahead bias resulting from their training on long time-series data. This precludes the backtests typically employed in finance since retraining frontier models from scratch with a specific knowledge cutoff is prohibitive. In this paper, we introduce a fast, effective, and low-cost alternative. Our method guides generation at inference time by adjusting the logits of a large base model using a pair of smaller, specialized models -- one fine-tuned on information to be forgotten and another on information to be retained. We demonstrate that our method effectively removes both verbatim and semantic knowledge, corrects biases, and outperforms prior methods.
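The inference-time logit adjustment described in this abstract can be sketched as follows. This is a hedged illustration only: the abstract does not give the exact combination rule, so the linear form `base + alpha * (retain - forget)`, the `alpha` parameter, and the toy logits are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_logits(base, retain, forget, alpha=1.0):
    """Assumed rule: push the base model toward the 'retain' expert and
    away from the 'forget' expert, token by token."""
    return [b + alpha * (r - f) for b, r, f in zip(base, retain, forget)]


vocab = ["up", "down", "flat"]
base_logits = [2.0, 1.0, 0.5]      # large base model (made-up values)
retain_logits = [0.2, 0.3, 0.1]    # small model tuned on information to keep
forget_logits = [1.5, 0.1, 0.1]    # small model tuned on information to forget

adjusted_logits = adjust_logits(base_logits, retain_logits, forget_logits)
base_probs = softmax(base_logits)
adjusted_probs = softmax(adjusted_logits)

for tok, p0, p1 in zip(vocab, base_probs, adjusted_probs):
    print(f"{tok}: base={p0:.3f} adjusted={p1:.3f}")
```

Here the token that the "forget" model favors strongly ("up") loses probability mass after adjustment, which is the qualitative behavior the abstract describes.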

Title: Masked Autoencoder Pretraining on Strong-Lensing Images for Joint Dark-Matter Model Classification and Super-Resolution

Authors: Achmad Ardani Prasha, Clavino Ourizqi Rachmadi, Muhamad Fauzan Ibnu Syahlan, Naufal Rahfi Anugerah, Nanda Garin Raditya, Putri Amelia, Sabrina Laila Mutiara, Hilman Syachr Ramadhan
Subjects: cs.CV, astro-ph.CO, astro-ph.IM, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.06642
Pdf URL: https://arxiv.org/pdf/2512.06642
Copy Paste: [[2512.06642]] Masked Autoencoder Pretraining on Strong-Lensing Images for Joint Dark-Matter Model Classification and Super-Resolution(https://arxiv.org/abs/2512.06642)
Keywords: super-resolution
Abstract: Strong gravitational lensing can reveal the influence of dark-matter substructure in galaxies, but analyzing these effects from noisy, low-resolution images poses a significant challenge. In this work, we propose a masked autoencoder (MAE) pretraining strategy on simulated strong-lensing images from the DeepLense ML4SCI benchmark to learn generalizable representations for two downstream tasks: (i) classifying the underlying dark matter model (cold dark matter, axion-like, or no substructure) and (ii) enhancing low-resolution lensed images via super-resolution. We pretrain a Vision Transformer encoder using a masked image modeling objective, then fine-tune the encoder separately for each task. Our results show that MAE pretraining, when combined with appropriate mask ratio tuning, yields a shared encoder that matches or exceeds a ViT trained from scratch. Specifically, at a 90% mask ratio, the fine-tuned classifier achieves macro AUC of 0.968 and accuracy of 88.65%, compared to the scratch baseline (AUC 0.957, accuracy 82.46%). For super-resolution (16x16 to 64x64), the MAE-pretrained model reconstructs images with PSNR ~33 dB and SSIM 0.961, modestly improving over scratch training. We ablate the MAE mask ratio, revealing a consistent trade-off: higher mask ratios improve classification but slightly degrade reconstruction fidelity. Our findings demonstrate that MAE pretraining on physics-rich simulations provides a flexible, reusable encoder for multiple strong-lensing analysis tasks.
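For readers unfamiliar with masked image modeling, the following is a minimal sketch of random patch masking at the 90% mask ratio mentioned in this abstract. The patch count (64) and the function name are assumptions chosen for illustration; the abstract does not state the patch size used.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible and masked sets.

    The encoder would see only the visible patches; the decoder would be
    trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed setup: an image divided into 64 patches, 90% of which are hidden.
visible, masked = random_patch_mask(num_patches=64, mask_ratio=0.9)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

At this ratio the encoder processes only a handful of patches per image, which is why higher mask ratios make pretraining harder but cheaper per step.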

Title: GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

Authors: Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06655
Pdf URL: https://arxiv.org/pdf/2512.06655
Copy Paste: [[2512.06655]] GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering(https://arxiv.org/abs/2512.06655)
Keywords: generation
Abstract: Large language models (LLMs) face critical safety challenges, as they can be manipulated to generate harmful content through adversarial prompts and jailbreak attacks. Defenses are typically either black-box guardrails that filter outputs, or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While effective for simple concepts, this assumption is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), which extend SAEs with a Laplacian smoothness penalty on the neuron co-activation graph. Unlike standard SAEs that assign each concept to a single latent feature, GSAEs recover smooth, distributed safety representations as coherent patterns spanning multiple features. We empirically demonstrate that GSAE enables effective runtime safety steering, assembling features into a weighted set of safety-relevant directions and controlling them with a two-stage gating mechanism that activates interventions only when harmful prompts or continuations are detected during generation. This approach enforces refusals adaptively while preserving utility on benign queries. Across safety and QA benchmarks, GSAE steering achieves an average 82% selective refusal rate, substantially outperforming standard SAE steering (42%), while maintaining strong task accuracy (70% on TriviaQA, 65% on TruthfulQA, 74% on GSM8K). Robustness experiments further show generalization across LLaMA-3, Mistral, Qwen, and Phi families and resilience against jailbreak attacks (GCG, AutoDAN), consistently maintaining >= 90% refusal of harmful content.
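A Laplacian smoothness penalty of the kind named in this abstract is commonly written as the quadratic form z^T L z with L = D - A. The sketch below shows that form on a tiny graph; which quantity GSAE actually penalizes and how the co-activation graph is built are not stated in the abstract, so the adjacency matrix and the signal `z` here are purely illustrative assumptions.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix with zero diagonal."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(lap, z):
    """Smoothness penalty z^T L z."""
    n = len(z)
    return sum(z[i] * lap[i][j] * z[j] for i in range(n) for j in range(n))


def pairwise_penalty(adj, z):
    """Equivalent form: 0.5 * sum_ij A_ij * (z_i - z_j)^2."""
    n = len(z)
    return 0.5 * sum(adj[i][j] * (z[i] - z[j]) ** 2 for i in range(n) for j in range(n))


# Toy co-activation graph: a path 0 - 1 - 2 (assumed for illustration).
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
lap = laplacian(adj)

smooth = [1.0, 1.0, 1.0]   # neighbors agree, so the penalty is zero
rough = [1.0, -1.0, 1.0]   # neighbors disagree, so the penalty is large

print(quadratic_form(lap, smooth), quadratic_form(lap, rough))
```

The penalty is small when connected nodes carry similar values, which is what encourages representations to spread smoothly across related features rather than concentrate in one.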

Title: Personalized Image Descriptions from Attention Sequences

Authors: Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06662
Pdf URL: https://arxiv.org/pdf/2512.06662
Copy Paste: [[2512.06662]] Personalized Image Descriptions from Attention Sequences(https://arxiv.org/abs/2512.06662)
Keywords: generation
Abstract: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.

Title: Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods

Authors: Panagiota Kiourti, Anu Singh, Preeti Duraipandian, Weichao Zhou, Wenchao Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.06665
Pdf URL: https://arxiv.org/pdf/2512.06665
Copy Paste: [[2512.06665]] Rethinking Robustness: A New Approach to Evaluating Feature Attribution Methods(https://arxiv.org/abs/2512.06665)
Keywords: generative
Abstract: This paper studies the robustness of feature attribution methods for deep neural networks. It challenges the current notion of attributional robustness that largely ignores the difference in the model's outputs and introduces a new way of evaluating the robustness of attribution methods. Specifically, we propose a new definition of similar inputs, a new robustness metric, and a novel method based on generative adversarial networks to generate these inputs. In addition, we present a comprehensive evaluation with existing metrics and state-of-the-art attribution methods. Our findings highlight the need for a more objective metric that reveals the weaknesses of an attribution method rather than that of the neural network, thus providing a more accurate evaluation of the robustness of attribution methods.

Title: RunawayEvil: Jailbreaking the Image-to-Video Generative Models

Authors: Songping Wang, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, Caifeng Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06674
Pdf URL: https://arxiv.org/pdf/2512.06674
Copy Paste: [[2512.06674]] RunawayEvil: Jailbreaking the Image-to-Video Generative Models(https://arxiv.org/abs/2512.06674)
Keywords: generation, generative
Abstract: Image-to-Video (I2V) generation synthesizes dynamic visual content from image and text inputs, providing significant creative control. However, the security of such multimodal systems, particularly their vulnerability to jailbreak attacks, remains critically underexplored. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreak framework for I2V models with dynamic evolutionary capability. Built on a "Strategy-Tactic-Action" paradigm, our framework exhibits self-amplifying attack through three core components: (1) Strategy-Aware Command Unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based on the selected strategies; (3) Tactical Action Unit that executes and evaluates the multimodal coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. Specifically, RunawayEvil outperforms existing methods by 58.5 to 79 percent on COCO2017. This work provides a critical tool for vulnerability analysis of I2V models, thereby laying a foundation for more robust video generation systems.

Title: GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning

Authors: Shrihari Sridharan, Deepak Ravikumar, Anand Raghunathan, Kaushik Roy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06678
Pdf URL: https://arxiv.org/pdf/2512.06678
Copy Paste: [[2512.06678]] GradientSpace: Unsupervised Data Clustering for Improved Instruction Tuning(https://arxiv.org/abs/2512.06678)
Keywords: generation
Abstract: Instruction tuning is one of the key steps required for adapting large language models (LLMs) to a broad spectrum of downstream applications. However, this procedure is difficult because real-world datasets are rarely homogeneous; they consist of a mixture of diverse information, causing gradient interference, where conflicting gradients pull the model in opposing directions, degrading performance. A common strategy to mitigate this issue is to group data based on semantic or embedding similarity. However, this fails to capture how data influences model parameters during learning. While recent works have attempted to cluster gradients directly, they randomly project gradients into lower dimensions to manage memory, which leads to accuracy loss. Moreover, these methods rely on expert ensembles which necessitates multiple inference passes and expensive on-the-fly gradient computations during inference. To address these limitations, we propose GradientSpace, a framework that clusters samples directly in full-dimensional gradient space. We introduce an online SVD-based algorithm that operates on LoRA gradients to identify latent skills without the infeasible cost of storing all sample gradients. Each cluster is used to train a specialized LoRA expert along with a lightweight router trained to select the best expert during inference. We show that routing to a single, appropriate expert outperforms expert ensembles used in prior work, while significantly reducing inference latency. Our experiments across mathematical reasoning, code generation, finance, and creative writing tasks demonstrate that GradientSpace leads to coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.

Title: Mitigating Barren plateaus in quantum denoising diffusion probabilistic models

Authors: Haipeng Cao, Kaining Zhang, Dacheng Tao, Zhaofeng Su
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2512.06695
Pdf URL: https://arxiv.org/pdf/2512.06695
Copy Paste: [[2512.06695]] Mitigating Barren plateaus in quantum denoising diffusion probabilistic models(https://arxiv.org/abs/2512.06695)
Keywords: generative
Abstract: Quantum generative models leverage quantum superposition and entanglement to enhance learning efficiency for both classical and quantum data. The quantum denoising diffusion probabilistic model (QuDDPM), inspired by its classical counterpart, has been proposed as a promising framework for quantum generative learning. QuDDPM is capable of efficiently learning and generating quantum data, and it demonstrates excellent performance in learning correlated quantum noise models, quantum many-body phases, and the topological structure of quantum data. However, we show that barren plateaus emerge in QuDDPMs due to the use of 2-design states as the input for the denoising process, which severely undermines the performance of QuDDPM. Through theoretical analysis and experimental validation, we confirm the presence of barren plateaus in the original QuDDPM. To address this issue, we introduce an improved QuDDPM that utilizes a distribution maintaining a certain distance from the Haar distribution, ensuring better trainability. Experimental results demonstrate that our approach effectively mitigates the barren plateau problem and generates samples with higher quality, paving the way for scalable and efficient quantum generative learning.

Title: Pathway to $O(\sqrt{d})$ Complexity bound under Wasserstein metric of flow-based models

Authors: Xiangjun Meng, Zhongjian Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.06702
Pdf URL: https://arxiv.org/pdf/2512.06702
Copy Paste: [[2512.06702]] Pathway to $O(\sqrt{d})$ Complexity bound under Wasserstein metric of flow-based models(https://arxiv.org/abs/2512.06702)
Keywords: generative
Abstract: We provide attainable analytical tools to estimate the error of flow-based generative models under the Wasserstein metric and to establish the optimal sampling iteration complexity bound with respect to dimension as $O(\sqrt{d})$. We show this error can be explicitly controlled by two parts: the Lipschitzness of the push-forward maps of the backward flow, which scales independently of the dimension, and a local discretization error that scales as $O(\sqrt{d})$ in the dimension. The former is related to the existence of Lipschitz changes of variables induced by the (heat) flow. The latter depends on the regularity of the score function in both spatial and temporal directions. These assumptions are valid in the flow-based generative model associated with the Föllmer process and $1$-rectified flow under the Gaussian tail assumption. As a consequence, we show that the sampling iteration complexity grows linearly with the square root of the trace of the covariance operator, which is related to the invariant distribution of the forward process.

Title: UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement

Authors: Weiqi Li, Xuanyu Zhang, Bin Chen, Jingfen Xie, Yan Wang, Kexin Zhang, Junlin Li, Li Zhang, Jian Zhang, Shijie Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06750
Pdf URL: https://arxiv.org/pdf/2512.06750
Copy Paste: [[2512.06750]] UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement(https://arxiv.org/abs/2512.06750)
Keywords: restoration, generation, generative, quality assessment
Abstract: Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be available at this https URL.

Title: VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Authors: Wenbo Lyu, Yingjun Du, Jinglin Zhao, Xianton Zhen, Ling Shao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06759
Pdf URL: https://arxiv.org/pdf/2512.06759
Copy Paste: [[2512.06759]] VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors(https://arxiv.org/abs/2512.06759)
Keywords: generation
Abstract: Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All benchmark data and the code for benchmark construction are available for viewing and download via the following link: this https URL

Title: VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

Authors: Yutong Wang, Haiyu Zhang, Tianfan Xue, Yu Qiao, Yaohui Wang, Chang Xu, Xinyuan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06802
Pdf URL: https://arxiv.org/pdf/2512.06802
Copy Paste: [[2512.06802]] VDOT: Efficient Unified Video Creation via Optimal Transport Distillation(https://arxiv.org/abs/2512.06802)
Keywords: generation, generative
Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Rather than relying on Kullback-Leibler (KL) minimization alone, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation in the few-step generation scenario, and thus enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.

Title: Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation

Authors: Agung Nugraha, Heungjun Im, Jihwan Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06813
Pdf URL: https://arxiv.org/pdf/2512.06813
Copy Paste: [[2512.06813]] Partial Inverse Design of High-Performance Concrete Using Cooperative Neural Networks for Constraint-Aware Mix Generation(https://arxiv.org/abs/2512.06813)
Keywords: generation, generative
Abstract: High-performance concrete offers exceptional strength and durability but requires complex mix designs involving many interdependent variables and practical constraints. While data-driven methods have advanced predictive modeling for forward design, inverse design, which focuses on determining mix compositions that achieve target performance, remains limited, particularly in design situations where some mix variables are fixed by constraints and only the remaining variables must be determined. This study proposes a cooperative neural network framework for the partial inverse design of high-performance concrete. The framework combines two coupled neural network models, an imputation model that infers the undetermined variables and a surrogate model that predicts compressive strength. Through cooperative learning, the model generates valid and performance-consistent mix designs in a single forward pass while accommodating different constraint combinations without retraining. Its performance is compared with both probabilistic and generative approaches, including Bayesian inference based on a Gaussian process surrogate and autoencoder-based models. Evaluated on a benchmark dataset, the proposed model achieves stable and higher R-squared values of 0.87-0.92 and reduces mean squared error by an average of 50 percent compared with autoencoder baselines and by an average of 70 percent compared with Bayesian inference. The results demonstrate that the cooperative neural network provides an accurate, robust, and computationally efficient foundation for constraint-aware, data-driven mix proportioning in concrete engineering.

Title: Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection

Authors: Satoshi Hashimoto, Hitoshi Nishimura, Yanan Wang, Mori Kurokawa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06845
Pdf URL: https://arxiv.org/pdf/2512.06845
Copy Paste: [[2512.06845]] Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection(https://arxiv.org/abs/2512.06845)
Keywords: generation
Abstract: Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies with a domain-aligned regularized module that combines domain alignment and memory usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.
摘要：在实践中部署视频异常检测受到真实异常镜头的稀缺性和收集成本的阻碍。我们通过在没有任何真实异常视频的情况下进行训练，同时在标准弱监督分割下进行评估来解决这个问题，并且我们引入了 PA-VAD，这是一种生成驱动的方法，它从与真实正常视频配对的合成伪异常视频中学习检测器，仅使用一小组真实正常图像来驱动合成。为了进行合成，我们使用 CLIP 选择与类别相关的初始图像，并使用视觉语言模型细化文本提示，以在调用视频扩散模型之前提高保真度和场景一致性。对于训练，我们通过结合了域对齐和内存使用感知更新的域对齐正则化模块来减轻合成异常中过多的时空幅度。大量实验表明，我们的方法在 ShanghaiTech 上达到了 98.2%，在 UCF-Crime 上达到了 82.5%，超过了 ShanghaiTech 上最强的真实异常方法 +0.6%，并且在 UCF-Crime 上超过了 UVAD 最先进的方法 +1.9%。结果表明，无需收集真实异常即可获得高精度异常检测，为可扩展部署提供了实用途径。

Title: Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT

Authors: Matan Atad, Alexander W. Marka, Lisa Steinhelfer, Anna Curto-Vilalta, Yannik Leonhardt, Sarah C. Foreman, Anna-Sophia Walburga Dietrich, Robert Graf, Alexandra S. Gersing, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke, Hendrik Möller
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.06849
Pdf URL: https://arxiv.org/pdf/2512.06849
Copy Paste: [[2512.06849]] Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT(https://arxiv.org/abs/2512.06849)
Keywords: generative
Abstract: Accurate segmentation of vertebral metastasis in CT is clinically important yet difficult to scale, as voxel-level annotations are scarce and both lytic and blastic lesions often resemble benign degenerative changes. We introduce a weakly supervised method trained solely on vertebra-level healthy/malignant labels, without any lesion masks. The method combines a Diffusion Autoencoder (DAE) that produces a classifier-guided healthy edit of each vertebra with pixel-wise difference maps that propose candidate lesion regions. To determine which regions truly reflect malignancy, we introduce Hide-and-Seek Attribution: each candidate is revealed in turn while all others are hidden, the edited image is projected back to the data manifold by the DAE, and a latent-space classifier quantifies the isolated malignant contribution of that component. High-scoring regions form the final lytic or blastic segmentation. On held-out radiologist annotations, we achieve strong blastic/lytic performance despite no mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). These results show that vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT.
摘要：CT 中椎体转移的准确分割在临床上很重要，但难以扩展，因为体素级注释很少，而且溶解性和母细胞性病变通常类似于良性退行性改变。我们引入了一种弱监督方法，仅在椎骨级别的健康/恶性标签上进行训练，没有任何病变掩模。该方法结合了扩散自动编码器（DAE），该编码器可对每个椎骨进行分类器引导的健康编辑，并具有提出候选病变区域的像素差异图。为了确定哪些区域真正反映了恶性肿瘤，我们引入了隐藏和寻找归因：依次显示每个候选区域，而隐藏所有其他区域，编辑后的图像由 DAE 投影回数据流形，并且潜在空间分类器量化该组件的孤立恶性贡献。高分区域形成最终的裂解或爆炸分割。在保留的放射科医生注释中，尽管没有面罩监督，我们仍实现了很强的爆炸/裂解性能（F1：0.91/0.85；Dice：0.87/0.78），超过基线（F1：0.79/0.67；Dice：0.74/0.55）。这些结果表明，椎骨级标签可以转化为可靠的病变掩模，证明生成编辑与选择性遮挡相结合支持 CT 中精确的弱监督分割。

Title: Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Authors: Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06864
Pdf URL: https://arxiv.org/pdf/2512.06864
Copy Paste: [[2512.06864]] Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training(https://arxiv.org/abs/2512.06864)
Keywords: generation, quality assessment
Abstract: Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at this https URL.
摘要：由于像素级掩模和时间一致性标签的双重要求，视频实例分割（VIS）面临着重大的标注挑战。虽然最近的无监督方法（例如 VideoCutLER）通过合成数据消除了光流依赖性，但它们仍然受到合成域与真实域差距的限制。我们提出了 AutoQ-VIS，这是一种新颖的无监督框架，它通过质量引导的自我训练来弥补这一差距。我们的方法在伪标签生成和自动质量评估之间建立了一个闭环系统，从而实现从合成视频到真实视频的逐步适应。实验证明，YouTubeVIS-2019 $\texttt{val}$ 集上的性能达到了 52.6 $\text{AP}_{50}$ 的最先进性能，比之前最先进的 VideoCutLER 提高了 4.4%，同时无需人工注释。这证明了无监督 VIS 质量意识自我训练的可行性。我们将在此 https URL 发布代码。

Title: Spatial Retrieval Augmented Autonomous Driving

Authors: Xiaosong Jia, Chenhe Zhang, Yule Jiang, Songbur Wong, Zhiyuan Zhang, Chen Chen, Shaofeng Zhang, Xuanhe Zhou, Xue Yang, Junchi Yan, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06865
Pdf URL: https://arxiv.org/pdf/2512.06865
Copy Paste: [[2512.06865]] Spatial Retrieval Augmented Autonomous Driving(https://arxiv.org/abs/2512.06865)
Keywords: generative
Abstract: Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion, or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.
摘要：现有的自动驾驶系统依靠车载传感器（摄像头、激光雷达、IMU 等）进行环境感知。然而，这种范例受到行驶时间感知视野的限制，并且在有限的视野范围、遮挡或黑暗和下雨等极端条件下通常会失败。相比之下，即使在能见度较差的情况下，人类驾驶员也能够回忆起道路结构。为了赋予模型这种“回忆”能力，我们提出了空间检索范例，引入离线检索的地理图像作为额外的输入。这些图像很容易从离线缓存（例如，谷歌地图或存储的自动驾驶数据集）中获取，而不需要额外的传感器，使其成为现有 AD 任务的即插即用扩展。在实验中，我们首先使用通过谷歌地图 API 检索的地理图像扩展 nuScenes 数据集，并将新数据与自我车辆轨迹对齐。我们建立涵盖五个核心自动驾驶任务的基线：对象检测、在线地图、占用预测、端到端规划和生成世界建模。广泛的实验表明，扩展模式可以提高某些任务的性能，我们将开源数据集管理代码、数据和基准，以进一步研究这种新的自动驾驶范例。

Title: JoPano: Unified Panorama Generation via Joint Modeling

Authors: Wancheng Feng, Chen An, Zhenliang He, Meina Kan, Shiguang Shan, Lukun Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06885
Pdf URL: https://arxiv.org/pdf/2512.06885
Copy Paste: [[2512.06885]] JoPano: Unified Panorama Generation via Joint Modeling(https://arxiv.org/abs/2512.06885)
Keywords: generation, generative
Abstract: Panorama generation has recently attracted growing interest in the research community, with two core tasks, text-to-panorama and view-to-panorama generation. However, existing methods still face two major challenges: their U-Net-based architectures constrain the visual quality of the generated panoramas, and they usually treat the two core tasks independently, which leads to modeling redundancy and inefficiency. To overcome these challenges, we propose a joint-face panorama (JoPano) generation approach that unifies the two core tasks within a DiT-based model. To transfer the rich generative capabilities of existing DiT backbones learned from natural images to the panorama domain, we propose a Joint-Face Adapter built on the cubemap representation of panoramas, which enables a pretrained DiT to jointly model and generate different views of a panorama. We further apply Poisson Blending to reduce seam inconsistencies that often appear at the boundaries between cube faces. Correspondingly, we introduce Seam-SSIM and Seam-Sobel metrics to quantitatively evaluate the seam consistency. Moreover, we propose a condition switching mechanism that unifies text-to-panorama and view-to-panorama tasks within a single model. Comprehensive experiments show that JoPano can generate high-quality panoramas for both text-to-panorama and view-to-panorama generation tasks, achieving state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score metrics.
摘要：全景生成最近吸引了研究界越来越多的兴趣，其两个核心任务是文本到全景和视图到全景生成。然而，现有方法仍然面临两个主要挑战：它们基于 U-Net 的架构限制了生成的全景图的视觉质量，并且它们通常独立处理两个核心任务，这导致建模冗余和低效率。为了克服这些挑战，我们提出了一种联合面部全景（JoPano）生成方法，该方法将基于 DiT 的模型中的两个核心任务统一起来。为了将从自然图像中学习到的现有 DiT 主干的丰富生成能力转移到全景领域，我们提出了一种基于全景图立方体贴图表示的联合面部适配器，它使预训练的 DiT 能够联合建模并生成全景图的不同视图。我们进一步应用泊松混合来减少经常出现在立方体面之间边界处的接缝不一致性。相应地，我们引入Seam-SSIM和Seam-Sobel指标来定量评估接缝一致性。此外，我们提出了一种条件切换机制，将文本到全景和视图到全景任务统一在单个模型中。综合实验表明，JoPano 可以为文本到全景和视图到全景生成任务生成高质量的全景图，在 FID、CLIP-FID、IS 和 CLIP-Score 指标上实现最先进的性能。

Title: Scaling Zero-Shot Reference-to-Video Generation

Authors: Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.06905
Pdf URL: https://arxiv.org/pdf/2512.06905
Copy Paste: [[2512.06905]] Scaling Zero-Shot Reference-to-Video Generation(https://arxiv.org/abs/2512.06905)
Keywords: generation
Abstract: Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
摘要：视频参考 (R2V) 生成旨在合成与文本提示对齐的视频，同时保留参考图像中的主体身份。然而，当前的 R2V 方法受到对显式参考图像-视频-文本三元组的依赖的阻碍，其构建成本高昂且难以扩展。我们通过引入 Saber 来绕过这个瓶颈，这是一个不需要显式 R2V 数据的可扩展零样本框架。 Saber 专门针对视频-文本对进行训练，采用屏蔽训练策略和基于注意力的定制模型设计来学习身份一致和参考感知的表示。进一步集成掩模增强技术，以减轻参考视频生成中常见的复制粘贴伪影。此外，Sabre 在不同数量的参考中展示了卓越的泛化能力，并且与使用 R2V 数据训练的方法相比，在 OpenS2V-Eval 基准上实现了卓越的性能。

Title: Evaluating the Sensitivity of BiLSTM Forecasting Models to Sequence Length and Input Noise

Authors: Salma Albelali, Moataz Ahmed
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06926
Pdf URL: https://arxiv.org/pdf/2512.06926
Copy Paste: [[2512.06926]] Evaluating the Sensitivity of BiLSTM Forecasting Models to Sequence Length and Input Noise(https://arxiv.org/abs/2512.06926)
Keywords: generation
Abstract: Deep learning (DL) models, a specialized class of multilayer neural networks, have become central to time-series forecasting in critical domains such as environmental monitoring and the Internet of Things (IoT). Among these, Bidirectional Long Short-Term Memory (BiLSTM) architectures are particularly effective in capturing complex temporal dependencies. However, the robustness and generalization of such models are highly sensitive to input data characteristics - an aspect that remains underexplored in existing literature. This study presents a systematic empirical analysis of two key data-centric factors: input sequence length and additive noise. To support this investigation, a modular and reproducible forecasting pipeline is developed, incorporating standardized preprocessing, sequence generation, model training, validation, and evaluation. Controlled experiments are conducted on three real-world datasets with varying sampling frequencies to assess BiLSTM performance under different input conditions. The results yield three key findings: (1) longer input sequences significantly increase the risk of overfitting and data leakage, particularly in data-constrained environments; (2) additive noise consistently degrades predictive accuracy across sampling frequencies; and (3) the simultaneous presence of both factors results in the most substantial decline in model stability. While datasets with higher observation frequencies exhibit greater robustness, they remain vulnerable when both input challenges are present. These findings highlight important limitations in current DL-based forecasting pipelines and underscore the need for data-aware design strategies. This work contributes to a deeper understanding of DL model behavior in dynamic time-series environments and provides practical insights for developing more reliable and generalizable forecasting systems.
摘要：深度学习 (DL) 模型是一类特殊的多层神经网络，已成为环境监测和物联网 (IoT) 等关键领域时间序列预测的核心。其中，双向长短期记忆（BiLSTM）架构在捕获复杂的时间依赖性方面特别有效。然而，此类模型的鲁棒性和泛化性对输入数据特征高度敏感——这一方面在现有文献中仍未得到充分探索。本研究对两个以数据为中心的关键因素进行了系统的实证分析：输入序列长度和加性噪声。为了支持这项调查，开发了一个模块化且可重复的预测管道，其中包含标准化预处理、序列生成、模型训练、验证和评估。在具有不同采样频率的三个真实数据集上进行受控实验，以评估不同输入条件下的 BiLSTM 性能。结果得出三个关键发现：（1）较长的输入序列会显着增加过度拟合和数据泄漏的风险，特别是在数据受限的环境中； (2) 加性噪声会持续降低采样频率的预测准确性； (3)这两个因素同时存在会导致模型稳定性大幅下降。虽然观察频率较高的数据集表现出更强的鲁棒性，但当同时存在输入挑战时，它们仍然容易受到攻击。这些发现凸显了当前基于深度学习的预测流程的重要局限性，并强调了数据感知设计策略的必要性。这项工作有助于更深入地理解动态时间序列环境中的深度学习模型行为，并为开发更可靠和更通用的预测系统提供实用的见解。
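
The two data-centric factors studied above, input sequence length and additive noise, can be illustrated with a small sketch. This is not the authors' pipeline: the helper names `make_windows` and `add_gaussian_noise`, the Gaussian noise model, and the toy series are assumptions chosen only for illustration.

```python
import random


def make_windows(series, seq_len, horizon=1):
    """Slide a window of length seq_len over the series.

    Each input is seq_len consecutive values; its target is the value
    `horizon` steps after the end of the window.
    """
    inputs, targets = [], []
    for start in range(len(series) - seq_len - horizon + 1):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len + horizon - 1])
    return inputs, targets


def add_gaussian_noise(series, sigma, seed=0):
    """Return a copy of the series with zero-mean Gaussian noise added.

    Gaussian noise is an assumption here; the abstract only says "additive".
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in series]


# Toy series 0.0 .. 9.0 (hypothetical data, not one of the paper's datasets).
series = [float(t) for t in range(10)]
inputs, targets = make_windows(series, seq_len=3)
noisy = add_gaussian_noise(series, sigma=0.1)
```

Varying `seq_len` and `sigma` in such a setup is one way to reproduce the kind of controlled comparison the abstract describes; longer windows leave fewer training samples from the same series, which is relevant in data-constrained settings.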

Title: Hidden Leaks in Time Series Forecasting: How Data Leakage Affects LSTM Evaluation Across Configurations and Validation Strategies

Authors: Salma Albelali, Moataz Ahmed
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.06932
Pdf URL: https://arxiv.org/pdf/2512.06932
Copy Paste: [[2512.06932]] Hidden Leaks in Time Series Forecasting: How Data Leakage Affects LSTM Evaluation Across Configurations and Validation Strategies(https://arxiv.org/abs/2512.06932)
Keywords: generation
Abstract: Deep learning models, particularly Long Short-Term Memory (LSTM) networks, are widely used in time series forecasting due to their ability to capture complex temporal dependencies. However, evaluation integrity is often compromised by data leakage, a methodological flaw in which input-output sequences are constructed before dataset partitioning, allowing future information to unintentionally influence training. This study investigates the impact of data leakage on performance, focusing on how validation design mediates leakage sensitivity. Three widely used validation techniques (2-way split, 3-way split, and 10-fold cross-validation) are evaluated under both leaky (pre-split sequence generation) and clean conditions, with the latter mitigating leakage risk by enforcing temporal separation during data splitting prior to sequence construction. The effect of leakage is assessed using RMSE Gain, which measures the relative increase in RMSE caused by leakage, computed as the percentage difference between leaky and clean setups. Empirical results show that 10-fold cross-validation exhibits RMSE Gain values of up to 20.5% at extended lag steps. In contrast, 2-way and 3-way splits demonstrate greater robustness, typically maintaining RMSE Gain below 5% across diverse configurations. Moreover, input window size and lag step significantly influence leakage sensitivity: smaller windows and longer lags increase the risk of leakage, whereas larger windows help reduce it. These findings underscore the need for configuration-aware, leakage-resistant evaluation pipelines to ensure reliable performance estimation.
摘要：深度学习模型，特别是长短期记忆（LSTM）网络，由于能够捕获复杂的时间依赖性而被广泛用于时间序列预测。然而，评估完整性经常受到数据泄漏的影响，这是一种方法缺陷，在数据集分区之前构建输入输出序列，从而使未来的信息无意中影响训练。本研究调查了数据泄漏对性能的影响，重点关注验证设计如何调节泄漏敏感性。三种广泛使用的验证技术（2 路分割、3 路分割和 10 倍交叉验证）在泄漏（预分割序列生成）和干净条件下进行评估，后者通过在序列构建之前的数据分割期间强制执行时间分离来减轻泄漏风险。使用 RMSE 增益评估泄漏的影响，该增益测量泄漏引起的 RMSE 的相对增加，计算为泄漏和干净设置之间的百分比差异。经验结果表明，10 倍交叉验证在扩展滞后步骤时的 RMSE 增益值高达 20.5%。相比之下，2 路和 3 路分割表现出更高的鲁棒性，通常在不同的配置中将 RMSE 增益保持在 5% 以下。此外，输入窗口大小和滞后阶跃显着影响泄漏灵敏度：较小的窗口和较长的滞后会增加泄漏风险，而较大的窗口有助于降低泄漏风险。这些发现强调需要配置感知、防泄漏评估管道，以确保可靠的性能估计。
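
The difference between the leaky and clean setups, and the RMSE Gain measure, can be sketched as follows. The abstract does not give the exact formula, so `rmse_gain` below is one plausible reading (percentage difference relative to the clean RMSE) and should be treated as an assumption; the helper names and the chronological 80/20 split are also hypothetical.

```python
import math


def make_windows(indices, window, lag=1):
    """Build (input index tuple, target index) samples from time indices."""
    samples = []
    for s in range(len(indices) - window - lag + 1):
        samples.append((tuple(indices[s:s + window]), indices[s + window + lag - 1]))
    return samples


def covered(samples):
    """Set of all time indices touched by a list of samples."""
    out = set()
    for xs, y in samples:
        out.update(xs)
        out.add(y)
    return out


def leaky_split(n, window, train_frac=0.8):
    """Leaky: build sequences over the whole series, then split the samples."""
    samples = make_windows(list(range(n)), window)
    cut = int(len(samples) * train_frac)
    return samples[:cut], samples[cut:]


def clean_split(n, window, train_frac=0.8):
    """Clean: split the raw series first, then build sequences per partition."""
    idx = list(range(n))
    cut = int(n * train_frac)
    return make_windows(idx[:cut], window), make_windows(idx[cut:], window)


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def rmse_gain(rmse_leaky, rmse_clean):
    """Assumed definition: percentage difference relative to the clean RMSE."""
    return 100.0 * (rmse_leaky - rmse_clean) / rmse_clean


leaky_train, leaky_test = leaky_split(30, window=4)
clean_train, clean_test = clean_split(30, window=4)
shared_leaky = covered(leaky_train) & covered(leaky_test)  # time steps seen by both sides
shared_clean = covered(clean_train) & covered(clean_test)  # empty by construction
```

In the leaky variant, the last training windows and the first test windows share raw time steps, which is the overlap that temporal separation before sequence construction removes.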

Title: Evaluating and Preserving High-level Fidelity in Super-Resolution

Authors: Josep M. Rocafort, Shaolin Su, Javier Vazquez-Corral, Alexandra Gomez-Villa
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.07037
Pdf URL: https://arxiv.org/pdf/2512.07037
Copy Paste: [[2512.07037]] Evaluating and Preserving High-level Fidelity in Super-Resolution(https://arxiv.org/abs/2512.07037)
Keywords: super-resolution, generative
Abstract: Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans yet not well-studied in existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.
摘要：最近的图像超分辨率 (SR) 模型在重建细节和提供视觉上令人愉悦的输出方面取得了令人印象深刻的效果。然而，压倒性的生成能力有时会产生幻觉，从而改变图像内容，尽管获得了很高的视觉质量。这种类型的高级变化可以很容易地被人类识别，但在现有的低级图像质量指标中尚未得到充分研究。在本文中，我们确立了测量 SR 模型高保真度的重要性，作为揭示生成 SR 模型可靠性的补充标准。我们使用不同 SR 模型的保真度分数构建第一个带注释的数据集，并评估最先进 (SOTA) SR 模型在保持高水平保真度方面的实际表现。然后，我们基于数据集分析现有图像质量指标如何与保真度测量相关，并进一步表明基础模型可以更好地解决这一高级任务。最后，通过根据我们的保真度反馈微调 SR 模型，我们表明语义保真度和感知质量都可以得到改善，从而证明了我们提出的标准在模型评估和优化方面的潜在价值。我们将在接受后发布数据集、代码和模型。

Title: MSN: Multi-directional Similarity Network for Hand-crafted and Deep-synthesized Copy-Move Forgery Detection

Authors: Liangwei Jiang, Jinluo Xie, Yecheng Huang, Hua Zhang, Hongyu Yang, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07110
Pdf URL: https://arxiv.org/pdf/2512.07110
Copy Paste: [[2512.07110]] MSN: Multi-directional Similarity Network for Hand-crafted and Deep-synthesized Copy-Move Forgery Detection(https://arxiv.org/abs/2512.07110)
Keywords: generative
Abstract: Copy-move image forgery aims to duplicate certain objects or to hide specific contents with copy-move operations, which can be achieved by a sequence of manual manipulations as well as up-to-date deep generative network-based swapping. Its detection is becoming increasingly challenging due to the complex transformations and fine-tuned operations on the tampered regions. In this paper, we propose a novel two-stream model, namely Multi-directional Similarity Network (MSN), for accurate and efficient copy-move forgery detection. It addresses the two major limitations of existing deep detection models in representation and localization, respectively. In representation, an image is hierarchically encoded by a multi-directional CNN network, and due to the diverse augmentation in scales and rotations, the resulting features better measure the similarity between sampled patches in two streams. In localization, we design a 2-D similarity matrix based decoder, and compared with the current 1-D similarity vector based one, it makes full use of spatial information in the entire image, leading to the improvement in detecting tampered regions. Beyond the method, a new forgery database generated by various deep neural networks is presented, as a new benchmark for detecting the growing deep-synthesized copy-move. Extensive experiments are conducted on two classic image forensics benchmarks, i.e., CASIA CMFD and CoMoFoD, and the newly presented one. State-of-the-art results are reported, which demonstrate the effectiveness of the proposed approach.
摘要：复制移动图像伪造的目的是通过复制移动操作来复制某些对象或隐藏特定内容，这可以通过一系列手动操作以及最新的基于深度生成网络的交换来实现。对于篡改区域的复杂转换和微调操作，其检测变得越来越具有挑战性。在本文中，我们提出了一种新颖的双流模型，即多向相似网络（MSN），以准确高效地进行复制移动伪造检测。它分别解决了现有深度检测模型在 \textbf{representation} 和 \textbf{localization} 中的两个主要限制。在表示中，图像由多向 CNN 网络进行分层编码，并且由于尺度和旋转的多样化增强，该特征更好地测量了两个流中采样块之间的相似性。在定位方面，我们设计了一种基于二维相似矩阵的解码器，与当前基于一维相似向量的解码器相比，它充分利用了整个图像的空间信息，从而提高了检测篡改区域的能力。除了该方法之外，还提出了由各种深度神经网络生成的新伪造数据库，作为检测不断增长的深度合成复制移动的新基准。在两个经典图像取证基准，即 CASIA CMFD 和 CoMoFoD 以及新提出的基准上进行了大量实验。报告了最先进的结果，证明了所提出方法的有效性。

Title: Training-free Clothing Region of Interest Self-correction for Virtual Try-On

Authors: Shengjie Lu, Zhibin Wan, Jiejie Liu, Quan Zhang, Mingjie Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07126
Pdf URL: https://arxiv.org/pdf/2512.07126
Copy Paste: [[2512.07126]] Training-free Clothing Region of Interest Self-correction for Virtual Try-On(https://arxiv.org/abs/2512.07126)
Keywords: generation
Abstract: VTON (Virtual Try-ON) aims at synthesizing the target clothing on a certain person, preserving the details of the target clothing while keeping the rest of the person unchanged. Existing methods suffer from the discrepancies between the generated clothing results and the target ones, in terms of the patterns, textures and boundaries. Therefore, we propose to use an energy function to impose constraints on the attention map extracted through the generation process. Thus, at each generation step, the attention can be more focused on the clothing region of interest, thereby influencing the generation results to be more consistent with the target clothing details. Furthermore, to address the limitation that existing evaluation metrics concentrate solely on image realism and overlook the alignment with target elements, we design a new metric, Virtual Try-on Inception Distance (VTID), to bridge this gap and ensure a more comprehensive assessment. On the VITON-HD and DressCode datasets, our approach has outperformed the previous state-of-the-art (SOTA) methods by 1.4%, 2.3%, 12.3%, and 5.8% in the traditional metrics of LPIPS, FID, KID, and the new VTID metrics, respectively. Additionally, by applying the generated data to downstream Clothing-Change Re-identification (CC-Reid) methods, we have achieved performance improvements of 2.5%, 1.1%, and 1.6% on the LTCC, PRCC, VC-Clothes datasets in the metrics of Rank-1. The code of our method is public at this https URL.
摘要：VTON（Virtual Try-ON）旨在将目标服装合成到某个人身上，保留目标服装的细节，同时保持该人的其他部分不变。现有方法在图案、纹理和边界方面存在生成的服装结果与目标结果之间的差异。因此，我们建议使用能量函数对通过生成过程提取的注意力图施加约束。因此，在每个生成步骤，注意力可以更加集中在感兴趣的服装区域，从而影响生成结果与目标服装细节更加一致。此外，为了解决现有评估指标仅关注图像真实感而忽视与目标元素的一致性的局限性，我们设计了一个新的指标——虚拟试穿起始距离（VTID），以弥补这一差距并确保更全面的评估。在 VITON-HD 和 DressCode 数据集上，我们的方法在 LPIPS、FID、KID 和新 VTID 指标的传统指标中分别比之前最先进的 (SOTA) 方法高出 1.4%、2.3%、12.3% 和 5.8%。此外，通过将生成的数据应用于下游服装更换重新识别（CC-Reid）方法，我们在 LTCC、PRCC、VC-Clothes 数据集上的 Rank-1 指标中实现了 2.5%、1.1% 和 1.6% 的性能提升。我们方法的代码在此 https URL 上公开。

Title: Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Authors: Fenghua Weng, Chaochao Lu, Xia Hu, Wenqi Shao, Wenjie Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.07141
Pdf URL: https://arxiv.org/pdf/2512.07141
Copy Paste: [[2512.07141]] Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models(https://arxiv.org/abs/2512.07141)
Keywords: generation
Abstract: As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at this https URL.
摘要：随着多模态推理提高大视觉语言模型（LVLM）的整体能力，最近的研究开始探索面向安全的推理，旨在通过在生成最终响应之前分析推理过程中潜在的安全风险来增强安全意识。尽管此类方法提高了安全意识和可解释性，但这种单遍思考然后回答的范式仍然容易受到上下文或视觉越狱攻击。这揭示了一个严重缺陷：单遍推理可能会忽略其输出中明确的有害内容。我们的主要见解是通过反射来利用这种浪费的信号，这可以有效地利用首次推理中揭示的恶意内容来实现真正的自我纠正并防止不安全的生成。受此启发，我们提出了思考-反思-修订（TRR），这是一个三阶段培训框架，旨在通过政策引导的自我反思来增强 LVLM 的安全一致性。我们首先构建了一个反思安全推理 (ReSafe) 数据集，其中包含 5,000 个示例，遵循思考-反思-修改流程。然后，我们使用 ReSafe 数据集对目标模型进行微调，以初始化反思行为，最后通过强化学习强化策略引导的反思。实验结果表明，TRR在安全意识基准测试和越狱攻击评估上都显着提高了LVLM的安全性能，将Qwen2.5-VL-7B上的整体安全响应率从42.8%提高到87.7%，同时在MMMU和MMStar等通用基准测试上保持稳定的性能。项目页面可通过此 https URL 获取。

Title: FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers

Authors: Jonghyun Park, Jong Chul Ye
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.07150
Pdf URL: https://arxiv.org/pdf/2512.07150
Copy Paste: [[2512.07150]] FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers(https://arxiv.org/abs/2512.07150)
Keywords: generative
Abstract: Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K, outperforming state-of-the-art inverse solvers.
摘要：深度生成模型已成为解决逆问题的强大先验，并且已经开发了各种免训练方法。然而，当应用于潜在流模型时，现有方法通常无法收敛到后验模式或在潜在空间内遭受流形偏差。为了缓解这个问题，我们在这里引入了一种新颖的免训练框架 FlowLPS，它通过 Langevin 近端采样 (LPS) 策略解决预训练流模型的逆问题。我们的方法将用于流形一致探索的 Langevin 动力学与用于精确模式搜索的近端优化相结合，在 FFHQ 和 DIV2K 上的多个逆任务中实现重建保真度和感知质量之间的卓越平衡，优于最先进的逆解算器。
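
As a rough illustration of alternating a Langevin step with a proximal step, the toy below works on a scalar problem with a standard normal prior and a linear measurement y = a·x. It is not the FlowLPS algorithm: there is no flow model or latent space here, and the function names, the prior, the step sizes, and the closed-form proximal update for a quadratic data term are all assumptions made for the sketch.

```python
import math
import random


def langevin_step(x, score, step, noise_scale, rng):
    """One Langevin update: drift along the prior score plus Gaussian noise."""
    noise = noise_scale * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x + step * score(x) + noise


def proximal_step(x, y, a, lam):
    """Closed-form argmin over z of 0.5*(a*z - y)**2 + (z - x)**2 / (2*lam)."""
    return (x + lam * a * y) / (1.0 + lam * a * a)


def sample(y, a=1.0, step=0.1, lam=1.0, n_iters=200, noise_scale=1.0, seed=0):
    """Alternate Langevin exploration with a proximal data-consistency step."""
    rng = random.Random(seed)

    def score(x):
        # Score (gradient of log-density) of a standard normal prior: assumed toy prior.
        return -x

    x = 0.0
    for _ in range(n_iters):
        x = langevin_step(x, score, step, noise_scale, rng)
        x = proximal_step(x, y, a, lam)
    return x


x_det = sample(y=2.0, noise_scale=0.0)  # noise switched off: deterministic fixed point
x_noisy = sample(y=2.0)                 # with noise: a stochastic iterate
```

With the noise switched off, the iteration reduces to the affine map x → 0.45·x + 1 for these settings, so it settles at lam·y / (lam + step); the noisy run fluctuates around that region.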

Title: CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

Authors: Dahyeon Kye, Jeahun Sung, MinKyu Jeon, Jihyong Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07155
Pdf URL: https://arxiv.org/pdf/2512.07155
Copy Paste: [[2512.07155]] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics(https://arxiv.org/abs/2512.07155)
Keywords: generative
Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
摘要：扩散模型表现出卓越的生成能力，但实现平滑且语义一致的图像变形仍然是一个挑战。由于缺乏自适应结构和语义对齐，现有方法经常会产生突然的过渡或过度饱和的外观。我们提出了 CHIMERA，一种基于零样本扩散的框架，它将变形表述为缓存反转引导的去噪过程。为了处理较大的语义和外观差异，我们提出了自适应缓存注入和语义锚提示。自适应缓存注入 (ACI) 在 DDIM 反转期间缓存来自两个输入的下、中和上块特征，并在去噪期间自适应地重新注入它们，从而以深度和时间自适应方式实现空间和语义对齐，并实现自然特征融合和平滑过渡。语义锚提示 (SAP) 利用视觉语言模型生成共享锚提示，充当语义锚，桥接不同的输入并指导去噪过程获得连贯的结果。最后，我们引入了全局-局部一致性得分（GLCS），这是一种面向变形的度量，它同时评估两个输入的全局协调性和局部变形过渡的平滑度。大量的实验和用户研究表明，CHIMERA 比现有方法实现了更平滑、语义上更一致的过渡，在图像变形方面建立了新的技术水平。代码和项目页面将公开发布。

Title: When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing

Authors: Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07166
Pdf URL: https://arxiv.org/pdf/2512.07166
Copy Paste: [[2512.07166]] When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing(https://arxiv.org/abs/2512.07166)
Keywords: generation
Abstract: Privacy leakage in Multimodal Large Language Models (MLLMs) has long been an intractable problem. Existing studies, though they effectively obscure private information in MLLMs, often overlook the evaluation of the authenticity and recovery quality of user privacy. To this end, this work uniquely focuses on the critical challenge of how to restore surrogate-driven protected data in diverse MLLM scenarios. We first bridge this research gap by contributing the SPPE (Surrogate Privacy Protected Editable) dataset, which includes a wide range of privacy categories and user instructions to simulate real MLLM applications. This dataset offers protected surrogates alongside their various MLLM-edited versions, thus enabling the direct assessment of privacy recovery quality. By formulating privacy recovery as a guided generation task conditioned on complementary multimodal signals, we further introduce a unified approach that reliably reconstructs private content while preserving the fidelity of MLLM-generated edits. The experiments on both SPPE and InstructPix2Pix further show that our approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.
摘要：多模式大语言模型（MLLM）中的隐私泄露长期以来一直是一个棘手的问题。现有研究虽然有效地掩盖了 MLLM 中的隐私信息，但往往忽视了对用户隐私的真实性和恢复质量的评估。为此，这项工作特别关注如何在不同的 MLLM 场景中恢复代理驱动的受保护数据的关键挑战。我们首先通过提供 SPPE（代理隐私保护可编辑）数据集来弥补这一研究差距，其中包括广泛的隐私类别和用户指令来模拟真实的 MLLM 应用程序。该数据集提供受保护的代理及其各种 MLLM 编辑版本，从而能够直接评估隐私恢复质量。通过将隐私恢复制定为以互补多模态信号为条件的引导生成任务，我们进一步引入了一种统一的方法，可以可靠地重建隐私内容，同时保留 MLLM 生成的编辑的保真度。 SPPE 和 InstructPix2Pix 上的实验进一步表明，我们的方法可以很好地泛化不同的视觉内容和编辑任务，在隐私保护和 MLLM 可用性之间实现了强有力的平衡。

Title: Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Authors: Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07170
Pdf URL: https://arxiv.org/pdf/2512.07170
Copy Paste: [[2512.07170]] Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach(https://arxiv.org/abs/2512.07170)
Keywords: restoration
Abstract: Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.
摘要：图像融合旨在融合来自多种传感模式的互补信息，但现有方法在鲁棒性、适应性和可控性方面仍然有限。目前大多数融合网络都是针对特定任务量身定制的，缺乏灵活融入用户意图的能力，尤其是在涉及低光衰减、色偏或曝光不平衡的复杂场景中。此外，缺乏真实融合图像以及现有数据集规模较小，使得训练同时理解高级语义和执行细粒度多模态对齐的端到端模型变得困难。因此，我们提出了 DiTFuse，指令驱动的扩散变换器 (DiT) 框架，它在单个模型中执行端到端、语义感知的融合。通过在共享潜在空间中联合编码两个图像和自然语言指令，DiTFuse 能够对融合动态进行分层和细粒度控制，克服融合前和融合后管道难以注入高级语义的限制。训练阶段采用多降级掩模图像建模策略，因此网络共同学习跨模态对齐、模态不变恢复和任务感知特征选择，而不依赖于地面实况图像。精心策划的多粒度指令数据集进一步为模型配备了交互式融合功能。 DiTFuse 将红外-可见光、多焦点和多重曝光融合以及文本控制的细化和下游任务统一在一个架构内。公共 IVIF、MFF 和 MEF 基准测试的实验证实了卓越的定量和定性性能、更清晰的纹理和更好的语义保留。该模型还支持多级用户控制和零样本泛化到其他多图像融合场景，包括指令条件分割。

Title: TIDE: Two-Stage Inverse Degradation Estimation with Guided Prior Disentanglement for Underwater Image Restoration

Authors: Shravan Venkatraman, Rakesh Raj Madavan, Pavan Kumar S, Muthu Subash Kavitha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07171
Pdf URL: https://arxiv.org/pdf/2512.07171
Copy Paste: [[2512.07171]] TIDE: Two-Stage Inverse Degradation Estimation with Guided Prior Disentanglement for Underwater Image Restoration(https://arxiv.org/abs/2512.07171)
Keywords: restoration
Abstract: Underwater image restoration is essential for marine applications ranging from ecological monitoring to archaeological surveys, but effectively addressing the complex and spatially varying nature of underwater degradations remains a challenge. Existing methods typically apply uniform restoration strategies across the entire image, struggling to handle multiple co-occurring degradations that vary spatially and with water conditions. We introduce TIDE, a Two-stage Inverse Degradation Estimation framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition. Our approach disentangles the restoration process into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by a progressive refinement stage that corrects residual artifacts. Specifically, TIDE decomposes underwater degradations into four key factors, namely color distortion, haze, detail loss, and noise, and designs restoration experts specialized for each. By generating specialized restoration hypotheses, TIDE balances competing degradation factors and produces natural results even in highly degraded regions. Extensive experiments across both standard benchmarks and challenging turbid water conditions show that TIDE achieves competitive performance on reference-based fidelity metrics while outperforming state-of-the-art methods on non-reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement. Our code is available at: this https URL.
摘要：水下图像恢复对于从生态监测到考古调查等海洋应用至关重要，但有效解决水下退化的复杂性和空间变化性质仍然是一个挑战。现有方法通常在整个图像上应用统一的恢复策略，难以处理随空间和水条件变化的多种同时发生的退化。我们引入了 TIDE，一个 $\underline{t}$wo 阶段 $\underline{i}$nverse $\underline{d}$egradation $\underline{e}$估计框架，它显式地模拟退化特征并通过专门的先验分解应用有针对性的恢复。我们的方法将恢复过程分解为多个专门的假设，这些假设根据局部退化模式自适应融合，然后是纠正残留伪影的渐进细化阶段。具体来说，TIDE将水下退化分解为四个关键因素，即颜色失真、雾度、细节损失和噪声，并针对每个因素设计了专门的修复专家。通过生成专门的恢复假设，TIDE 可以平衡相互竞争的退化因素，即使在高度退化的地区也能产生自然的结果。跨标准基准和具有挑战性的浑水条件的大量实验表明，TIDE 在基于参考的保真度指标上实现了具有竞争力的性能，同时在非参考感知质量指标上超越了最先进的方法，并且在色彩校正和对比度增强方面有了很大的改进。我们的代码位于：此 https URL。

Title: Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.07173
Pdf URL: https://arxiv.org/pdf/2512.07173
Copy Paste: [[2512.07173]] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration(https://arxiv.org/abs/2512.07173)
Keywords: generation
Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
摘要：我们提出了 CadLLM，这是一种无需训练的方法，可加速基于扩散的 LLM (dLLM) 的推理吞吐量。我们首先研究跨区块和步骤的代币揭秘置信度的动态性质。基于这一观察，我们提出了一种轻量级自适应方法，该方法根据未屏蔽标记的平均置信度来控制生成块大小、步长和阈值。我们通过动态利用词汇表的子集来调节采样宽度，进一步减少了 softmax 开销。 CadLLM 是一种即插即用、与模型无关的方法，与基于 KV 缓存的 dLLM 兼容。针对四项热门任务的大量实验表明，与最先进的基准相比，CadLLM 的吞吐量提高了 2.28 倍，且精度具有竞争力。

Title: UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting

Authors: Da Zhang, Bingyu Li, Zhuyuan Zhao, Junyu Gao, Feiping Nie, Xuelong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.07184
Pdf URL: https://arxiv.org/pdf/2512.07184
Copy Paste: [[2512.07184]] UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting(https://arxiv.org/abs/2512.07184)
Keywords: generation
Abstract: As multimodal data proliferates across diverse real-world applications, leveraging heterogeneous information such as texts and timestamps for accurate time series forecasting (TSF) has become a critical challenge. While diffusion models demonstrate exceptional performance in generation tasks, their application to TSF remains largely confined to modeling single-modality numerical sequences, overlooking the abundant cross-modal signals inherent in complex heterogeneous data. To address this gap, we propose UniDiff, a unified diffusion framework for multimodal time series forecasting. To process the numerical sequence, our framework first tokenizes the time series into patches, preserving local temporal dynamics by mapping each patch to an embedding space via a lightweight MLP. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism adaptively weighs and integrates structural information from timestamps and semantic context from texts in one step, enabling a flexible and efficient interplay between modalities. Furthermore, we introduce a novel classifier-free guidance mechanism designed for multi-source conditioning, allowing for decoupled control over the guidance strength of textual and temporal information during inference, which significantly enhances model robustness. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
摘要：随着多模态数据在不同的现实世界应用中激增，利用文本和时间戳等异构信息进行准确的时间序列预测 (TSF) 已成为一项关键挑战。虽然扩散模型在生成任务中表现出卓越的性能，但它们在 TSF 中的应用仍然主要局限于对单模态数值序列进行建模，而忽略了复杂异构数据中固有的丰富的跨模态信号。为了解决这一差距，我们提出了 UniDiff，这是一种用于多模式时间序列预测的统一扩散框架。为了处理数字序列，我们的框架首先将时间序列标记为补丁，通过轻量级 MLP 将每个补丁映射到嵌入空间，从而保留局部时间动态。其核心在于一个统一且并行的融合模块，其中单个交叉注意机制一步自适应地权衡和集成来自时间戳的结构信息和来自文本的语义上下文，从而实现模态之间灵活高效的相互作用。此外，我们引入了一种专为多源调节而设计的新颖的无分类器引导机制，允许在推理过程中对文本和时间信息的引导强度进行解耦控制，从而显着增强模型的鲁棒性。对跨八个领域的真实世界基准数据集进行的广泛实验表明，所提出的 UniDiff 模型实现了最先进的性能。
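
The decoupled guidance described in the UniDiff abstract can be illustrated with a small sketch. The abstract does not give the formula, so the version below assumes one common way to extend classifier-free guidance to two condition sources: start from the unconditional noise estimate and add a separately weighted correction for each source. The function name, the weights `w_text` and `w_time`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
from typing import List


def multi_source_guidance(
    eps_uncond: List[float],
    eps_text: List[float],
    eps_time: List[float],
    w_text: float,
    w_time: float,
) -> List[float]:
    """Combine noise estimates with one guidance weight per condition source.

    Assumed form (an illustration, not the paper's stated formula):
        eps = eps_uncond + w_text * (eps_text - eps_uncond)
                         + w_time * (eps_time - eps_uncond)
    """
    return [
        u + w_text * (t - u) + w_time * (s - u)
        for u, t, s in zip(eps_uncond, eps_text, eps_time)
    ]


if __name__ == "__main__":
    eps_uncond = [0.0, 0.0]  # estimate with no conditioning
    eps_text = [1.0, 0.0]    # estimate conditioned on text only
    eps_time = [0.0, 2.0]    # estimate conditioned on timestamps only
    # Strong text guidance, weak timestamp guidance.
    print(multi_source_guidance(eps_uncond, eps_text, eps_time, w_text=2.0, w_time=0.5))
```

Setting both weights to zero recovers the unconditional estimate, and each weight can be tuned at inference time without touching the other, which is the kind of decoupled control the abstract refers to.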

Title: START: Spatial and Textual Learning for Chart Understanding

Authors: Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07186
Pdf URL: https://arxiv.org/pdf/2512.07186
Copy Paste: [[2512.07186]] START: Spatial and Textual Learning for Chart Understanding(https://arxiv.org/abs/2512.07186)
Keywords: generation
Abstract: Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
摘要：图表理解对于在现实场景（例如分析科学论文和技术报告）中部署多模式大语言模型 (MLLM) 至关重要。与自然图像不同，图表将结构化视觉布局（空间属性）与底层数据表示（文本属性）配对——掌握两者对于精确、细粒度的图表推理至关重要。受这一观察的启发，我们提出了 START，即用于图表理解的空间和文本学习。具体来说，我们引入了 (i) 图表元素基础和 (ii) 图表到代码生成，以加强 MLLM 对图表视觉布局和数据细节的理解。为了促进空间和文本学习，我们提出了使用新颖的数据生成管道生成的 START-Dataset，该管道首先利用 MLLM 将真实图表图像转换为可执行图表代码，恢复底层数据表示，同时保留真实世界图表的视觉分布。然后，我们使用大型语言模型 (LLM) 改进代码，以确定捕获图表视觉结构的图表元素的位置，解决现有方法无法应对的挑战。为了评估模型理解图表空间结构的能力，我们提出了图表空间理解基准（CS-Bench），填补了综合图表理解评估中的关键空白。利用空间和文本学习，START 在模型大小和基准模型上提供了一致的增益，并明显超越了之前的最先进技术。代码、数据和模型将公开。

Title: HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression

Authors: Niu Yi, Xu Tianyi, Ma Mingming, Wang Xinkun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07192
Pdf URL: https://arxiv.org/pdf/2512.07192
Copy Paste: [[2512.07192]] HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression(https://arxiv.org/abs/2512.07192)
Keywords: generative
Abstract: Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Building on this foundation and a novel loss design, this framework is, to our knowledge, the first to introduce rate-distortion (RD) balance and control into vector quantization-based Generative Image Compression. Cooperating with a lightweight hyper-prior estimation network, HVQ-CGIC achieves a significant advantage in RD performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.
摘要：使用矢量量化（VQ）的生成学习图像压缩方法最近在平衡失真和感知质量方面表现出了令人印象深刻的潜力。然而，这些方法通常使用静态的全局概率分布来估计 VQ 指数的熵，这无法适应每个图像的具体内容。这种非自适应方法导致未开发的比特率潜力和实现灵活的速率控制的挑战。为了应对这一挑战，我们引入了一种基于 VQ Hyperprior 的可控生成图像压缩框架，称为 HVQ-CGIC。 HVQ-CGIC 严格推导了在 VQ 指数熵模型中引入超先验的数学基础。在此基础上，通过新颖的损失设计，据我们所知，该框架是第一个将RD平衡和控制引入到基于矢量量化的生成图像压缩中的框架。与轻量级超先验估计网络配合，HVQ-CGIC 与当前最先进的 (SOTA) 生成压缩方法相比，在率失真 (RD) 性能方面取得了显着优势。在 Kodak 数据集上，我们实现了与 Control-GIC、CDC 和 HiFiC 相同的 LPIPS，平均位数减少了 61.3%。我们认为 HVQ-CGIC 有潜力成为基于 VQGAN 的图像压缩的基础组件，类似于 HyperPrior 框架在神经图像压缩中的不可或缺的作用。
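
The bitrate argument in the HVQ-CGIC abstract rests on a standard fact from entropy coding: an index coded under probability p costs about -log2(p) bits, so a distribution that adapts to the image can shorten the code. The sketch below compares a static uniform distribution with hand-written per-position distributions. The toy indices and probabilities are hypothetical, and the sketch ignores the side information a real hyperprior would also have to transmit.

```python
import math
from typing import Sequence


def code_length_bits(indices: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Ideal code length: sum of -log2 p(index) over all positions."""
    return sum(-math.log2(p[i]) for i, p in zip(indices, probs))


if __name__ == "__main__":
    indices = [0, 0, 3]  # toy VQ indices, codebook size 4

    # Static global model: the same uniform distribution at every position.
    static = [[0.25, 0.25, 0.25, 0.25]] * len(indices)

    # Adaptive model: per-position distributions (hand-written stand-ins
    # for what a hyperprior network might predict).
    adaptive = [
        [0.7, 0.1, 0.1, 0.1],
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7],
    ]

    print("static bits:  ", code_length_bits(indices, static))
    print("adaptive bits:", code_length_bits(indices, adaptive))
```

The adaptive model spends fewer bits here only because its predictions concentrate mass on the indices that actually occur; a poorly matched adaptive model would cost more than the static one.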

Title: Generating Storytelling Images with Rich Chains-of-Reasoning

Authors: Xiujie Song, Qi Jia, Shota Watanabe, Xiaoyi Pang, Ruijie Chen, Mengyue Wu, Kenny Q. Zhu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.07198
Pdf URL: https://arxiv.org/pdf/2512.07198
Copy Paste: [[2512.07198]] Generating Storytelling Images with Rich Chains-of-Reasoning(https://arxiv.org/abs/2512.07198)
Keywords: generation, generative
Abstract: An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at this https URL.
摘要：图像可以通过呈现丰富的、逻辑上相互关联的视觉线索来传达引人入胜的故事。这些连接在图像中形成推理链 (CoR)，使观看者能够推断事件、因果关系和其他信息，从而理解潜在的故事。在本文中，我们关注这些语义丰富的图像，并将它们定义为讲故事的图像。这些图像除了插图创作和认知筛选之外还有多种应用，利用它们以视觉方式传达多层信息并激发积极解释的能力。然而，由于其复杂的语义性质，讲故事的图像本身就具有挑战性，因此仍然相对稀缺。为了应对这一挑战，我们引入了讲故事图像生成任务，该任务探索如何利用生成式人工智能模型来创建此类图像。具体来说，我们提出了一个两阶段的管道 StorytellingPainter，它将大型语言模型 (LLM) 的创造性推理能力与文本到图像 (T2I) 模型的视觉合成能力结合起来，以生成讲故事的图像。除了这个流程之外，我们还开发了一个专用的评估框架，包括三个主要评估器：语义复杂性评估器、基于 KNN 的多样性评估器和故事-图像对齐评估器。考虑到故事生成在讲故事图像生成任务中的关键作用以及开源法学硕士和专有法学硕士之间的性能差异，我们进一步探索定制的培训策略来缩小这种差距，从而产生了一系列轻量级但有效的模型，称为迷你故事讲述者。实验结果证明了我们的方法的可行性和有效性。该代码可从此 https URL 获取。

Title: Understanding Diffusion Models via Code Execution

Authors: Cheng Yu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.07201
Pdf URL: https://arxiv.org/pdf/2512.07201
Copy Paste: [[2512.07201]] Understanding Diffusion Models via Code Execution(https://arxiv.org/abs/2512.07201)
Keywords: generative
Abstract: Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components -- including forward diffusion, reverse sampling, the noise-prediction network, and the training loop -- while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: this https URL.
摘要：扩散模型在生成建模中取得了显着的性能，但其理论基础往往很复杂，论文中的数学公式与实际的开源实现之间的差距可能难以弥合。现有教程主要侧重于推导方程，对扩散模型如何在代码中实际运行提供有限的指导。为了解决这个问题，我们提出了大约 300 行的简洁实现，从代码执行的角度解释了扩散模型。我们的最小示例保留了基本组件——包括前向扩散、反向采样、噪声预测网络和训练循环——同时删除了不必要的工程细节。这份技术报告旨在让研究人员对扩散模型在实践中如何工作以及代码和理论如何对应有一个清晰的、以实现为先的理解。我们的代码和预训练模型可在以下位置获取：此 https URL。
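
Two of the components the abstract lists, forward diffusion and the noise-prediction training target, can be sketched in a few lines. The code below follows the standard DDPM formulation, where x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and is not the report's own implementation. The linear schedule endpoints, the function names, and the toy data are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Tuple


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances beta_1..beta_T (endpoint values assumed)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0: List[float], t: int, alpha_bars: List[float],
                    rng: random.Random) -> Tuple[List[float], List[float]]:
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


def noise_prediction_loss(predicted: List[float], true_noise: List[float]) -> float:
    """Mean squared error between predicted and true noise."""
    return sum((p - n) ** 2 for p, n in zip(predicted, true_noise)) / len(true_noise)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    x0 = [1.0, -0.5, 0.25]
    xt, eps = forward_diffuse(x0, t=500, alpha_bars=alpha_bars, rng=rng)
    # A real model would predict eps from (xt, t); zeros stand in for it here.
    print(xt, noise_prediction_loss([0.0] * len(eps), eps))
```

In a training loop, the network would receive `xt` and `t` and be optimized to make `noise_prediction_loss` small; reverse sampling then uses those predictions to step from pure noise back toward data.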

Title: Unified Camera Positional Encoding for Controlled Video Generation

Authors: Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07237
Pdf URL: https://arxiv.org/pdf/2512.07237
Copy Paste: [[2512.07237]] Unified Camera Positional Encoding for Controlled Video Generation(https://arxiv.org/abs/2512.07237)
Keywords: generation
Abstract: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at this https URL.
摘要：Transformer 已成为 3D 感知、视频生成以及自动驾驶和人工智能世界模型的通用支柱，其中理解相机几何形状对于在三维空间中进行视觉观察至关重要。然而，现有的相机编码方法通常依赖于简化的针孔假设，限制了现实世界相机中各种本征和镜头畸变的泛化。我们引入了相对光线编码，这是一种几何一致的表示，可以统一完整的相机信息，包括 6-DoF 位姿、本征和镜头畸变。为了评估其在不同可控性需求下的能力，我们采用相机控制的文本到视频生成作为测试台任务。在此设置中，我们进一步将俯仰和滚动识别为对绝对方向编码有效的两个组件，从而能够完全控制初始相机方向。这些设计共同形成了 UCPE（统一相机位置编码），它通过轻量级空间注意力适配器集成到预训练的视频扩散变压器中，添加了不到 1% 的可训练参数，同时实现了最先进的相机可控性和视觉保真度。为了促进系统训练和评估，我们构建了一个涵盖各种相机运动和镜头类型的大型视频数据集。大量实验验证了 UCPE 在摄像机可控视频生成方面的有效性，并强调了其作为 Transformers 跨未来多视图、视频和 3D 任务的通用摄像机表示的潜力。代码将在此 https URL 中提供。

Title: DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement

Authors: Handing Xu, Zhenguo Nie, Tairan Peng, Huimin Pan, Xin-Jun Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07253
Pdf URL: https://arxiv.org/pdf/2512.07253
Copy Paste: [[2512.07253]] DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement(https://arxiv.org/abs/2512.07253)
Keywords: generative
Abstract: Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Overall, our results suggest that implicitly learning and propagating degradation representations offers a practical pathway for clinical application.
摘要：内窥镜手术依赖于术中视频，使得图像质量成为手术安全性和有效性的决定性因素。然而，内窥镜视频常常因照明不均匀、组织散射、遮挡和运动模糊而质量下降，从而模糊了关键的解剖细节并使手术操作复杂化。尽管基于深度学习的方法在图像增强方面显示出了希望，但大多数现有方法对于实时手术使用的计算要求仍然过高。为了应对这一挑战，我们提出了一种用于内窥镜视频增强的退化感知框架，该框架通过跨帧传播退化表示来实现实时、高质量的增强。在我们的框架中，首先使用对比学习从图像中提取退化表示。然后，我们引入一种融合机制，用这些表示来调制图像特征，以指导单帧增强模型，该模型通过退化图像和恢复图像之间的循环一致性约束进行训练，以提高鲁棒性和泛化性。实验表明，与几种最先进的方法相比，我们的框架在性能和效率之间实现了卓越的平衡。这些结果突出了实时内窥镜视频增强的退化感知建模的有效性。尽管如此，我们的方法表明隐式学习和传播降解表示为临床应用提供了一条实用的途径。

Title: A graph generation pipeline for critical infrastructures based on heuristics, images and depth data

Authors: Mike Diessner, Yannick Tarant
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.07269
Pdf URL: https://arxiv.org/pdf/2512.07269
Copy Paste: [[2512.07269]] A graph generation pipeline for critical infrastructures based on heuristics, images and depth data(https://arxiv.org/abs/2512.07269)
Keywords: generation
Abstract: Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.
摘要：物理关键基础设施（例如水厂或能源厂）的虚拟表示用于模拟和数字孪生，以确保其服务的弹性和连续性。这些模型通常需要来自激光扫描仪的 3D 点云，这些扫描仪的获取成本很高，并且需要专业知识才能使用。在本文中，我们提出了一种基于摄影测量的图形生成管道。该管道使用 RGB 图像和立体相机生成的深度数据来检测相关对象并预测它们的关系。这种更具成本效益的方法使用深度学习进行对象检测和对象的实例分割，并采用用户定义的启发式或规则来推断它们的关系。两个液压系统的结果表明，该策略可以生成接近真实情况的图表，同时其灵活性允许该方法针对特定应用进行定制，其透明度使其有资格用于关键基础设施所需的高风险决策。
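
The rule-based relation step in the graph generation abstract can be pictured with a small sketch. The abstract does not specify the heuristics, so the rule below is a hypothetical example: two detected objects are connected when their labels form an allowed pair and their 3D centroids (as might be estimated from depth data) lie within a distance threshold. The object names, labels, coordinates, and threshold are invented for illustration.

```python
import math
from typing import FrozenSet, List, Set, Tuple

# (object id, class label, 3D centroid in metres)
Detection = Tuple[str, str, Tuple[float, float, float]]


def infer_edges(
    detections: List[Detection],
    allowed_pairs: Set[FrozenSet[str]],
    max_distance: float,
) -> List[Tuple[str, str]]:
    """Connect objects whose label pair is allowed and whose centroids are close."""
    edges = []
    for i, (id_a, label_a, pos_a) in enumerate(detections):
        for id_b, label_b, pos_b in detections[i + 1:]:
            pair = frozenset((label_a, label_b))
            if pair in allowed_pairs and math.dist(pos_a, pos_b) <= max_distance:
                edges.append((id_a, id_b))
    return edges


if __name__ == "__main__":
    detections = [
        ("p1", "pump", (0.0, 0.0, 1.0)),
        ("s1", "pipe", (0.3, 0.0, 1.0)),
        ("t1", "tank", (5.0, 0.0, 1.0)),
    ]
    # User-defined rule: pumps connect to pipes, pipes connect to tanks.
    allowed = {frozenset(("pump", "pipe")), frozenset(("pipe", "tank"))}
    print(infer_edges(detections, allowed, max_distance=0.5))
```

Because the rule is an explicit function of labels and distances, each edge in the resulting graph can be traced back to the condition that produced it, which matches the transparency argument made in the abstract.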

Title: ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

Authors: Ziyang Mai, Yu-Wing Tai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07328
Pdf URL: https://arxiv.org/pdf/2512.07328
Copy Paste: [[2512.07328]] ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation(https://arxiv.org/abs/2512.07328)
Keywords: generation
Abstract: Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: this https URL.
摘要：文本转视频 (T2V) 生成技术发展迅速，但跨场景保持一致的角色身份仍然是一项重大挑战。现有的个性化方法通常侧重于面部识别，但无法保留更广泛的上下文线索，例如发型、服装和体形，而这些线索对于视觉连贯性至关重要。我们提出了 \textbf{ContextAnyone}，这是一种上下文感知扩散框架，可以从文本和单个参考图像生成字符一致的视频。我们的方法联合重建参考图像并生成新的视频帧，使模型能够充分感知和利用参考信息。通过新颖的 Emphasize-Attention 模块，参考信息有效地集成到基于 DiT 的扩散主干中，该模块有选择地增强参考感知功能并防止跨帧的身份漂移。双引导损失结合了扩散和参考重建目标以增强外观保真度，而所提出的 Gap-RoPE 位置嵌入将参考和视频标记分开以稳定时间建模。实验表明，ContextAnyone 在身份一致性和视觉质量方面优于现有的视频参考方法，可在不同的动作和场景中生成连贯且保留上下文的角色视频。项目页面：\href{此 https URL}{此 https URL}。

Title: Generalized Referring Expression Segmentation on Aerial Photos

Authors: Luís Marnoto, Alexandre Bernardino, Bruno Martins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07338
Pdf URL: https://arxiv.org/pdf/2512.07338
Copy Paste: [[2512.07338]] Generalized Referring Expression Segmentation on Aerial Photos(https://arxiv.org/abs/2512.07338)
Keywords: generation
Abstract: Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at this https URL.
摘要：引用表达分割是计算机视觉中的一项基本任务，它将自然语言理解与目标区域的精确视觉定位相结合。考虑到航空图像（例如，通过无人机收集的现代航空照片、航空档案中的历史照片、高分辨率卫星图像等）提出了独特的挑战，因为不同数据集的空间分辨率差异很大，颜色的使用不一致，目标通常缩小到只有几个像素，并且场景包含非常高的对象密度和具有部分遮挡的对象。这项工作提出了 Aerial-D，这是一种用于航空图像的新型大规模引用表达分割数据集，包含 37,288 个图像和 1,522,523 个引用表达，覆盖 259,709 个带注释的目标，跨越单个对象实例、实例组和语义区域，涵盖从车辆和基础设施到土地覆盖类型的 21 个不同类别。该数据集是通过全自动管道构建的，该管道将系统的基于规则的表达式生成与大型语言模型 (LLM) 增强程序相结合，丰富了语言多样性并关注引用表达式中的视觉细节。另外还使用滤镜来模拟每个场景的历史成像条件。我们采用 RSRefSeg 架构，并在 Aerial-D 上与先前的航空数据集一起训练模型，为现代和历史图像的文本生成统一的实例和语义分割。结果表明，组合训练在当代基准上实现了具有竞争力的性能，同时在档案航空摄影中出现的单色、棕褐色和颗粒状退化下保持了很高的准确性。数据集、经过训练的模型和完整的软件管道可通过此 https URL 公开获得。

Title: Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting

Authors: Shilong Jin, Haoran Duan, Litao Hua, Wentao Huang, Yuan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07345
Pdf URL: https://arxiv.org/pdf/2512.07345
Copy Paste: [[2512.07345]] Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting(https://arxiv.org/abs/2512.07345)
Keywords: generation
Abstract: Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.
摘要：从文本到图像 (T2I) 扩散模型中提取的多功能 3D 任务（例如生成或编辑）由于不依赖于广泛的 3D 训练数据而引起了广泛的研究兴趣。然而，T2I 模型表现出由于先前视图偏差而导致的局限性，这会在对象的不同视图之间产生冲突的外观。这种偏差导致主题词在交叉注意（CA）计算期间优先激活先验视图特征，而不管目标视图条件如何。为了克服这一限制，我们进行了全面的数学分析，以揭示 T2I 模型中先验偏差的根本原因。此外，我们发现不同的 UNet 层在 CA 中显示出不同的先验视图效果。因此，我们提出了一种新颖的框架TD-Attn，它通过两个关键组件解决多视图不一致问题：（1）3D感知注意力引导模块（3D-AAG）为主题词构建视图一致的3D注意力高斯，以强制跨注意力集中区域的空间一致性，从而补偿2D单独视图CA映射中有限的空间信息； (2) 分层注意力调制模块 (HAM) 利用语义指导树 (SGT) 指导语义响应分析器 (SRP) 定位和调制对视图条件高度响应的 CA 层，其中增强的 CA 映射进一步支持构建更一致的 3D 注意力高斯分布。值得注意的是，HAM 有助于特定语义的干预，从而实现可控且精确的 3D 编辑。大量实验证实 TD-Attn 有潜力作为通用插件，显着增强 3D 任务中的多视图一致性。

Title: MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Authors: Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07348
Pdf URL: https://arxiv.org/pdf/2512.07348
Copy Paste: [[2512.07348]] MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition(https://arxiv.org/abs/2512.07348)
Keywords: generation
Abstract: In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorize it into 7 representative tasks, curate a large-scale collection of high-quality source images, and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a large number of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
摘要：在可控图像生成中，从多个参考输入合成连贯且一致的图像，即多图像合成（MICo），仍然是一个具有挑战性的问题，部分原因是缺乏高质量的训练数据。为了弥补这一差距，我们对 MICo 进行了系统研究，将其分为 7 个有代表性的任务，并策划了大规模的高质量源图像集合并构建了多样化的 MICo 提示。利用强大的专有模型，我们合成了大量平衡的合成图像，然后进行人机交互过滤和细化，从而生成了 MICo-150K，这是一个具有身份一致性的 MICo 综合数据集。我们进一步构建了一个分解与重组 (De&Re) 子集，其中 11K 个现实世界的复杂图像被分解为组件并重新组合，从而实现真实和合成的组合。为了实现全面评估，我们构建了 MICo-Bench，每个任务包含 100 个案例和 300 个具有挑战性的 De&Re 案例，并进一步引入了专为 MICo 评估量身定制的新指标 Weighted-Ref-VIEScore。最后，我们在 MICo-150K 上微调多个模型，并在 MICo-Bench 上对其进行评估。结果表明，MICo-150K 可以有效装备不具备 MICo 功能的模型，并进一步增强具有现有技能的模型。值得注意的是，我们的基线模型 Qwen-MICo 经过 Qwen-Image-Edit 的微调，在 3 图像合成方面与 Qwen-Image-2509 相匹配，同时支持超出后者限制的任意多图像输入。我们的数据集、基准测试和基线共同为多图像合成的进一步研究提供了宝贵的资源。

Title: Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Zihan Zheng, Yuan Zhang, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07480
Pdf URL: https://arxiv.org/pdf/2512.07480
Copy Paste: [[2512.07480]] Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance(https://arxiv.org/abs/2512.07480)
Keywords: generation
Abstract: While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
摘要：虽然传统和神经视频编解码器 (NVC) 已经实现了卓越的率失真性能，但提高低比特率下的感知质量仍然具有挑战性。一些 NVC 包含感知或对抗性目标，但由于生成能力有限，仍然会受到伪影的影响，而其他 NVC 则利用预训练的扩散模型来提高质量，但代价是采样复杂性很高。为了克服这些挑战，我们提出了 S2VC，这是一种基于单步扩散的视频编解码器，它将条件编码框架与高效的单步扩散生成器集成在一起，能够以低比特率实现真实的重建，并降低采样成本。认识到单步扩散中语义调节的重要性，我们引入上下文语义指导来从缓冲特征中提取帧自适应语义。它用高效、细粒度的调节取代了文本字幕，从而提高了生成的真实性。此外，时间一致性指导被纳入扩散 U-Net，以强制跨帧的时间一致性并确保稳定的生成。大量实验表明，S2VC 提供了最先进的感知质量，与之前的感知方法相比平均节省了 52.73% 的比特率，强调了单步扩散实现高效、高质量视频压缩的前景。

Title: Materium: An Autoregressive Approach for Material Generation

Authors: Niklas Dobberstein, Jan Hamaekers
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2512.07486
Pdf URL: https://arxiv.org/pdf/2512.07486
Copy Paste: [[2512.07486]] Materium: An Autoregressive Approach for Material Generation(https://arxiv.org/abs/2512.07486)
Keywords: generation
Abstract: We present Materium: an autoregressive transformer for generating crystal structures that converts 3D material representations into token sequences. These sequences include elements with oxidation states, fractional coordinates and lattice parameters. Unlike diffusion approaches, which refine atomic positions iteratively through many denoising steps, Materium places atoms at precise fractional coordinates, enabling fast, scalable generation. With this design, the model can be trained in a few hours on a single GPU and generate samples much faster on GPUs and CPUs than diffusion-based approaches. The model was trained and evaluated using multiple properties as conditions, including fundamental properties, such as density and space group, as well as more practical targets, such as band gap and magnetic density. In both single and combined conditions, the model performs consistently well, producing candidates that align with the requested inputs.
摘要：我们推出了 Materium：一种用于生成晶体结构的自回归转换器，可将 3D 材料表示转换为标记序列。这些序列包括具有氧化态、分数坐标和晶格参数的元素。与通过许多去噪步骤迭代地细化原子位置的扩散方法不同，Materium 将原子放置在精确的分数坐标上，从而实现快速、可扩展的生成。通过这种设计，模型可以在单个 GPU 上在几个小时内完成训练，并且在 GPU 和 CPU 上生成样本的速度比基于扩散的方法快得多。该模型使用多种属性作为条件进行训练和评估，包括密度和空间群等基本属性，以及带隙和磁密度等更实用的目标。在单一条件和组合条件下，该模型始终表现良好，产生与所请求的输入相符的候选对象。

Title: Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior

Authors: Chih-Chung Hsu, Shao-Ning Chen, Chia-Ming Lee, Yi-Fang Wang, Yi-Shiuan Chou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07498
Pdf URL: https://arxiv.org/pdf/2512.07498
Copy Paste: [[2512.07498]] Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior(https://arxiv.org/abs/2512.07498)
Keywords: generation
Abstract: Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
摘要：随着 DeepFake 的生成变得越来越真实且对检测的鲁棒性越来越强，确保视频内容的真实性仍然具有挑战性。大多数现有的检测器隐含地假设时间一致且干净的面部序列，这种假设在现实场景中很少成立，在现实场景中，压缩伪影、遮挡和对抗性攻击会破坏面部检测的稳定性，并经常导致无效或误检测的面部。为了应对这些挑战，我们提出了一种拉普拉斯正则化图卷积网络（LR-GCN），它可以从噪声或无序的面部序列中稳健地检测 DeepFakes，同时仅在干净的面部数据上进行训练。我们的方法构建了一个无序时间图嵌入（OF-TGE），它将逐帧 CNN 特征组织成基于语义亲和力的自适应稀疏图。与受严格时间连续性限制的传统方法不同，OF-TGE 捕获跨帧的内在特征一致性，使其能够适应混洗、丢失或严重损坏的输入。我们进一步对图结构和节点特征施加双层稀疏机制，以抑制无效面的影响。至关重要的是，我们引入了显式的图拉普拉斯谱先验，它充当图谱域中的高通算子，突出显示结构异常和伪造伪影，然后通过低通 GCN 聚合进行合并。这种顺序设计有效地实现了任务驱动的光谱带通机制，可抑制背景信息和随机噪声，同时保留操作线索。在 FF++、Celeb-DFv2 和 DFDC 上进行的大量实验表明，LR-GCN 实现了最先进的性能，并在严重的全局和局部干扰（包括面部缺失、遮挡和对抗性扰动的面部检测）下显着提高了鲁棒性。
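
The abstract above describes a graph Laplacian acting as a high-pass operator, followed by a low-pass GCN aggregation. The following is a minimal sketch of that idea on a toy graph, not the authors' LR-GCN: it assumes the unnormalized Laplacian L = D - A as the high-pass step and a plain mean over each node and its neighbors as the low-pass step. The graph `ADJ` and the signals `SMOOTH` and `SPIKY` are hypothetical.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def matvec(m, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def high_pass(adj, x):
    """L @ x: zero for constant signals, large where a node differs from its neighbors."""
    return matvec(laplacian(adj), x)


def low_pass(adj, x):
    """Mean over each node and its neighbors (self-loop included), a smoothing step."""
    n = len(adj)
    out = []
    for i in range(n):
        idx = [i] + [j for j in range(n) if adj[i][j] and j != i]
        out.append(sum(x[j] for j in idx) / len(idx))
    return out


# Hypothetical 4-node path graph 0-1-2-3 with one scalar feature per node (frame).
ADJ = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
SMOOTH = [1.0, 1.0, 1.0, 1.0]   # consistent features across frames
SPIKY = [1.0, 1.0, 5.0, 1.0]    # node 2 deviates from its neighbors

print(high_pass(ADJ, SMOOTH))   # constant signal is removed entirely
print(high_pass(ADJ, SPIKY))    # the deviation at node 2 is emphasized
print(low_pass(ADJ, high_pass(ADJ, SPIKY)))  # high-pass then low-pass
```

Applying the two steps in sequence is what gives the band-pass flavor the abstract mentions; the actual filters, sparsity mechanism, and learned weights in the paper are not reproduced here.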

Title: MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

Authors: Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07500
Pdf URL: https://arxiv.org/pdf/2512.07500
Copy Paste: [[2512.07500]] MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer(https://arxiv.org/abs/2512.07500)
Keywords: generation
Abstract: Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is in the supplementary material.
摘要：由于固有的运动纠缠和缺乏对象级控制，多对象视频运动传输对扩散变换器 (DiT) 架构提出了重大挑战。我们提出了 MultiMotion，这是一种克服这些限制的新颖的统一框架。我们的核心创新是 Maskaware Attention Motion Flow (AMF)，它利用 SAM2 掩模来明确地解开和控制 DiT 管道中多个对象的运动特征。此外，我们还引入了 RectPC，这是一种高阶预测校正求解器，可实现高效、准确的采样，特别有利于多实体生成。为了促进严格的评估，我们专门针对基于 DiT 的多对象运动传输构建了第一个基准数据集。 MultiMotion 明显实现了多个不同对象的精确、语义对齐和时间连贯的运动传输，保持了 DiT 的高质量和可扩展性。代码在支持中。

Title: SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation

Authors: Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07503
Pdf URL: https://arxiv.org/pdf/2512.07503
Copy Paste: [[2512.07503]] SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation(https://arxiv.org/abs/2512.07503)
Keywords: generation
Abstract: Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
摘要：大型自回归模型可以生成高质量、高分辨率的图像，但生成速度较慢，因为这些模型在推理过程中需要数百到数千次顺序前向传递来预测下一个标记。为了加速自回归文本到图像的生成，我们提出了推测雅可比解码++（SJD++），这是一种免训练的概率并行解码算法。与传统的下一个令牌预测不同，SJD++ 在每个前向传递中执行多令牌预测，大大减少了生成步骤。具体来说，它集成了雅可比解码的迭代多标记预测机制和推测采样的概率起草和验证机制。更重要的是，为了进一步加速，SJD++ 在每个验证阶段后重用高可信度草稿令牌，而不是全部重新采样。我们对几种代表性的自回归文本到图像生成模型进行了广泛的实验，并证明 SJD++ 实现了 $2\times$ 到 $3\times$ 推理延迟减少和 $2\times$ 到 $7\times$ 步长压缩，同时保持视觉质量，没有明显的下降。
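
The probabilistic drafting-and-verification mechanism that the abstract borrows from speculative sampling can be illustrated for a single token. This is a minimal sketch of the standard speculative-sampling acceptance rule, not the SJD++ algorithm itself: the function name `verify_token` and the toy distributions are hypothetical, and the reuse of high-confidence draft tokens described in the abstract is not modeled.

```python
import random


def verify_token(token, p, q, rng):
    """Speculative-sampling check for one drafted token.

    p: target distribution at this position (list of probabilities)
    q: distribution the draft token was sampled from (q[token] > 0)
    Returns (accepted, token). On rejection the token is resampled from
    the residual distribution max(0, p - q), which random.choices
    normalizes through its weights argument.
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    residual = [max(0.0, p_i - q_i) for p_i, q_i in zip(p, q)]
    new_token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, new_token


rng = random.Random(0)
target = [0.0, 1.0]   # hypothetical target distribution over 2 tokens
draft = [0.5, 0.5]    # hypothetical draft distribution

print(verify_token(1, target, draft, rng))  # target favors token 1: accepted
print(verify_token(0, target, draft, rng))  # target gives token 0 zero mass: replaced
```

Accepting with probability min(1, p/q) and resampling from the residual keeps the output distributed according to the target distribution, which is why this family of methods can skip sequential steps without changing what the model samples.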

Title: MeshRipple: Structured Autoregressive Generation of Artist-Meshes

Authors: Junkai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang, Tianle Guo, Yang Yang, Jianwen Li, Wenxiao Zhang, Matthias Nießner, Wei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07514
Pdf URL: https://arxiv.org/pdf/2512.07514
Copy Paste: [[2512.07514]] MeshRipple: Structured Autoregressive Generation of Artist-Meshes(https://arxiv.org/abs/2512.07514)
Keywords: generation
Abstract: Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
摘要：网格作为 3D 资源的主要表示形式。自回归网格生成器将面序列化为序列，并通过滑动窗口推理对截断的片段进行训练，以应对内存限制。然而，这种不匹配破坏了远程几何依赖性，产生孔洞和碎片组件。为了解决这个关键限制，我们引入了 MeshRipple，它将网格从活跃的生成边界向外扩展，类似于表面上的波纹。MeshRipple 依赖于三个关键创新：边界感知 BFS 标记化，将生成顺序与表面拓扑对齐；保持连贯、连接的表面生长的扩展预测策略；稀疏注意力全局内存可提供有效的无界感受野来解析远程拓扑依赖性。这种集成设计使 MeshRipple 能够生成具有高表面保真度和拓扑完整性的网格，其性能优于近期的强大基线。

Title: From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Authors: Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.07527
Pdf URL: https://arxiv.org/pdf/2512.07527
Copy Paste: [[2512.07527]] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images(https://arxiv.org/abs/2512.07527)
Keywords: restoration, generative
Abstract: City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.
摘要：从卫星图像进行城市规模的 3D 重建提出了极端视点外推的挑战，我们的目标是从视差最小的稀疏轨道图像中合成地面新颖的视图。这需要从具有严重缩短的立面和有缺陷的纹理的图像源推断近 $90^\circ$ 视点间隙，导致 NeRF 和 3DGS 等最先进的重建引擎失败。为了解决这个问题，我们提出了两种针对城市结构和卫星输入量身定制的设计选择。首先，我们将城市几何模型建模为 2.5D 高度图，并以 Z 单调符号距离场 (SDF) 的形式实现，从自上而下的角度与城市建筑布局相匹配。这可以在稀疏、偏离最低点的卫星视图下稳定几何优化，并产生具有清晰屋顶和干净、垂直挤压立面的防水网格。其次，我们通过可微分渲染技术从卫星图像中绘制网格外观。虽然卫星输入可能包含远距离、模糊的捕获，但我们进一步训练生成纹理恢复网络来增强外观，从降级的输入中恢复高频、可信的纹理细节。我们的方法的可扩展性和鲁棒性通过大规模城市重建的大量实验得到了证明。例如，在我们的预告图中，我们仅从几张卫星图像中重建了 $4\,\mathrm{km}^2$ 的现实世界区域，在合成逼真的地面视图方面实现了最先进的性能。生成的模型不仅在视觉上引人注目，而且还可以作为城市规划和模拟等下游任务的高保真、应用就绪资产。
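
The 2.5D height-map idea above can be made concrete with a small example. This is a minimal sketch, assuming the simplest field a height map induces, f(x, y, z) = z - h(x, y); it is a vertical-distance proxy rather than a true Euclidean signed distance, and it is not the authors' implementation. The sign convention (negative below the surface), the grid `HEIGHTS`, and the function names are assumptions.

```python
def height_field_value(height, x, y, z):
    """Signed vertical distance to a 2.5D height map.

    Negative below the surface (inside a building or the ground),
    zero on the surface, positive above it (free space).
    """
    return z - height[y][x]


def is_z_monotonic(height, x, y, zs):
    """Check that the field strictly increases along an ascending list of z values."""
    vals = [height_field_value(height, x, y, z) for z in zs]
    return all(a < b for a, b in zip(vals, vals[1:]))


# Hypothetical 2x3 grid of roof heights in metres (row index y, column index x).
HEIGHTS = [[0.0, 12.0, 12.0],
           [0.0, 30.0, 0.0]]

print(height_field_value(HEIGHTS, 1, 1, 10.0))  # inside the 30 m building
print(height_field_value(HEIGHTS, 1, 1, 30.0))  # exactly on the roof
print(is_z_monotonic(HEIGHTS, 1, 1, [0.0, 10.0, 30.0, 50.0]))
```

Because the field increases with z in every column, each (x, y) location crosses zero exactly once, so the extracted surface has no overhangs or floating pieces, which matches the top-down building layouts the abstract targets.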

Title: ReLaX: Reasoning with Latent Exploration for Large Reasoning Models

Authors: Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.07558
Pdf URL: https://arxiv.org/pdf/2512.07558
Copy Paste: [[2512.07558]] ReLaX: Reasoning with Latent Exploration for Large Reasoning Models(https://arxiv.org/abs/2512.07558)
Keywords: generation
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis and intervention of the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric to quantify the heterogeneity of the model's latent dynamics, serving as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
摘要：具有可验证奖励的强化学习（RLVR）最近在增强大型推理模型（LRM）的推理能力方面表现出了巨大的潜力。然而，RLVR 常常导致熵崩溃，导致策略过早收敛和性能饱和。虽然操纵令牌级熵已被证明对于促进策略探索有效，但我们认为令牌生成背后的潜在动态编码了更丰富的计算结构，用于引导策略优化走向更有效的探索-利用权衡。为了能够对 LRM 的潜在动态进行易于处理的分析和干预，我们利用库普曼算子理论来获得其隐藏状态动态的线性化表示。这使我们能够引入动态频谱色散（DSD），这是一种量化模型潜在动态异质性的新指标，可作为政策探索的直接指标。在此基础上，我们提出了潜在探索推理（ReLaX），这是一种明确结合潜在动态来调节策略优化期间探索和利用的范例。跨多种多模式和纯文本推理基准的综合实验表明，ReLaX 显着减轻了过早收敛，并始终实现最先进的性能。

Title: LongCat-Image Technical Report

Authors: Meituan LongCat Team: Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07584
Pdf URL: https://arxiv.org/pdf/2512.07584
Copy Paste: [[2512.07584]] LongCat-Image Technical Report(https://arxiv.org/abs/2512.07584)
Keywords: generation
Abstract: We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date. We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of the training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.
摘要：我们推出了LongCat-Image，这是一种开创性的开源双语（中英）图像生成基础模型，旨在解决当前领先模型中普遍存在的多语言文本渲染、真实感、部署效率和开发人员可访问性方面的核心挑战。 1）我们通过在训练前、训练中期和 SFT 阶段采用严格的数据管理策略来实现这一目标，并辅以在 RL 阶段协调使用管理奖励模型。该策略将该模型确立为新的最先进 (SOTA) 模型，提供卓越的文本渲染功能和卓越的照片级真实感，并显着提高美学质量。 2) 值得注意的是，它为汉字渲染制定了新的行业标准。通过支持复杂和罕见的字符，它在覆盖范围上优于主要的开源和商业解决方案，同时还实现了卓越的准确性。 3）该模型通过其紧凑的设计实现了显着的效率。它的核心扩散模型只有 6B 个参数，明显小于该领域常见的近 20B 或更大的专家混合 (MoE) 架构。这可确保最小的 VRAM 使用量和快速推理，从而显着降低部署成本。除了生成之外，LongCat-Image 在图像编辑方面也表现出色，在标准基准上取得了 SOTA 结果，与其他开源作品相比，具有卓越的编辑一致性。 4）为了充分赋能社区，我们建立了迄今为止最全面的开源生态系统。我们不仅发布了用于文本到图像和图像编辑的多个模型版本，包括训练中期和训练后阶段后的检查点，而且还发布了训练过程的整个工具链。我们相信LongCat-Image的开放性将为开发者和研究人员提供强有力的支持，推动视觉内容创作的前沿。

Title: MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation

Authors: Zhiqi Li, Wenhuan Li, Tengfei Wang, Zhenwei Wang, Junta Wu, Haoyuan Wang, Yunhan Yang, Zehuan Huang, Yang Li, Peidong Liu, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07628
Pdf URL: https://arxiv.org/pdf/2512.07628
Copy Paste: [[2512.07628]] MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation(https://arxiv.org/abs/2512.07628)
Keywords: generation, generative
Abstract: Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects the top-k relevant components for sparse global attention, and (2) unimportant-component compression that preserves contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: this https URL
摘要：组合性对于 3D 对象和场景生成至关重要，但现有的部件感知 3D 生成方法由于增加组件数量时的二次全局注意力成本而导致可扩展性较差。在这项工作中，我们提出了 MoCA，一种具有两个关键设计的组合 3D 生成模型：（1）基于重要性的组件路由，为稀疏全局注意力选择前 k 个相关组件；（2）不重要组件压缩，保留未选择组件的上下文先验，同时降低全局注意力的计算复杂性。通过这些设计，MoCA 能够通过可扩展数量的组件来创建高效、细粒度的组合 3D 资产。大量实验表明 MoCA 在组合对象和场景生成任务上都优于基线。项目页面：此 https URL
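
The two designs named in the abstract, top-k component routing and compression of the unselected components, can be sketched on plain lists. This is a minimal illustration under stated assumptions, not MoCA's implementation: the importance scores are given as inputs rather than learned, and mean pooling stands in for whatever compression the paper uses. The function names and the toy data are hypothetical.

```python
def route_components(scores, k):
    """Indices of the k highest-scoring components (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def build_context(tokens_per_component, scores, k):
    """Keep all tokens of the top-k components; replace every other
    component by one mean-pooled token (a stand-in for compression)."""
    selected = set(route_components(scores, k))
    context = []
    for i, tokens in enumerate(tokens_per_component):
        if i in selected:
            context.extend(tokens)
        else:
            dim = len(tokens[0])
            context.append([sum(t[d] for t in tokens) / len(tokens)
                            for d in range(dim)])
    return context


# Hypothetical scene: 3 components, 2 tokens each, 2 features per token.
TOKENS = [[[1.0, 0.0], [3.0, 2.0]],
          [[5.0, 5.0], [6.0, 6.0]],
          [[0.0, 0.0], [2.0, 4.0]]]
SCORES = [0.1, 0.9, 0.5]

context = build_context(TOKENS, SCORES, k=1)
print(len(context))  # 2 full tokens from the selected component plus 2 pooled tokens
```

With N components of T tokens each, attending over this context costs on the order of k*T + (N - k) keys per query instead of N*T, which is the scalability argument the abstract makes.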

Title: Optimization-Guided Diffusion for Interactive Scene Generation

Authors: Shiaho Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07661
Pdf URL: https://arxiv.org/pdf/2512.07661
Copy Paste: [[2512.07661]] Optimization-Guided Diffusion for Interactive Scene Generation(https://arxiv.org/abs/2512.07661)
Keywords: generation
Abstract: Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
摘要：真实且多样化的多智能体驾驶场景对于评估自动驾驶汽车至关重要，但对于这项任务至关重要的安全关键事件在驾驶数据集中很少见且代表性不足。数据驱动的场景生成通过从现有驾驶日志中合成复杂的交通行为，提供了一种低成本的替代方案。然而，现有模型通常缺乏可控性或产生违反物理或社会约束的样本，从而限制了其可用性。我们提出了 OMEGA，这是一种优化引导、免训练的框架，可在场景生成模型的基于扩散的采样过程中增强结构一致性和交互意识。 OMEGA 通过约束优化重新锚定每个反向扩散步骤，引导生成过程朝着物理上合理且行为上一致的轨迹发展。在此框架的基础上，我们将自我攻击者交互制定为分布空间中的博弈论优化，近似纳什均衡以生成现实的、安全关键的对抗场景。 nuPlan 和 Waymo 上的实验表明，OMEGA 提高了生成的真实性、一致性和可控性，将自由探索能力的物理和行为有效场景的比例从 32.35% 提高到 72.27%，将注重可控性的生成从 11% 提高到 80%。我们的方法还可以生成多 5 倍的接近碰撞帧，碰撞时间低于三秒，同时保持整体场景的真实感。

Title: Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Authors: Sangha Park, Eunji Kim, Yeongtak Oh, Jooyoung Choi, Sungroh Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07702
Pdf URL: https://arxiv.org/pdf/2512.07702
Copy Paste: [[2512.07702]] Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment(https://arxiv.org/abs/2512.07702)
Keywords: generation
Abstract: Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt's alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at this https URL.
摘要：尽管在文本到图像生成方面取得了实质性进展，但实现精确的文本图像对齐仍然具有挑战性，特别是对于具有丰富构图结构或富有想象力的元素的提示。为了解决这个问题，我们引入了图像校正的负面提示 (NPC)，这是一种自动化管道，可通过识别和应用抑制意外内容的负面提示来改进对齐。我们首先分析交叉注意力模式，以解释为什么目标否定（与提示的对齐错误直接相关的否定）和非目标否定（与提示无关但存在于生成的图像中的标记）都可以增强对齐。为了发现有用的底片，NPC 使用验证者-标题者-提议者框架生成候选提示，并使用显着的文本空间分数对它们进行排名，从而无需额外的图像合成即可进行有效的选择。在 GenEval++ 和 Imagine-Bench 上，NPC 的性能优于强大的基线，在 GenEval++ 上实现 0.571 对比 0.371，并且在 Imagine-Bench 上获得最佳整体性能。通过指导不生成什么，NPC 提供了一种原则性的、完全自动化的途径，以在扩散模型中实现更强的文本-图像对齐。代码在此 https URL 发布。
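
The abstract does not say how a selected negative prompt enters the sampler. A common convention in diffusion pipelines is to use the negative-prompt noise prediction in place of the unconditional one in classifier-free guidance, and the sketch below assumes that convention. It is not taken from the NPC paper; `guided_noise` and the toy vectors are hypothetical.

```python
def guided_noise(eps_prompt, eps_negative, scale):
    """Classifier-free-guidance combination with the negative-prompt
    prediction used in place of the unconditional prediction:
    eps = eps_negative + scale * (eps_prompt - eps_negative)."""
    return [n + scale * (p - n) for p, n in zip(eps_prompt, eps_negative)]


# Hypothetical 2-dimensional noise predictions for one denoising step.
eps_prompt = [1.0, 0.0]     # prediction conditioned on the user prompt
eps_negative = [0.0, 0.0]   # prediction conditioned on the negative prompt

print(guided_noise(eps_prompt, eps_negative, 7.5))
```

With a scale above 1, each step moves toward the prompt prediction and away from the negative-prompt prediction, which is the sense in which a negative prompt suppresses unintended content.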

Title: ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Authors: Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07720
Pdf URL: https://arxiv.org/pdf/2512.07720
Copy Paste: [[2512.07720]] ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation(https://arxiv.org/abs/2512.07720)
Keywords: generation, generative
Abstract: Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: this https URL
摘要：从单次输入图像生成高保真上半身 3D 头像仍然是一项重大挑战。当前的 3D 头像生成方法依赖于大型重建模型，速度快且能够生成稳定的身体结构，但它们经常会出现诸如模糊纹理和僵硬、不自然的运动等伪影。相比之下，生成视频模型通过合成真实感和动态结果显示出有希望的性能，但它们经常与不稳定的行为作斗争，包括身体结构错误和身份漂移。为了解决这些局限性，我们提出了一种结合了两种范式优点的新颖方法。我们的框架采用 3D 重建模型来提供强大的结构和外观先验，这反过来又指导实时自回归视频扩散模型进行渲染。这一过程使模型能够实时合成高频、逼真的细节和流体动力学，有效减少纹理模糊和运动刚度，同时防止视频生成方法中常见的结构不一致。通过将 3D 重建的几何稳定性与视频模型的生成能力相结合，我们的方法可以生成具有逼真外观和动态、时间连贯运动的高保真数字化身。实验表明，与领先方法相比，我们的方法显着减少了伪影，并在视觉质量方面取得了显着改进，为游戏和虚拟现实等实时应用提供了强大而高效的解决方案。项目页面：此 https URL

Title: SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

Authors: Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07730
Pdf URL: https://arxiv.org/pdf/2512.07730
Copy Paste: [[2512.07730]] SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination(https://arxiv.org/abs/2512.07730)
Keywords: generation
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at this https URL.
摘要：尽管多模态大语言模型（MLLM）已经取得了长足的进步，但它们仍然容易受到语言先验和视觉信息丢失引起的物体幻觉的影响。为了解决这个问题，我们提出了 SAVE（稀疏自动编码器驱动的视觉信息增强），这是一个通过沿着稀疏自动编码器（SAE）潜在特征引导模型来减轻幻觉的框架。二元对象存在问答探针识别最能指示模型视觉信息处理的 SAE 特征，称为视觉理解特征。沿着这些已识别的特征引导模型可以增强基础视觉理解并有效减少幻觉。凭借其简单的设计，SAVE 在标准基准测试中优于最先进的免训练方法，在 CHAIR_S 上实现了 10%p 的改进，并在 POPE 和 MMHal-Bench 上实现了一致的增益。跨多个模型和层的广泛评估证实了我们方法的稳健性和普遍性。进一步的分析表明，沿着视觉理解特征进行引导可以抑制不确定物体标记的生成，并增加对图像标记的注意力，从而减轻幻觉。代码在此 https URL 发布。

Title: HLTCOE Evaluation Team at TREC 2025: VQA Track

Authors: Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07738
Pdf URL: https://arxiv.org/pdf/2512.07738
Copy Paste: [[2512.07738]] HLTCOE Evaluation Team at TREC 2025: VQA Track(https://arxiv.org/abs/2512.07738)
Keywords: generation, generative
Abstract: The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
摘要：HLTCOE 评估团队参与了 TREC VQA 的答案生成（AG）任务，为此我们开发了一个列表学习框架，旨在提高答案生成中的语义精度和排名一致性。给定视频-问题对，基本多模态模型首先生成多个候选答案，然后使用经过新颖的带有排名权重的蒙面指针交叉熵损失训练的模型对这些答案进行重新排名。该目标集成了基于指针的候选选择、排名相关的加权和词汇限制下的屏蔽交叉熵，从而实现稳定且可解释的列表优化。通过将生成模型与判别性排名结合起来，我们的方法可以生成连贯的、细粒度的答案列表。实验表明，准确性和排名稳定性不断提高，特别是对于需要时间推理和语义消歧的问题。

Title: DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving

Authors: Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07745
Pdf URL: https://arxiv.org/pdf/2512.07745
Copy Paste: [[2512.07745]] DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving(https://arxiv.org/abs/2512.07745)
Keywords: generative
Abstract: Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. Code and model will be available at this https URL
摘要：端到端自动驾驶的生成扩散模型经常遭受模式崩溃的影响，往往会产生保守且同质的行为。虽然 DiffusionDrive 采用代表不同驾驶意图的预定义锚点来划分动作空间并生成多样化的轨迹，但其对模仿学习的依赖缺乏足够的约束，导致在多样性和一致的高质量之间陷入困境。在这项工作中，我们提出了 DiffusionDriveV2，它利用强化学习来限制低质量模式并探索更好的轨迹。这显着提高了整体输出质量，同时保留了其核心高斯混合模型固有的多模态。首先，我们使用适合轨迹规划的尺度自适应乘性噪声来促进广泛的探索。其次，我们使用锚内 GRPO 来管理从单个锚生成的样本之间的优势估计，并使用锚间截断的 GRPO 来整合不同锚之间的全局视角，防止不同意图之间的不正确的优势比较（例如，转弯与直行），这可能导致进一步的模式崩溃。在使用对齐的 ResNet-34 主干网的闭环评估中，DiffusionDriveV2 在 NAVSIM v1 数据集上实现了 91.2 PDMS，在 NAVSIM v2 数据集上实现了 85.5 EPDMS，创下了新记录。进一步的实验验证了我们的方法解决了截断扩散模型的多样性和一致的高质量之间的困境，实现了最佳的权衡。代码和模型将在此 https URL 中提供
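
The intra-anchor GRPO step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard GRPO group normalization (reward minus group mean, divided by group standard deviation) applied separately to the samples of each anchor; the abstract does not give the exact formula, so the function name `intra_anchor_advantages`, the `eps` term, and the toy rewards are hypothetical. The inter-anchor truncated variant is not shown, since the abstract does not specify it.

```python
from collections import defaultdict
from statistics import mean, pstdev


def intra_anchor_advantages(anchor_ids, rewards, eps=1e-8):
    """Group-normalized advantages, computed separately per anchor.

    Assumption: each anchor's samples form one GRPO group, so a sample is
    only compared with other samples that share its driving intention.
    """
    # Collect the sample indices that belong to each anchor.
    groups = defaultdict(list)
    for idx, anchor in enumerate(anchor_ids):
        groups[anchor].append(idx)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        group_rewards = [rewards[i] for i in indices]
        mu = mean(group_rewards)
        sigma = pstdev(group_rewards)  # population std within the anchor
        for i in indices:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Toy example: two samples per anchor, hypothetical reward values.
anchors = ["turn", "turn", "straight", "straight"]
rewards = [0.2, 0.4, 0.8, 1.0]
print(intra_anchor_advantages(anchors, rewards))
```

In this toy case the better "turn" sample receives a positive advantage even though its raw reward is lower than both "straight" samples, which is the kind of cross-intention comparison the abstract says should be avoided.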

Title: Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

Authors: Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07747
Pdf URL: https://arxiv.org/pdf/2512.07747
Copy Paste: [[2512.07747]] Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation(https://arxiv.org/abs/2512.07747)
Keywords: generation, generative
Abstract: Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.
摘要：统一理解和生成是多模态学习中一个非常有吸引力的研究方向。存在两种方法：一种通过自回归范式训练变压器，另一种采用两阶段方案，连接预训练的理解和生成模型以进行对齐微调。前者需要大量的数据和计算资源，普通研究人员无法承受。尽管后者需要较低的培训成本，但现有的工作往往面临任务覆盖范围有限或生成质量较差的问题。这两种方法都缺乏解析输入元信息（如任务类型、图像分辨率、视频时长等）的能力，并且需要手动配置参数，繁琐且不智能。在本文中，我们提出了 Unison，它采用两阶段方案，同时很好地保留了预训练模型的功能。我们以极低的训练成本涵盖了各种多模态理解任务，包括文本、图像和视频理解，以及多样化的生成任务，例如文本到视觉的内容生成、编辑、可控生成和基于IP的参考生成。我们还使我们的模型能够自动解析用户意图，确定目标任务类型，并准确提取相应任务所需的元信息。这使得各种多模式任务完全自动化，无需人工干预。实验表明，在仅 50 万个训练样本和 50 个 GPU 小时的低成本设置下，我们的模型可以准确、自动地识别任务并提取相关参数，并在各种理解和生成任务中取得优异的性能。

Title: Distribution Matching Variational AutoEncoder

Authors: Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07778
Pdf URL: https://arxiv.org/pdf/2512.07778
Copy Paste: [[2512.07778]] Distribution Matching Variational AutoEncoder(https://arxiv.org/abs/2512.07778)
Keywords: generative
Abstract: Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at this https URL.
摘要：大多数视觉生成模型在应用扩散或自回归建模之前将图像压缩到潜在空间中。然而，诸如 VAE 和基础模型对齐编码器之类的现有方法隐式地约束了潜在空间，而没有显式地塑造其分布，从而不清楚哪种类型的分布最适合建模。我们引入了 Distribution-Matching VAE (DMVAE)，它通过分布匹配约束将编码器的潜在分布与任意参考分布显式对齐。这超出了传统 VAE 的高斯先验，能够与源自自监督特征、扩散噪声或其他先验分布的分布对齐。通过 DMVAE，我们可以系统地研究哪些潜在分布更有利于建模，我们发现 SSL 派生的分布在重建保真度和建模效率之间提供了出色的平衡，仅用 64 个训练周期就在 ImageNet 上达到了 gFID 等于 3.2。我们的结果表明，选择合适的潜在分布结构（通过分布级对齐实现），而不是依赖固定的先验，是弥合易于建模的潜在分布和高保真图像合成之间差距的关键。代码可在此 https URL 获取。

Title: OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Authors: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07802
Pdf URL: https://arxiv.org/pdf/2512.07802
Copy Paste: [[2512.07802]] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory(https://arxiv.org/abs/2512.07802)
Keywords: generation
Abstract: Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
摘要：现实世界视频中的故事讲述通常通过多个镜头展开——不连续但语义相关的剪辑共同传达连贯的叙述。然而，现有的多镜头视频生成（MSV）方法难以有效地模拟远程交叉镜头上下文，因为它们依赖于有限的时间窗口或单个关键帧条件，导致复杂叙事下的性能下降。在这项工作中，我们提出了 OneStory，支持全局而紧凑的跨镜头上下文建模，以实现一致且可扩展的叙事生成。 OneStory 将 MSV 重新定义为下一代镜头生成任务，实现自回归镜头合成，同时利用预训练的图像到视频 (I2V) 模型来实现强大的视觉调节。我们引入了两个关键模块：一个帧选择模块，它根据先前镜头中的信息帧构建语义相关的全局记忆；以及一个自适应调节器，它执行重要性引导的补丁化以生成用于直接调节的紧凑上下文。我们进一步策划了一个带有参考标题的高质量多镜头数据集，以反映现实世界的故事讲述模式，并在下一个镜头范例下设计有效的训练策略。 OneStory 根据我们精心策划的 60K 数据集上的预训练 I2V 模型进行了微调，在文本和图像条件设置中的各种复杂场景中实现了最先进的叙事连贯性，从而实现了可控且身临其境的长视频故事讲述。

Title: WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Authors: Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, Qixing Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07821
Pdf URL: https://arxiv.org/pdf/2512.07821
Copy Paste: [[2512.07821]] WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling(https://arxiv.org/abs/2512.07821)
Keywords: generation
Abstract: Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence metrics and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.
摘要：最近的视频生成器实现了惊人的照片级真实感，但在 3D 方面仍然存在根本性的不一致。我们推出了 WorldReel，一个原生时空一致的 4D 视频生成器。 WorldReel 联合生成 RGB 帧和 4D 场景表示，包括点图、相机轨迹和密集流映射，从而随着时间的推移实现连贯的几何和外观建模。我们的显式 4D 表示强制执行跨视点和动态内容持续存在的单个底层场景，即使在大型非刚性运动和显着的摄像机移动下，也能生成保持一致的视频。我们通过仔细结合合成数据和真实数据来训练 WorldReel：合成数据提供精确的 4D 监督（几何、运动和相机），而真实视频则提供视觉多样性和真实感。这种混合使 WorldReel 能够推广到野外镜头，同时保持强大的几何保真度。大量实验表明，WorldReel 为动态场景和移动摄像机的一致视频生成设定了新的最先进技术，与竞争方法相比，改进了几何一致性、运动连贯性的指标，并减少了观看时间伪影。我们相信 WorldReel 使视频生成更接近 4D 一致的世界建模，其中代理可以通过单一且稳定的时空表示来渲染、交互和推理场景。

Title: One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Authors: Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.07829
Pdf URL: https://arxiv.org/pdf/2512.07829
Copy Paste: [[2512.07829]] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation(https://arxiv.org/abs/2512.07829)
Keywords: generation, generative
Abstract: Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
摘要：视觉生成模型（例如扩散模型）通常在压缩的潜在空间中运行，以平衡训练效率和样本质量。与此同时，人们对利用高质量的预训练视觉表示越来越感兴趣，无论是在 VAE 内对齐它们还是直接在生成模型内对齐它们。然而，由于面向理解的特征和生成友好的潜在空间之间的根本不匹配，适应这种表示仍然具有挑战性。表示编码器受益于高维潜在，捕获屏蔽区域的不同假设，而生成模型则倾向于低维潜在，必须忠实地保留注入的噪声。这种差异导致之前的工作依赖于复杂的目标和架构。在这项工作中，我们提出了 FAE（特征自动编码器），这是一种简单而有效的框架，可将预先训练的视觉表示适应低维潜伏，适合使用少量的单个注意层进行生成，同时保留足够的信息用于重建和理解。关键是耦合两个独立的深度解码器：一个经过训练以重建原始特征空间，第二个将重建的特征作为图像生成的输入。 FAE 是通用的；它可以使用各种自监督编码器（例如 DINO、SigLIP）进行实例化，并插入两个不同的生成系列：扩散模型和标准化流。在类条件和文本到图像基准测试中，FAE 取得了强劲的性能。例如，在 ImageNet 256x256 上，我们使用 CFG 的扩散模型获得了接近最先进的 FID：1.29（800 epoch）和 1.70（80 epoch）。在没有 CFG 的情况下，FAE 达到了最先进的 FID 1.48（800 个 epoch）和 2.08（80 个 epoch），展示了高质量和快速的学习。

Title: UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07831
Pdf URL: https://arxiv.org/pdf/2512.07831
Copy Paste: [[2512.07831]] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation(https://arxiv.org/abs/2512.07831)
Keywords: generation
Abstract: Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: this https URL
摘要：最近的视频生成模型展示了令人印象深刻的合成能力，但仍然受到单一模态条件的限制，限制了它们对世界的整体理解。这是由于跨模态交互不足和综合世界知识表示的模态多样性有限。为了解决这些限制，我们引入了 UnityVideo，这是一个用于生成世界感知视频的统一框架，可以跨多种模式（分割掩模、人体骨骼、DensePose、光流和深度图）和训练范例进行联合学习。我们的方法具有两个核心组件：（1）动态噪声来统一异构训练范例，（2）具有上下文学习器的模态切换器，可以通过模块化参数和上下文学习实现统一处理。我们贡献了一个包含 130 万样本的大规模统一数据集。通过联合优化，UnityVideo 加速了收敛并显着增强了对未见数据的零样本泛化能力。我们证明 UnityVideo 实现了卓越的视频质量、一致性，并改善了与物理世界限制的一致性。代码和数据可以在以下位置找到：此 https URL

Title: Voxify3D: Pixel Art Meets Volumetric Rendering

Authors: Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.07834
Pdf URL: https://arxiv.org/pdf/2512.07834
Copy Paste: [[2512.07834]] Voxify3D: Pixel Art Meets Volumetric Rendering(https://arxiv.org/abs/2512.07834)
Keywords: generation
Abstract: Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: this https URL
摘要：体素艺术是一种广泛应用于游戏和数字媒体的独特风格，但由于几何抽象、语义保存和离散颜色一致性的相互冲突的要求，从 3D 网格自动生成仍然具有挑战性。现有的方法要么过度简化几何图形，要么无法实现像素精确、调色板受限的体素艺术美学。我们引入了 Voxify3D，这是一种可微的两阶段框架，将 3D 网格优化与 2D 像素艺术监督联系起来。我们的核心创新在于三个组件的协同集成：（1）正交像素艺术监督，消除透视失真以实现精确的体素像素对齐； (2) 基于补丁的 CLIP 对齐，可跨离散化级别保留语义； (3) 调色板约束的 Gumbel-Softmax 量化能够通过可控调色板策略对离散颜色空间进行可微分优化。这种集成解决了基本挑战：极端离散化下的语义保存、通过体积渲染实现像素艺术美学以及端到端离散优化。实验表明，在不同的字符和可控的抽象（2-8 种颜色，20x-50x 分辨率）上具有卓越的性能（37.12 CLIP-IQA，77.90% 用户偏好）。项目页面：此 https URL
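
The palette-constrained Gumbel-Softmax quantization named in the abstract can be sketched in a few lines. This is the generic Gumbel-Softmax relaxation applied to a fixed color palette, not the Voxify3D implementation: the function names `gumbel_softmax` and `soft_palette_color`, the temperature values, and the three-color palette are all assumptions made for illustration. The sketch uses plain Python, so it only shows the forward computation; in an autodiff framework the same weights would carry gradients back to the per-voxel logits.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample over palette entries (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_palette_color(logits, palette, tau=1.0, rng=random):
    """Blend palette colors with Gumbel-Softmax weights (one voxel)."""
    weights = gumbel_softmax(logits, tau, rng)
    return tuple(
        sum(w * color[channel] for w, color in zip(weights, palette))
        for channel in range(3)
    )


# Hypothetical 3-color palette and per-voxel logits.
palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
logits = [2.0, 0.5, -1.0]
print(soft_palette_color(logits, palette, tau=0.5, rng=random.Random(0)))
```

Lowering `tau` pushes the weights toward a one-hot choice, so the blended color approaches a single palette entry, while a higher `tau` gives smoother mixtures.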