2025-12-17

Title: Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution

Authors: Jacob Schnell, Aditya Makkar, Gunadi Gani, Aniket Srinivasan Ashok, Darren Lo, Mike Optis, Alexander Wong, Yuhao Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.13729
Pdf URL: https://arxiv.org/pdf/2512.13729
Copy Paste: [[2512.13729]] Composite Classifier-Free Guidance for Multi-Modal Conditioning in Wind Dynamics Super-Resolution(https://arxiv.org/abs/2512.13729)
Keywords: super-resolution
Abstract: Various weather modelling problems (e.g., weather forecasting, optimizing turbine placements, etc.) require ample access to high-resolution, highly accurate wind data. Acquiring such high-resolution wind data, however, remains a challenging and expensive endeavour. Traditional reconstruction approaches are typically either cost-effective or accurate, but not both. Deep learning methods, including diffusion models, have been proposed to resolve this trade-off by leveraging advances in natural image super-resolution. Wind data, however, is distinct from natural images, and wind super-resolvers often use upwards of 10 input channels, significantly more than the usual 3-channel RGB inputs in natural images. To better leverage a large number of conditioning variables in diffusion models, we present a generalization of classifier-free guidance (CFG) to multiple conditioning inputs. Our novel composite classifier-free guidance (CCFG) can be dropped into any pre-trained diffusion model trained with standard CFG dropout. We demonstrate that CCFG outputs are higher-fidelity than those from CFG on wind super-resolution tasks. We present WindDM, a diffusion model trained for industrial-scale wind dynamics reconstruction and leveraging CCFG. WindDM achieves state-of-the-art reconstruction quality among deep learning models and costs up to $1000\times$ less than classical methods.
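Note: The abstract does not give the CCFG formula, so the sketch below is not the authors' method. It shows standard single-condition CFG next to one commonly used way of extending it to several conditions, where each condition contributes its own weighted guidance term. The names `cfg`, `composite_cfg`, `eps_uncond`, `eps_conds`, and `weights` are hypothetical, and plain Python lists stand in for the noise-prediction tensors of a real diffusion model.

```python
from typing import List, Sequence

Vector = List[float]


def cfg(eps_uncond: Vector, eps_cond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance in its common form:
    eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def composite_cfg(
    eps_uncond: Vector,
    eps_conds: Sequence[Vector],
    weights: Sequence[float],
) -> Vector:
    """ASSUMED multi-condition generalization (not taken from the paper):
    eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond),
    where eps_cond_i is the prediction with only condition i kept."""
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per conditioning input")
    out = list(eps_uncond)
    for eps_i, w_i in zip(eps_conds, weights):
        for k, (u, c) in enumerate(zip(eps_uncond, eps_i)):
            out[k] += w_i * (c - u)
    return out


# With a single condition the composite form reduces to standard CFG.
uncond = [0.0, 1.0]
cond_a = [1.0, 3.0]
single = cfg(uncond, cond_a, 2.0)
combined = composite_cfg(uncond, [cond_a], [2.0])
```

In this form, a condition whose prediction equals the unconditional one adds nothing, which is why per-condition weights can be tuned independently. Whether CCFG uses this exact decomposition is not stated in the abstract.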

Title: STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning

Authors: Jie Qin, Jiancheng Huang, Limeng Qiao, Lin Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.13752
Pdf URL: https://arxiv.org/pdf/2512.13752
Copy Paste: [[2512.13752]] STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning(https://arxiv.org/abs/2512.13752)
Keywords: generation, generative
Abstract: Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified target for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the fundamental autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity VQ to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.

Title: Time-aware UNet and super-resolution deep residual networks for spatial downscaling

Authors: Mika Sipilä, Sabrina Maggio, Sandra De Iaco, Klaus Nordhausen, Monica Palma, Sara Taskinen
Subjects: cs.CV, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2512.13753
Pdf URL: https://arxiv.org/pdf/2512.13753
Copy Paste: [[2512.13753]] Time-aware UNet and super-resolution deep residual networks for spatial downscaling(https://arxiv.org/abs/2512.13753)
Keywords: super-resolution
Abstract: Satellite data of atmospheric pollutants are often available only at coarse spatial resolution, limiting their applicability in local-scale environmental analysis and decision-making. Spatial downscaling methods aim to transform the coarse satellite data into high-resolution fields. In this work, two widely used deep learning architectures, the super-resolution deep residual network (SRDRN) and the encoder-decoder-based UNet, are considered for spatial downscaling of tropospheric ozone. Both methods are extended with a lightweight temporal module, which encodes observation time using either sinusoidal or radial basis function (RBF) encoding, and fuses the temporal features with the spatial representations in the networks. The proposed time-aware extensions are evaluated against their baseline counterparts in a case study on ozone downscaling over Italy. The results suggest that, while only slightly increasing computational complexity, the temporal modules significantly improve downscaling performance and convergence speed.
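Note: The two temporal encodings named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's module: the function names, the `period`, `num_harmonics`, `centers`, and `width` parameters, and the choice of a 12-month cycle are all assumptions, and the fusion of these features with the spatial representations is not shown.

```python
import math
from typing import List, Sequence


def sinusoidal_encoding(t: float, period: float, num_harmonics: int) -> List[float]:
    """Encode time t as [sin(2*pi*k*t/period), cos(2*pi*k*t/period)]
    for k = 1..num_harmonics. The result is periodic in t."""
    feats: List[float] = []
    for k in range(1, num_harmonics + 1):
        angle = 2.0 * math.pi * k * t / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def rbf_encoding(t: float, centers: Sequence[float], width: float) -> List[float]:
    """Encode time t as Gaussian bumps exp(-(t - c)^2 / (2 * width^2)),
    one per center c. Unlike the sinusoidal form, this is not periodic."""
    return [math.exp(-((t - c) ** 2) / (2.0 * width ** 2)) for c in centers]


# Hypothetical usage: month index 3 in a 12-month cycle.
sin_feats = sinusoidal_encoding(3.0, period=12.0, num_harmonics=4)
rbf_feats = rbf_encoding(3.0, centers=[0.0, 3.0, 6.0, 9.0], width=1.5)
```

The sinusoidal form maps times one full period apart to the same vector, which suits seasonal signals. The RBF form responds most strongly near its centers and would need wrapped distances to be made periodic.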

Title: The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Authors: Subramanyam Sahoo, Jared Junkin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.13821
Pdf URL: https://arxiv.org/pdf/2512.13821
Copy Paste: [[2512.13821]] The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces(https://arxiv.org/abs/2512.13821)
Keywords: generation
Abstract: Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability -- adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a scalable, theoretically grounded approach to AI control for code generation tasks.

Title: MoLingo: Motion-Language Alignment for Text-to-Motion Generation

Authors: Yannan He, Garvita Tiwari, Xiaohan Zhang, Pankaj Bora, Tolga Birdal, Jan Eric Lenssen, Gerard Pons-Moll
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13840
Pdf URL: https://arxiv.org/pdf/2512.13840
Copy Paste: [[2512.13840]] MoLingo: Motion-Language Alignment for Text-to-Motion Generation(https://arxiv.org/abs/2512.13840)
Keywords: generation
Abstract: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.

Title: Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Authors: Wenda Li, Meng Wu, Sungmin Eum, Heesung Kwon, Qing Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13869
Pdf URL: https://arxiv.org/pdf/2512.13869
Copy Paste: [[2512.13869]] Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models(https://arxiv.org/abs/2512.13869)
Keywords: super-resolution
Abstract: Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to $+14.1$ improvement of mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at this https URL.

Title: SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Authors: Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13874
Pdf URL: https://arxiv.org/pdf/2512.13874
Copy Paste: [[2512.13874]] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning(https://arxiv.org/abs/2512.13874)
Keywords: generation
Abstract: As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.

Title: A Complete Guide to Spherical Equivariant Graph Transformers

Authors: Sophia Tang
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.13927
Pdf URL: https://arxiv.org/pdf/2512.13927
Copy Paste: [[2512.13927]] A Complete Guide to Spherical Equivariant Graph Transformers(https://arxiv.org/abs/2512.13927)
Keywords: generative
Abstract: Spherical equivariant graph neural networks (EGNNs) provide a principled framework for learning on three-dimensional molecular and biomolecular systems, where predictions must respect the rotational symmetries inherent in physics. These models extend traditional message-passing GNNs and Transformers by representing node and edge features as spherical tensors that transform under irreducible representations of the rotation group SO(3), ensuring that predictions change in physically meaningful ways under rotations of the input. This guide develops a complete, intuitive foundation for spherical equivariant modeling - from group representations and spherical harmonics, to tensor products, Clebsch-Gordan decomposition, and the construction of SO(3)-equivariant kernels. Building on this foundation, we construct the Tensor Field Network and SE(3)-Transformer architectures and explain how they perform equivariant message-passing and attention on geometric graphs. Through clear mathematical derivations and annotated code excerpts, this guide serves as a self-contained introduction for researchers and learners seeking to understand or implement spherical EGNNs for applications in chemistry, molecular property prediction, protein structure modeling, and generative modeling.
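Note: The equivariance property the abstract refers to can be checked numerically on the simplest case. For two ordinary 3D vectors (type l = 1 features), the dot product is the rotation-invariant (l = 0) part of their tensor product and the cross product is, up to a constant factor and basis convention, the l = 1 part. The sketch below is an illustration under those assumptions and is not code from the guide; every function name in it is hypothetical.

```python
import math
from typing import List, Tuple

Vec = List[float]
Mat = List[List[float]]


def rotation_z(theta: float) -> Mat:
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_x(theta: float) -> Mat:
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a: Mat, b: Mat) -> Mat:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m: Mat, v: Vec) -> Vec:
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def cross(a: Vec, b: Vec) -> Vec:
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def equivariance_errors(rot: Mat, a: Vec, b: Vec) -> Tuple[float, float]:
    """Return (vector error, scalar error): how far cross() is from commuting
    with the rotation, and how far dot() is from being unchanged by it."""
    ra, rb = matvec(rot, a), matvec(rot, b)
    rotate_then_combine = cross(ra, rb)
    combine_then_rotate = matvec(rot, cross(a, b))
    vec_err = max(abs(x - y) for x, y in zip(rotate_then_combine, combine_then_rotate))
    scalar_err = abs(dot(ra, rb) - dot(a, b))
    return vec_err, scalar_err


# A proper rotation built from two axis rotations (angles chosen arbitrarily).
R = matmul(rotation_z(0.7), rotation_x(-1.2))
vec_err, scalar_err = equivariance_errors(R, [1.0, -2.0, 0.5], [0.3, 0.4, -1.1])
print(vec_err, scalar_err)  # both should be at floating-point noise level
```

Higher-order features follow the same pattern, with Wigner D-matrices in place of the 3x3 rotation matrix and Clebsch-Gordan coefficients in place of the dot and cross products.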

Title: An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes

Authors: Alban Gauthier, Valentin Deschaintre, Alexandre Lanvin, Fredo Durand, Adrien Bousseau, George Drettakis
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.13950
Pdf URL: https://arxiv.org/pdf/2512.13950
Copy Paste: [[2512.13950]] An evaluation of SVBRDF Prediction from Generative Image Models for Appearance Modeling of 3D Scenes(https://arxiv.org/abs/2512.13950)
Keywords: generative
Abstract: Digital content creation is experiencing a profound change with the advent of deep generative models. For texturing, conditional image generators now allow the synthesis of realistic RGB images of a 3D scene that align with the geometry of that scene. For appearance modeling, SVBRDF prediction networks recover material parameters from RGB images. Combining these technologies allows us to quickly generate SVBRDF maps for multiple views of a 3D scene, which can be merged to form a SVBRDF texture atlas of that scene. In this paper, we analyze the challenges and opportunities for SVBRDF prediction in the context of such a fast appearance modeling pipeline. On the one hand, single-view SVBRDF predictions might suffer from multiview incoherence and yield inconsistent texture atlases. On the other hand, generated RGB images, and the different modalities on which they are conditioned, can provide additional information for SVBRDF estimation compared to photographs. We compare neural architectures and conditions to identify designs that achieve high accuracy and coherence. We find that, surprisingly, a standard UNet is competitive with more complex designs. Project page: this http URL

Title: From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation

Authors: Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13953
Pdf URL: https://arxiv.org/pdf/2512.13953
Copy Paste: [[2512.13953]] From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation(https://arxiv.org/abs/2512.13953)
Keywords: generation
Abstract: The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: this https URL.

Title: Repurposing 2D Diffusion Models for 3D Shape Completion

Authors: Yao He, Youngjoong Kwon, Tiange Xiang, Wenxiao Cai, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.13991
Pdf URL: https://arxiv.org/pdf/2512.13991
Copy Paste: [[2512.13991]] Repurposing 2D Diffusion Models for 3D Shape Completion(https://arxiv.org/abs/2512.13991)
Keywords: generative
Abstract: We present a framework that adapts 2D diffusion models for 3D shape completion from incomplete point clouds. While text-to-image diffusion models have achieved remarkable success with abundant 2D data, 3D diffusion models lag due to the scarcity of high-quality 3D datasets and a persistent modality gap between 3D inputs and 2D latent spaces. To overcome these limitations, we introduce the Shape Atlas, a compact 2D representation of 3D geometry that (1) enables full utilization of the generative power of pretrained 2D diffusion models, and (2) aligns the modalities between the conditional input and output spaces, allowing more effective conditioning. This unified 2D formulation facilitates learning from limited 3D data and produces high-quality, detail-preserving shape completions. We validate the effectiveness of our results on the PCN and ShapeNet-55 datasets. Additionally, we show the downstream application of creating artist-created meshes from our completed point clouds, further demonstrating the practicality of our method.

Title: Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14008
Pdf URL: https://arxiv.org/pdf/2512.14008
Copy Paste: [[2512.14008]] Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models(https://arxiv.org/abs/2512.14008)
Keywords: generation
Abstract: Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.

Title: EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment

Authors: Juseung Yun, Sunwoo Yu, Sumin Ha, Jonghyun Kim, Janghyeon Lee, Jongseong Jang, Soonyoung Lee
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.14019
Pdf URL: https://arxiv.org/pdf/2512.14019
Copy Paste: [[2512.14019]] EXAONE Path 2.5: Pathology Foundation Model with Multi-Omics Alignment(https://arxiv.org/abs/2512.14019)
Keywords: generation
Abstract: Cancer progression arises from interactions across multiple biological layers, especially beyond morphological and across molecular layers that remain invisible to image-only models. To capture this broader biological landscape, we present EXAONE Path 2.5, a pathology foundation model that jointly models histologic, genomic, epigenetic and transcriptomic modalities, producing an integrated patient representation that reflects tumor biology more comprehensively. Our approach incorporates three key components: (1) multimodal SigLIP loss enabling all-pairwise contrastive learning across heterogeneous modalities, (2) a fragment-aware rotary positional encoding (F-RoPE) module that preserves spatial structure and tissue-fragment topology in WSI, and (3) domain-specialized internal foundation models for both WSI and RNA-seq to provide biologically grounded embeddings for robust multimodal alignment. We evaluate EXAONE Path 2.5 against six leading pathology foundation models across two complementary benchmarks: an internal real-world clinical dataset and the Patho-Bench benchmark covering 80 tasks. Our framework demonstrates high data and parameter efficiency, achieving on-par performance with state-of-the-art foundation models on Patho-Bench while exhibiting the highest adaptability in the internal clinical setting. These results highlight the value of biologically informed multimodal design and underscore the potential of integrated genotype-to-phenotype modeling for next-generation precision oncology.

Title: FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

Authors: Kim Sung-Bin, Joohyun Chang, David Harwath, Tae-Hyun Oh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14056
Pdf URL: https://arxiv.org/pdf/2512.14056
Copy Paste: [[2512.14056]] FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling(https://arxiv.org/abs/2512.14056)
Keywords: generation
Abstract: Talking face editing and face generation have often been studied as distinct problems. In this work, we propose viewing both not as separate tasks but as subtasks of a unifying formulation, speech-conditional facial motion infilling. We explore facial motion infilling as a self-supervised pretext task that also serves as a unifying formulation of dynamic talking face synthesis. To instantiate this idea, we propose FacEDiT, a speech-conditional Diffusion Transformer trained with flow matching. Inspired by masked autoencoders, FacEDiT learns to synthesize masked facial motions conditioned on surrounding motions and speech. This formulation enables both localized generation and edits, such as substitution, insertion, and deletion, while ensuring seamless transitions with unedited regions. In addition, biased attention and temporal smoothness constraints enhance boundary continuity and lip synchronization. To address the lack of a standard editing benchmark, we introduce FacEDiTBench, the first dataset for talking face editing, featuring diverse edit types and lengths, along with new evaluation metrics. Extensive experiments validate that talking face editing and generation emerge as subtasks of speech-conditional motion infilling; FacEDiT produces accurate, speech-aligned facial edits with strong identity preservation and smooth visual continuity while generalizing effectively to talking face generation.

Title: Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution

Authors: Hao Chen, Junyang Chen, Jinshan Pan, Jiangxin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14061
Pdf URL: https://arxiv.org/pdf/2512.14061
Copy Paste: [[2512.14061]] Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution(https://arxiv.org/abs/2512.14061)
Keywords: super-resolution, generative
Abstract: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods with efficient one-step inference.
摘要：最近基于扩散的一步法在图像超分辨率领域取得了显着的进展，但它们仍然受到三个关键限制的限制：（1）低质量（LQ）输入的压缩编码造成的信息损失导致保真度性能较差； (2) 生成先验的区域区分性激活不足； (3)文本提示与其对应的语义区域之间的错位。为了解决这些限制，我们提出了 CODSR，一种用于图像超分辨率的可控单步扩散网络。首先，我们提出了一种 LQ 引导的特征调制模块，该模块利用来自 LQ 输入的原始未压缩信息为扩散过程提供高保真条件。然后，我们开发了一种区域自适应生成先验激活方法，以有效增强感知丰富度而不牺牲局部结构保真度。最后，我们采用文本匹配指导策略来充分利用文本提示的调节潜力。大量实验表明，与具有高效一步推理的最先进方法相比，CODSR 实现了卓越的感知质量和竞争保真度。

Title: SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

Authors: Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14068
Pdf URL: https://arxiv.org/pdf/2512.14068
Copy Paste: [[2512.14068]] SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding(https://arxiv.org/abs/2512.14068)
Keywords: generation
Abstract: Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
摘要：分块离散扩散在并行生成和因果依赖建模之间提供了有吸引力的平衡，使其成为视觉语言建模的有前途的支柱。然而，其实际采用受到训练成本高、收敛速度慢和不稳定的限制，迄今为止一直落后于强大的自回归（AR）基线。我们提出了 \textbf{SDAR-VL}，这是第一个将分块离散扩散系统应用于大规模视觉语言理解（VLU）的系统，以及 \emph{用于高效稳定训练的集成框架}。该框架统一了三个组件：（1）\textbf{异步分块噪声调度}，使每批内的监督多样化； (2) \textbf{有效掩模比缩放}，用于随机掩模下的无偏损失归一化； (3) \textbf{渐进式 Beta 噪声课程}，增加有效掩模覆盖率，同时保留腐败多样性。对 21 个单图像、多图像和视频基准的实验表明，与传统的块扩散相比，SDAR-VL 持续提高了\emph{训练效率}、\emph{收敛稳定性}和\emph{任务性能}。在此评估套件中，SDAR-VL 在基于扩散的视觉语言模型中树立了新的技术水平，并且在匹配设置下，匹配或超越强大的 AR 基线（例如 LLaVA-OneVision 以及全局扩散基线 LLaDA-V），将块式扩散建立为 VLU 的实用骨干。

Title: AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Authors: Sisi Dai, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14095
Pdf URL: https://arxiv.org/pdf/2512.14095
Copy Paste: [[2512.14095]] AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation(https://arxiv.org/abs/2512.14095)
Keywords: generation
Abstract: Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, the scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, interaction cues are minimally distilled during the generation process, restricting their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models beyond image diffusion models, advancing 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods with superior diversity and generalization.
摘要：尽管使用监督方法在文本驱动的 4D 人机交互 (HOI) 生成方面取得了重大进展，但可扩展性仍然受到大规模 4D HOI 数据集稀缺的限制。为了克服这个问题，最近的方法尝试使用预先训练的图像扩散模型进行零样本 4D HOI 生成。然而，交互线索在生成过程中很少被提炼，限制了它们在不同场景中的适用性。在本文中，我们提出了 AnchorHOI，这是一种新颖的框架，通过将视频扩散模型纳入图像扩散模型之外，彻底利用混合先验，从而推进 4D HOI 的生成。然而，利用此类先验直接优化高维 4D HOI 仍然具有挑战性，特别是对于人体姿势和组合运动。为了应对这一挑战，AnchorHOI 引入了一种基于锚点的先验蒸馏策略，该策略构建交互感知的锚点，然后利用它们在易于处理的两步过程中指导生成。具体来说，两个定制的锚点是为 4D HOI 生成而设计的：用于表达交互合成的锚点神经辐射场 (NeRF) 和用于真实运动合成的锚点关键点。大量实验表明，AnchorHOI 具有卓越的多样性和泛化性，优于以前的方法。

Title: OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration

Authors: Ruitong Sun, Tianze Yang, Wei Niu, Jin Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14096
Pdf URL: https://arxiv.org/pdf/2512.14096
Copy Paste: [[2512.14096]] OUSAC: Optimized Guidance Scheduling with Adaptive Caching for DiT Acceleration(https://arxiv.org/abs/2512.14096)
Keywords: generation
Abstract: Diffusion models have emerged as the dominant paradigm for high-quality image generation, yet their computational expense remains substantial due to iterative denoising. Classifier-Free Guidance (CFG) significantly enhances generation quality and controllability but doubles the computation by requiring both conditional and unconditional forward passes at every timestep. We present OUSAC (Optimized gUidance Scheduling with Adaptive Caching), a framework that accelerates diffusion transformers (DiT) through systematic optimization. Our key insight is that variable guidance scales enable sparse computation: adjusting scales at certain timesteps can compensate for skipping CFG at others, enabling both fewer total sampling steps and fewer CFG steps while maintaining quality. However, variable guidance patterns introduce denoising deviations that undermine standard caching methods, which assume constant CFG scales across steps. Moreover, different transformer blocks are affected at different levels under dynamic conditions. This paper develops a two-stage approach leveraging these insights. Stage-1 employs evolutionary algorithms to jointly optimize which timesteps to skip and what guidance scale to use, eliminating up to 82% of unconditional passes. Stage-2 introduces adaptive rank allocation that tailors calibration efforts per transformer block, maintaining caching effectiveness under variable guidance. Experiments demonstrate that OUSAC significantly outperforms state-of-the-art acceleration methods, achieving 53% computational savings with 15% quality improvement on DiT-XL/2 (ImageNet 512x512), 60% savings with 16.1% improvement on PixArt-alpha (MSCOCO), and 5x speedup on FLUX while improving CLIP Score over the 50-step baseline.
摘要：扩散模型已成为高质量图像生成的主导范例，但由于迭代去噪，其计算成本仍然很高。无分类器引导（CFG）显着提高了生成质量和可控性，但由于在每个时间步都需要条件和无条件前向传递，因此计算量增加了一倍。我们提出了 OUSAC（自适应缓存优化指导调度），这是一个通过系统优化加速扩散变压器 (DiT) 的框架。我们的主要见解是可变引导尺度可以实现稀疏计算：在某些时间步长调整尺度可以补偿在其他时间步长跳过 CFG，从而在保持质量的同时减少总采样步骤和 CFG 步骤。然而，可变引导模式会引入降噪偏差，从而破坏标准缓存方法，该方法假设跨步骤的 CFG 比例恒定。此外，不同的变压器块在动态条件下受到不同程度的影响。本文利用这些见解开发了一种两阶段方法。 Stage-1 采用进化算法来联合优化要跳过的时间步长以及要使用的引导尺度，从而消除高达 82% 的无条件通过。第 2 阶段引入了自适应排名分配，可定制每个变压器块的校准工作，在变量指导下保持缓存有效性。实验表明，OUSAC 的性能显着优于最先进的加速方法，在 DiT-XL/2 (ImageNet 512x512) 上实现了 53% 的计算节省，质量提高了 15%，在 PixArt-alpha (MSCOCO) 上实现了 60% 的计算节省和 16.1% 的改进，在 FLUX 上实现了 5 倍的加速，同时在 50 步基线上提高了 CLIP 分数。
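
The cost argument in the OUSAC abstract above (CFG needs a conditional and an unconditional forward pass at every timestep, so skipping CFG at some timesteps saves unconditional passes) can be illustrated with a minimal sketch. This is not the OUSAC algorithm: the function names `cfg_combine` and `count_forward_passes` are hypothetical, the standard CFG combination rule is assumed, and the choice of which steps to skip is passed in by hand rather than found by the evolutionary search the abstract describes.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def count_forward_passes(num_steps, skip_steps):
    """Count model evaluations over a sampling run.

    Every step needs one conditional pass. The extra unconditional pass
    is only needed on steps where CFG is actually applied, i.e. steps
    that are not listed in `skip_steps` (a hypothetical, hand-picked set).
    """
    skipped = set(skip_steps)
    return sum(1 if t in skipped else 2 for t in range(num_steps))


# With scale 1.0 the result equals the conditional prediction.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
full_cost = count_forward_passes(10, skip_steps=[])
sparse_cost = count_forward_passes(10, skip_steps=range(5))
```

In this toy accounting, skipping CFG on half of the steps removes half of the unconditional passes. The abstract's point is that adjusting the guidance scale on the remaining steps can compensate for the skipped ones.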

Title: ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models

Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14099
Pdf URL: https://arxiv.org/pdf/2512.14099
Copy Paste: [[2512.14099]] ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models(https://arxiv.org/abs/2512.14099)
Keywords: generation
Abstract: Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach to apply discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach enables progressive generation of multiple viewpoints through iterative token unmasking with text input. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the requirement for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion provides a viable and simple alternative to existing multi-view generation methods, ranking first on average across GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
摘要：由于难以保持不同视点的几何一致性，从单个图像和文本描述生成多视图图像仍然具有挑战性。现有方法通常依赖于 3D 感知架构或专门的扩散模型，这些模型需要大量的多视图训练数据和复杂的几何先验。在这项工作中，我们介绍了 ViewMask-1-to-3，这是一种将离散扩散模型应用于多视图图像生成的开创性方法。与在潜在空间中操作的连续扩散方法不同，ViewMask-1-to-3 将多视图合成表述为离散序列建模问题，其中每个视点都表示为通过 MAGVIT-v2 标记化获得的视觉标记。通过屏蔽标记预测来统一语言和视觉，我们的方法可以通过文本输入的迭代标记解锁来逐步生成多个观点。 ViewMask-1-to-3 通过简单的随机屏蔽与自注意力相结合实现了跨视图一致性，消除了对复杂 3D 几何约束或专门注意力架构的要求。我们的方法表明，离散扩散为现有的多视图生成方法提供了一种可行且简单的替代方案，在 PSNR、SSIM 和 LPIPS 方面在 GSO 和 3D-FUTURE 数据集中平均排名第一，同时保持了架构简单性。

Title: A First-Order Logic-Based Alternative to Reward Models in RLHF

Authors: Chunjin Jian, Xinhua Zhu
Subjects: cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2512.14100
Pdf URL: https://arxiv.org/pdf/2512.14100
Copy Paste: [[2512.14100]] A First-Order Logic-Based Alternative to Reward Models in RLHF(https://arxiv.org/abs/2512.14100)
Keywords: generation
Abstract: Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. However, the quality and stability of the trained reward model largely determine the final alignment performance. Existing approaches such as Proximal Policy Optimization (PPO) rely heavily on reward models to guide LLMs toward human-aligned behaviors. In this work, we propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling. Instead of relying on heuristic reward estimation, our method leverages formal logical consistency to steer model alignment with human preferences. Since real-world questions can be interpreted from multiple perspectives, to ensure that logic-based reinforcement learning does not cause model collapse, we introduce S-GRPO, a supervised variant of the GRPO framework. S-GRPO incorporates an additional supervised component and jointly optimizes the generation term, KL-divergence regularization, and label-based objective during training. Experimental results demonstrate that S-GRPO consistently outperforms standard supervised fine-tuning (SFT) in both performance and robustness. Furthermore, it extends existing preference-learning frameworks such as GRPO and DPO, offering a more flexible and task-adaptive approach to alignment training. Our code is available at this https URL.
摘要：人类反馈强化学习 (RLHF) 在使大型语言模型 (LLM) 与人类价值观和偏好保持一致方面发挥着至关重要的作用。然而，训练后的奖励模型的质量和稳定性很大程度上决定了最终的对齐性能。诸如近端策略优化（PPO）之类的现有方法在很大程度上依赖于奖励模型来引导法学硕士走向与人类一致的行为。在这项工作中，我们提出了一种基于逻辑相似性的奖励机制作为传统奖励模型的替代方案。我们的方法不依赖启发式奖励估计，而是利用形式逻辑一致性来引导模型与人类偏好保持一致。由于现实世界的问题可以从多个角度进行解释，为了确保基于逻辑的强化学习不会导致模型崩溃，我们引入了 S-GRPO，它是 GRPO 框架的有监督变体。 S-GRPO 结合了额外的监督组件，并在训练期间联合优化生成项、KL 散度正则化和基于标签的目标。实验结果表明，S-GRPO 在性能和鲁棒性方面始终优于标准监督微调（SFT）。此外，它扩展了现有的偏好学习框架，例如 GRPO 和 DPO，提供了一种更灵活和任务自适应的对齐训练方法。我们的代码可以在这个 https URL 上找到。

Title: MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction

Authors: Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14114
Pdf URL: https://arxiv.org/pdf/2512.14114
Copy Paste: [[2512.14114]] MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction(https://arxiv.org/abs/2512.14114)
Keywords: generative
Abstract: Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at this https URL.
摘要：文档图像增强和二值化通常在文档分析和识别任务之前执行，以提高光学字符识别 (OCR) 系统的效率和准确性。这是因为直接识别退化文档中的文本，特别是彩色图像中的文本，通常会导致识别性能不理想。为了解决这些问题，现有方法针对不同颜色通道训练独立的生成对抗网络（GAN）以消除阴影和噪声，从而促进高效的文本信息提取。然而，部署多个 GAN 会导致训练和推理时间较长。为了减少文档图像增强和二值化模型的训练和推理时间，我们提出了 MFE-GAN，这是一种基于 GAN 的高效框架，具有多尺度特征提取 (MFE)，它结合了 Haar 小波变换 (HWT) 和归一化来处理文档图像，然后将其输入 GAN 进行训练。此外，我们提出了新颖的生成器、鉴别器和损失函数来提高模型的性能，并进行消融研究来证明它们的有效性。 Benchmark、Nabuco 和 CMATERdb 数据集上的实验结果表明，所提出的 MFE-GAN 显着减少了总训练和推理时间，同时保持了与最先进 (SOTA) 方法相当的性能。这项工作的实现可以在 https URL 上找到。
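
The MFE-GAN abstract above mentions the Haar wavelet transformation (HWT) as a preprocessing step. As a rough illustration of what a single-level 2D Haar transform computes, here is a minimal pure-Python sketch. The function name `haar2d_level1` is hypothetical, the orthonormal scaling is one common convention, and the abstract does not say which subbands MFE-GAN keeps or how they are normalized.

```python
def haar2d_level1(image):
    """Single-level 2D Haar transform of a list-of-lists image.

    Each non-overlapping 2x2 block [[a, b], [c, d]] is mapped to four
    coefficients: an approximation value plus horizontal, vertical and
    diagonal detail values (orthonormal scaling, factor 1/2).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must both be even")
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)  # local average (low-pass)
            h_row.append((a - b + c - d) / 2)  # left-right difference
            v_row.append((a + b - c - d) / 2)  # top-bottom difference
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


approx, horiz, vert, diag = haar2d_level1([[1, 2], [3, 4]])
```

Each output subband has half the height and width of the input, which is one plausible reason a wavelet front end reduces the amount of data the networks process.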

Title: SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

Authors: Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14140
Pdf URL: https://arxiv.org/pdf/2512.14140
Copy Paste: [[2512.14140]] SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing(https://arxiv.org/abs/2512.14140)
Keywords: generation
Abstract: Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style-sensitive structure of line art while supporting both high-level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT-based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between instruction-guided edits and line-guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task-guided mixture-of-experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation. Extensive experiments show state-of-the-art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision.
摘要：草图编辑是数字插图的核心，但现有的图像编辑系统很难保留线条艺术的稀疏、风格敏感的结构，同时支持高级语义更改和精确的局部重画。我们推出了 SketchAssist，这是一款交互式草图绘制助手，它通过将指令引导的全局编辑与线条引导的区域重绘相结合来加速创作，同时保持不相关的区域和整体构图完好无损。为了大规模启用此助手，我们引入了一个可控的数据生成管道，该管道（i）从无属性的基础草图构建属性添加序列，（ii）通过跨序列采样形成多步骤编辑链，以及（iii）通过应用于不同草图的保留风格的属性删除模型来扩展风格覆盖范围。在此数据的基础上，SketchAssist 采用了统一的草图编辑框架，对基于 DiT 的编辑器进行了最小的更改。我们重新利用 RGB 通道对输入进行编码，从而在单个输入界面内实现指令引导编辑和线条引导重绘之间的无缝切换。为了进一步专业化跨模式的行为，我们将任务引导的专家组合集成到 LoRA 层中，通过文本和视觉提示进行路由，以提高语义可控性、结构保真度和风格保留。大量的实验显示了这两项任务的最先进的结果，与最近的基线相比，具有出色的指令依从性和风格/结构保留。我们的数据集和 SketchAssist 一起为草图创建和修改提供了实用、可控的助手。

Title: TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models

Authors: Hanning Chen, Keyu Man, Kevin Zhu, Chenguang Zhu, Haonan Li, Tongbo Luo, Xizhou Feng, Wei Sun, Sreen Tallam, Mohsen Imani, Partha Kanuparthy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14141
Pdf URL: https://arxiv.org/pdf/2512.14141
Copy Paste: [[2512.14141]] TorchTraceAP: A New Benchmark Dataset for Detecting Performance Anti-Patterns in Computer Vision Models(https://arxiv.org/abs/2512.14141)
Keywords: generation
Abstract: Identifying and addressing performance anti-patterns in machine learning (ML) models is critical for efficient training and inference, but it typically demands deep expertise spanning system infrastructure, ML models and kernel development. While large tech companies rely on dedicated ML infrastructure engineers to analyze torch traces and benchmarks, such resource-intensive workflows are largely inaccessible to computer vision researchers in general. Among the challenges, pinpointing problematic trace segments within lengthy execution traces remains the most time-consuming task, and is difficult to automate with current ML models, including LLMs. In this work, we present the first benchmark dataset specifically designed to evaluate and improve ML models' ability to detect anti-patterns in traces. Our dataset contains over 600 PyTorch traces from diverse computer vision models (classification, detection, segmentation, and generation) collected across multiple hardware platforms. We also propose a novel iterative approach: a lightweight ML model first detects trace segments with anti-patterns, followed by a large language model (LLM) for fine-grained classification and targeted feedback. Experimental results demonstrate that our method significantly outperforms unsupervised clustering and rule-based statistical techniques for detecting anti-pattern regions. Our method also effectively compensates for LLMs' limited context length and reasoning inefficiencies.
摘要：识别和解决机器学习 (ML) 模型中的性能反模式对于高效训练和推理至关重要，但它通常需要涵盖系统基础设施、ML 模型和内核开发的深厚专业知识。虽然大型科技公司依靠专门的机器学习基础设施工程师来分析火炬轨迹和基准，但一般计算机视觉研究人员基本上无法访问这种资源密集型工作流程。在这些挑战中，在冗长的执行跟踪中查明有问题的跟踪段仍然是最耗时的任务，并且很难使用当前的 ML 模型（包括法学硕士）实现自动化。在这项工作中，我们提出了第一个基准数据集，专门用于评估和提高机器学习模型检测痕迹中反模式的能力。我们的数据集包含来自跨多个硬件平台收集的各种计算机视觉模型分类、检测、分割和生成的 600 多个 PyTorch 跟踪。我们还提出了一种新颖的迭代方法：轻量级 ML 模型首先检测具有反模式的跟踪片段，然后使用大型语言模型 (LLM) 进行细粒度分类和有针对性的反馈。实验结果表明，我们的方法在检测反模式区域方面明显优于无监督聚类和基于规则的统计技术。我们的方法还有效地弥补了法学硕士有限的上下文长度和推理效率低下的问题。

Title: Random-Bridges as Stochastic Transports for Generative Models

Authors: Stefano Goria, Levent A. Mengütürk, Murat C. Mengütürk, Berkan Sesen
Subjects: cs.LG, math.PR
Abstract URL: https://arxiv.org/abs/2512.14190
Pdf URL: https://arxiv.org/pdf/2512.14190
Copy Paste: [[2512.14190]] Random-Bridges as Stochastic Transports for Generative Models(https://arxiv.org/abs/2512.14190)
Keywords: generation, generative
Abstract: This paper motivates the use of random-bridges -- stochastic processes conditioned to take target distributions at fixed timepoints -- in the realm of generative modelling. Herein, random-bridges can act as stochastic transports between two probability distributions when appropriately initialized, and can display either Markovian or non-Markovian, and either continuous, discontinuous or hybrid patterns depending on the driving process. We show how one can start from general probabilistic statements and then branch out into specific representations for learning and simulation algorithms in terms of information processing. Our empirical results, built on Gaussian random bridges, produce high-quality samples in significantly fewer steps compared to traditional approaches, while achieving competitive Fréchet inception distance scores. Our analysis provides evidence that the proposed framework is computationally cheap and suitable for high-speed generation tasks.
摘要：本文推动了在生成建模领域使用随机桥——在固定时间点获取目标分布的随机过程。这里，随机桥在适当初始化时可以充当两个概率分布之间的随机传输，并且可以显示马尔可夫或非马尔可夫，以及取决于驱动过程的连续、不连续或混合模式。我们展示了如何从一般概率陈述开始，然后扩展到信息处理方面的学习和模拟算法的特定表示。我们的实证结果建立在高斯随机桥的基础上，与传统方法相比，以明显更少的步骤生成高质量的样本，同时获得具有竞争力的 Frechet 起始距离分数。我们的分析提供的证据表明，所提出的框架计算成本低并且适合高速生成任务。
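
To make the idea of a process "conditioned to take target distributions at fixed timepoints" concrete, here is a minimal sketch of the simplest Gaussian case: a one-dimensional Brownian bridge pinned to a given endpoint. It is an illustration under stated assumptions, not the paper's algorithm. The function name `brownian_bridge_path` and its parameters are hypothetical, the driving process is assumed to be Brownian motion with constant volatility `sigma`, and in a generative setting the endpoint `target` would itself be drawn from the target distribution.

```python
import math
import random


def brownian_bridge_path(x0, target, n_steps, total_time=1.0, sigma=1.0, rng=None):
    """Simulate a Brownian bridge from x0 (time 0) to target (time total_time).

    Uses the exact one-step conditional law of the bridge: given the
    current value x with `remaining` steps left, the next value is
    Gaussian with
        mean = x + (target - x) / remaining
        var  = sigma^2 * dt * (remaining - 1) / remaining
    so the variance shrinks to zero on the final step.
    """
    rng = rng or random.Random()
    dt = total_time / n_steps
    path = [x0]
    x = x0
    for k in range(n_steps):
        remaining = n_steps - k
        mean = x + (target - x) / remaining
        std = sigma * math.sqrt(dt * (remaining - 1) / remaining)
        x = rng.gauss(mean, std)
        path.append(x)
    return path


path = brownian_bridge_path(0.0, 2.0, n_steps=50, rng=random.Random(0))
```

The path is random in between but lands on the target at the final time, which is the sense in which a bridge acts as a transport from the initial point to a sample of the target distribution.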

Title: DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Authors: Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.14217
Pdf URL: https://arxiv.org/pdf/2512.14217
Copy Paste: [[2512.14217]] DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos(https://arxiv.org/abs/2512.14217)
Keywords: generation
Abstract: Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated RGB and depth sequences to regress the robot's joint angles. Experiments on Bridge V2, Berkeley Autolab, and simulation benchmarks show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
摘要：视频扩散模型为具体人工智能提供了强大的现实世界模拟器，但机器人操作的可控性仍然有限。最近关于轨迹调节视频生成的工作解决了这一差距，但通常依赖于 2D 轨迹或单一模态调节，这限制了它们产生可控且一致的机器人演示的能力。我们提出了 DRAW2ACT，一种深度感知的轨迹条件视频生成框架，它从输入轨迹中提取多个正交表示，捕获深度、语义、形状和运动，并将它们注入扩散模型中。此外，我们建议联合生成空间对齐的 RGB 和深度视频，利用跨模态注意力机制和深度监督来增强时空一致性。最后，我们引入了一个以生成的 RGB 和深度序列为条件的多模态策略模型，以回归机器人的关节角度。 Bridge V2、Berkeley Autolab 和模拟基准测试表明，与现有基准相比，DRAW2ACT 实现了卓越的视觉保真度和一致性，同时产生了更高的操作成功率。

Title: Estimating problem difficulty without ground truth using Large Language Model comparisons

Authors: Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14220
Pdf URL: https://arxiv.org/pdf/2512.14220
Copy Paste: [[2512.14220]] Estimating problem difficulty without ground truth using Large Language Model comparisons(https://arxiv.org/abs/2512.14220)
Keywords: generation
Abstract: Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e., problems currently unsolvable by humans and LLMs, because they are not scalable, are time-consuming, and depend on ground truth. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes--construction, scale and dependence--identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Thirdly, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
摘要：大型语言模型 (LLM) 微调的最新进展显着提高了其在既定基准上的性能，强调了对日益困难的合成数据的需求。该数据生成流程中的关键步骤是估计问题难度的方法。当前的方法，例如人工校准或基于性能的评分，无法推广到分布外问题，即目前人类和法学硕士无法解决的问题，因为它们不可扩展、耗时且依赖于地面事实。因此，我们提出了一种估计问题难度的新方法，LLM比较，来解决这些限制。法学硕士执行成对难度比较，然后根据结果计算布拉德利-特里分数。为了验证我们的方法，我们首先提出一个概念框架，将现有方法放置在三个正交平面上——构造、规模和依赖性——确定度量需要占据哪些象限来对分布外问题进行评分。 LLM 比较自然地占据了所有理想的象限，作为连续和动态、与模型无关且独立于地面真实信息的第一个衡量标准。作为第二次验证，我们表明 LLM 比较与人类注释具有很强的一致性：Pearson $r \geq 0.80$ for $n=1876$。第三，我们表明 LLM 比较对幻觉具有鲁棒性，对于 $10\%$ 噪声注入，皮尔逊相关性下降不到 $6\%$。我们的工作代表着朝着取代耗时的人工注释和合成数据生成迈出的重要一步，并将成为课程设计、模型评估和人工智能辅助研究构思的重要驱动力。
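
The scoring step described in the abstract above (pairwise difficulty comparisons turned into Bradley-Terry scores) can be sketched as follows. This is a generic Bradley-Terry fit using the classic minorization-maximization update, not the authors' implementation. The function name `bradley_terry`, the `(harder, easier)` input format, and the iteration count are assumptions, and the comparisons would come from an LLM judge in the setting the abstract describes.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry scores from (harder, easier) outcome pairs.

    Under the model, P(i judged harder than j) = s_i / (s_i + s_j).
    Each iteration applies the MM update
        s_i <- wins_i / sum_j n_ij / (s_i + s_j)
    and renormalizes so the scores sum to 1. Items that never win
    are driven toward a score of 0.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    items = set()
    for harder, easier in comparisons:
        wins[harder] += 1
        pair_counts[frozenset((harder, easier))] += 1
        items.update((harder, easier))

    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (scores[i] + scores[j])
            # Keep the old score if the item has no usable comparisons.
            updated[i] = wins[i] / denom if denom > 0 else scores[i]
        total = sum(updated.values())
        scores = {item: value / total for item, value in updated.items()}
    return scores


# Problem "A" is judged harder than "B" in 3 of 4 comparisons.
two_item_scores = bradley_terry([("A", "B")] * 3 + [("B", "A")])
```

For two items the fitted score ratio matches the observed win ratio, so here "A" receives three times the score of "B". With more items, the scores place all problems on one continuous difficulty scale even if not every pair was compared.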

Title: OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving

Authors: Tao Tang, Enhui Ma, xia zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14225
Pdf URL: https://arxiv.org/pdf/2512.14225
Copy Paste: [[2512.14225]] OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving(https://arxiv.org/abs/2512.14225)
Keywords: generation, generative
Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation with multimodal consistency and flexible sensor adjustments.
摘要：自动驾驶取得了显着的进步，这在很大程度上是由广泛的现实世界数据收集推动的。然而，获取多样化和极端情况的数据仍然成本高昂且效率低下。通过合成真实的传感器数据，生成模型已成为一种有前景的解决方案。然而，现有方法主要关注单模态生成，导致多模态传感器数据效率低下和失调。为了应对这些挑战，我们提出了 OminiGen，它在统一的框架中生成对齐的多模式传感器数据。我们的方法利用共享鸟瞰 (BEV) 空间来统一多模态特征，并设计一种新颖的可泛化多模态重建方法 UAE，以联合解码 LiDAR 和多视图相机数据。 UAE通过体渲染实现多模态传感器解码，实现准确灵活的重建。此外，我们将扩散变压器 (DiT) 与 ControlNet 分支结合起来，以实现可控多模态传感器的生成。我们的综合实验表明，OminiGen 在统一多模态传感器数据生成方面实现了所需的性能，具有多模态一致性和灵活的传感器调整。

Title: Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Authors: Divyansh Pareek, Sewoong Oh, Simon S. Du
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.14230
Pdf URL: https://arxiv.org/pdf/2512.14230
Copy Paste: [[2512.14230]] Understanding the Gain from Data Filtering in Multimodal Contrastive Learning(https://arxiv.org/abs/2512.14230)
Keywords: generation
Abstract: The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $\eta\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\frac{1}{\eta \sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\frac{1}{\sqrt{\eta n}}$ in the large $\eta$ regime, and by $\frac{1}{\sqrt{n}}$ in the small $\eta$ regime.
摘要：现代多模态表示学习的成功依赖于互联网规模的数据集。由于大部分原始网络数据的质量较低，数据管理已成为培训流程中的关键步骤。使用经过训练的模型进行过滤（即基于教师的过滤）已成为一种成功的解决方案，利用预先训练的模型来计算质量分数。为了解释基于教师的过滤的实证成功，我们描述了标准双峰数据生成模型下过滤对比学习的性能。将 $\eta\in(0,1]$ 表示为 $n$ 配对样本中模态正确匹配的数据比例，我们利用线性对比学习设置来展示数据过滤的可证明的好处：$(i)$ 不进行过滤的误差上限和下限为 $\frac{1}{\eta \sqrt{n}}$，而 $(ii)$ 基于教师过滤的误差上限为 $\frac{1}{\sqrt{\eta n}}$ 在大 $\eta$ 区域中，以及 $\frac{1}{\sqrt{n}}$ 在小 $\eta$ 区域中。
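
As a quick worked comparison of the rates quoted in the abstract above, the following LaTeX sketch divides the unfiltered rate by each filtered upper bound. It takes the stated rates at face value and ignores constants and any lower-order factors, which the abstract does not give. Where the boundary between the "large $\eta$" and "small $\eta$" regimes lies is also not stated there.

```latex
\begin{align*}
\text{large } \eta:&\quad
  \frac{1/(\eta\sqrt{n})}{1/\sqrt{\eta n}}
  = \frac{\sqrt{\eta n}}{\eta\sqrt{n}}
  = \frac{1}{\sqrt{\eta}} \;\ge\; 1, \\
\text{small } \eta:&\quad
  \frac{1/(\eta\sqrt{n})}{1/\sqrt{n}}
  = \frac{1}{\eta} \;\ge\; 1,
  \qquad \eta \in (0, 1].
\end{align*}
```

For example, $\eta = 0.25$ gives a factor of $2$ under the large-$\eta$ formula and a factor of $4$ under the small-$\eta$ formula. Because the unfiltered error is bounded from below as well as above at the rate $\frac{1}{\eta \sqrt{n}}$, a smaller upper bound with filtering indicates a genuine gain, up to constants.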

Title: ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Authors: Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14234
Pdf URL: https://arxiv.org/pdf/2512.14234
Copy Paste: [[2512.14234]] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body(https://arxiv.org/abs/2512.14234)
Keywords: generation
Abstract: Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: this http URL
摘要：人类交流本质上是多模式和社交的：词语、韵律和肢体语言共同承载意图。然而，大多数现有系统将人类行为建模为翻译任务协同语音手势或文本到运动，将固定话语映射到运动剪辑，而不需要关于何时移动、做什么或如何适应多轮对话的代理决策。这导致了脆弱的时间安排、薄弱的社会基础以及碎片化的堆栈，其中语音、文本和动作是孤立地训练或推断的。我们引入了 ViBES（行为表达和同步中的声音），这是一种对话式 3D 代理，可以共同规划语言和动作，并执行对话条件下的身体动作。具体来说，ViBES 是一种语音语言行为 (SLB) 模型，具有混合模态专家 (MoME) 主干：针对语音、面部表情和身体运动的模态分区转换器专家。该模型通过模态硬路由处理交错的多模态令牌流（参数按专家分割），同时通过跨专家注意力共享信息。通过利用强大的预训练语音语言模型，代理支持混合主动交互：用户可以在对话中说出、键入或发出身体动作指令，并且系统公开用于流式响应的可控行为钩子。我们通过对话运动对齐和行为质量的自动指标进一步对多轮对话进行基准测试，并观察到与强大的共同语音和文本到运动基线相比的一致增益。 ViBES 超越了“语音调节运动生成”，转向了代理虚拟身体，其中语言、韵律和运动共同生成，从而实现了可控的、具有社交能力的 3D 交互。代码和数据将在以下位置提供：此 http URL

Title: 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation

Authors: Jimmie Kwok, Holger Caesar, Andras Palffy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14235
Pdf URL: https://arxiv.org/pdf/2512.14235
Copy Paste: [[2512.14235]] 4D-RaDiff: Latent Diffusion for 4D Radar Point Cloud Generation(https://arxiv.org/abs/2512.14235)
Keywords: generation
Abstract: Automotive radar has shown promising developments in environment perception due to its cost-effectiveness and robustness in adverse weather conditions. However, the limited availability of annotated radar data poses a significant challenge for advancing radar-based perception systems. To address this limitation, we propose a novel framework to generate 4D radar point clouds for training and evaluating object detectors. Unlike image-based diffusion, our method is designed to consider the sparsity and unique characteristics of radar point clouds by applying diffusion to a latent point cloud representation. Within this latent space, generation is controlled via conditioning at either the object or scene level. The proposed 4D-RaDiff converts unlabeled bounding boxes into high-quality radar annotations and transforms existing LiDAR point cloud data into realistic radar scenes. Experiments demonstrate that incorporating synthetic radar data from 4D-RaDiff as a data augmentation method during training consistently improves object detection performance compared to training on real data only. In addition, pre-training on our synthetic data reduces the amount of required annotated radar data by up to 90% while achieving comparable object detection performance.
摘要：汽车雷达由于其在恶劣天气条件下的成本效益和鲁棒性，在环境感知方面显示出了有希望的发展。然而，带注释的雷达数据的有限可用性对推进基于雷达的感知系统提出了重大挑战。为了解决这一限制，我们提出了一种新颖的框架来生成 4D 雷达点云，用于训练和评估目标检测器。与基于图像的扩散不同，我们的方法旨在通过将扩散应用于潜在点云表示来考虑雷达点云的稀疏性和独特特征。在这个潜在空间中，生成是通过对象或场景级别的调节来控制的。所提出的 4D-RaDiff 将未标记的边界框转换为高质量的雷达注释，并将现有的 LiDAR 点云数据转换为真实的雷达场景。实验表明，与仅在真实数据上进行训练相比，在训练期间将 4D-RaDiff 的合成雷达数据合并为数据增强方法能够持续提高目标检测性能。此外，对我们的合成数据进行预训练可将所需的带注释雷达数据量减少高达 90%，同时实现可比的物体检测性能。

Title: Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning

Authors: Salvatore Romano, Marco Grassia, Giuseppe Mangioni
Subjects: cs.LG, cs.AI, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2512.14241
Pdf URL: https://arxiv.org/pdf/2512.14241
Copy Paste: [[2512.14241]] Beyond MMD: Evaluating Graph Generative Models with Geometric Deep Learning(https://arxiv.org/abs/2512.14241)
Keywords: generation, generative
Abstract: Graph generation is a crucial task in many fields, including network science and bioinformatics, as it enables the creation of synthetic graphs that mimic the properties of real-world networks for various applications. Graph Generative Models (GGMs) have emerged as a promising solution to this problem, leveraging deep learning techniques to learn the underlying distribution of real-world graphs and generate new samples that closely resemble them. Examples include approaches based on Variational Auto-Encoders, Recurrent Neural Networks, and more recently, diffusion-based models. However, the main limitation often lies in the evaluation process, which typically relies on Maximum Mean Discrepancy (MMD) as a metric to assess the distribution of graph properties in the generated ensemble. This paper introduces a novel methodology for evaluating GGMs that overcomes the limitations of MMD, which we call RGM (Representation-aware Graph-generation Model evaluation). As a practical demonstration of our methodology, we present a comprehensive evaluation of two state-of-the-art Graph Generative Models: Graph Recurrent Attention Networks (GRAN) and Efficient and Degree-guided graph GEnerative model (EDGE). We investigate their performance in generating realistic graphs and compare them using a Geometric Deep Learning model trained on a custom dataset of synthetic and real-world graphs, specifically designed for graph classification tasks. Our findings reveal that while both models can generate graphs with certain topological properties, they exhibit significant limitations in preserving the structural characteristics that distinguish different graph domains. We also highlight the inadequacy of Maximum Mean Discrepancy as an evaluation metric for GGMs and suggest alternative approaches for future research.
摘要：图生成是包括网络科学和生物信息学在内的许多领域的一项关键任务，因为它可以创建模拟各种应用的现实世界网络特性的合成图。图生成模型（GGM）已成为解决此问题的一种有前途的解决方案，它利用深度学习技术来学习现实世界图的底层分布并生成与其非常相似的新样本。示例包括基于变分自动编码器、循环神经网络以及最近的基于扩散的模型的方法。然而，主要的限制通常在于评估过程，该过程通常依赖于最大平均差异（MMD）作为评估生成的集合中图属性分布的度量。本文介绍了一种新的评估 GGM 的方法，它克服了 MMD 的局限性，我们称之为 RGM（表示感知图生成模型评估）。作为我们方法论的实际演示，我们对两种最先进的图生成模型进行了全面评估：图循环注意力网络（GRAN）和高效且度引导的图生成模型（EDGE）。我们研究了它们在生成真实图形方面的性能，并使用在合成和真实世界图形的自定义数据集上训练的几何深度学习模型来比较它们，专门为图形分类任务而设计。我们的研究结果表明，虽然这两种模型都可以生成具有某些拓扑属性的图，但它们在保留区分不同图域的结构特征方面表现出显着的局限性。我们还强调了最大平均差异作为 GGM 评估指标的不足，并为未来的研究提出了替代方法。
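
For readers unfamiliar with the metric being critiqued, the following is a minimal sketch of the standard biased estimator of squared Maximum Mean Discrepancy with a Gaussian kernel. It is not the authors' RGM method and not their evaluation code; the function names, the kernel choice, and the idea of feeding it per-graph descriptor vectors (for example degree histograms) are assumptions made for illustration.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel between two equal-length feature vectors.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    # Biased (V-statistic) estimate of squared MMD between two samples.
    # xs, ys: lists of feature vectors, e.g. one descriptor per graph (assumed).
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical descriptors for "real" and "generated" graph ensembles.
real = [[0.0], [0.1]]
generated = [[5.0], [5.1]]
print(mmd_squared(real, real), mmd_squared(real, generated))
```

The value depends on the chosen descriptor and kernel bandwidth, which is one commonly cited reason an MMD score can be hard to interpret on its own.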

Title: FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting

Authors: Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14253
Pdf URL: https://arxiv.org/pdf/2512.14253
Copy Paste: [[2512.14253]] FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting(https://arxiv.org/abs/2512.14253)
Keywords: generative
Abstract: In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. By adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while remaining efficient, FLAME adopts a Normalization Flow based forecasting head, which can model arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.
摘要：在这项工作中，我们介绍了 FLAME，这是一系列极其轻量且功能强大的时间序列基础模型，它通过生成概率建模支持确定性和概率预测，从而确保效率和鲁棒性。 FLAME 利用 Legendre Memory 来实现强大的泛化能力。通过在编码和解码阶段采用勒让德内存的变体，即平移勒让德（LegT）和缩放勒让德（LegS），FLAME可以有效捕获数据中固有的归纳偏差，并做出有效的远程推理。为了在保持效率的同时提高概率预测的准确性，FLAME 采用了基于归一化流的预测头，它可以以生成方式对预测范围内的任意复杂分布进行建模。对 TSFM-Bench 和 ProbTS 等公认基准的综合实验证明了 FLAME 在确定性和概率预测任务上始终如一的最先进的零样本性能。

Title: Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

Authors: Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14273
Pdf URL: https://arxiv.org/pdf/2512.14273
Copy Paste: [[2512.14273]] Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in(https://arxiv.org/abs/2512.14273)
Keywords: generation
Abstract: Grounded video question answering (GVQA) aims to localize relevant temporal segments in videos and generate accurate answers to a given question; however, large video-language models (LVLMs) exhibit limited temporal awareness. Although existing approaches based on Group Relative Policy Optimization (GRPO) attempt to improve temporal grounding, they still struggle to faithfully ground their answers in the relevant video evidence, leading to temporal mislocalization and hallucinations. In this work, we present Zoom-Zero, a coarse-to-fine framework that first localizes query-relevant segments and then temporally zooms into the most salient frames for finer-grained visual verification. Our method addresses the limits of GRPO for the GVQA task with two key innovations: (i) a zoom-in accuracy reward that validates the fidelity of temporal grounding prediction and facilitates fine-grained visual verification on grounded frames; (ii) token-selective credit assignment, which attributes rewards to the tokens responsible for temporal localization or answer generation, mitigating GRPO's issue in handling multi-faceted reward signals. Our proposed method advances grounded video question answering, improving temporal grounding by 5.2% on NExT-GQA and 4.6% on ReXTime, while also enhancing average answer accuracy by 2.4%. Additionally, the coarse-to-fine zoom-in during inference further benefits long-form video understanding by preserving critical visual details without compromising global context, yielding an average improvement of 6.4% on long-video benchmarks.
摘要：扎根视频问答（GVQA）旨在定位视频中的相关时间片段并生成给定问题的准确答案；然而，大型视频语言模型（LVLM）表现出有限的时间意识。尽管基于组相对策略优化（GRPO）的现有方法试图改善时间基础，但它们仍然难以忠实地将答案建立在相关视频证据中，导致时间错误定位和幻觉。在这项工作中，我们提出了 Zoom-Zero，这是一个从粗到细的框架，它首先定位与查询相关的片段，然后暂时放大到最显着的帧以进行更细粒度的视觉验证。我们的方法通过两项关键创新解决了 GVQA 任务中 GRPO 的局限性：（i）放大精度奖励，可验证时间接地预测的保真度并促进对接地帧的细粒度视觉验证； (ii) 令牌选择性信用分配，将奖励分配给负责时间定位或答案生成的令牌，从而减轻 GRPO 在处理多方面奖励信号方面的问题。我们提出的方法改进了接地视频问答，在 NExT-GQA 上将时间接地提高了 5.2%，在 ReXTime 上提高了 4.6%，同时还将平均答案准确性提高了 2.4%。此外，推理过程中从粗到细的放大通过保留关键视觉细节而不影响全局上下文，进一步有利于长视频理解，在长视频基准上平均提高了 6.4%。
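
The abstract gives no formulas, so the following is only a sketch under stated assumptions. `group_relative_advantages` shows the group normalization commonly used in GRPO (each sampled response's reward is standardized within its group; implementations differ on population versus sample standard deviation). `token_advantages` is a purely hypothetical illustration of what routing separate advantages to grounding tokens and answer tokens could look like; its name, the role labels, and the routing rule are assumptions and are not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    # Standardize rewards within one group of sampled responses.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]

def token_advantages(token_roles, grounding_adv, answer_adv):
    # Hypothetical routing: each token receives the advantage of the
    # reward it is assumed to be responsible for.
    return [grounding_adv if role == "grounding" else answer_adv
            for role in token_roles]

# Two sampled responses scored by two separate (assumed) reward signals.
grounding_rewards = [1.0, 0.0]
answer_rewards = [0.0, 1.0]
g_adv = group_relative_advantages(grounding_rewards)
a_adv = group_relative_advantages(answer_rewards)
roles = ["grounding", "grounding", "answer"]
print(token_advantages(roles, g_adv[0], a_adv[0]))
```

In this toy case the first response grounds well but answers badly; summing the two rewards before normalization would give both responses the same total and hide that difference, which is the kind of blending the abstract refers to.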

Title: SS4D: Native 4D Generative Model via Structured Spacetime Latents

Authors: Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14284
Pdf URL: https://arxiv.org/pdf/2512.14284
Copy Paste: [[2512.14284]] SS4D: Native 4D Generative Model via Structured Spacetime Latents(https://arxiv.org/abs/2512.14284)
Keywords: generative
Abstract: We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion
摘要：我们提出了 SS4D，这是一种原生 4D 生成模型，可以直接从单目视频合成动态 3D 对象。与之前通过优化 3D 或视频生成模型来构建 4D 表示的方法不同，我们直接在 4D 数据上训练生成器，实现高保真度、时间连贯性和结构一致性。我们方法的核心是一组结构化时空潜伏的压缩集。具体来说，(1) 为了解决 4D 训练数据的稀缺问题，我们建立在预训练的单图像到 3D 模型的基础上，保持强大的空间一致性。 (2) 通过引入跨帧推理的专用时间层来强制执行时间一致性。 (3) 为了支持长视频序列的高效训练和推理，我们使用分解的 4D 卷积和时间下采样块沿时间轴压缩潜在序列。此外，我们采用精心设计的训练策略来增强针对遮挡的鲁棒性

Title: Dual Attention Guided Defense Against Malicious Edits

Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14333
Pdf URL: https://arxiv.org/pdf/2512.14333
Copy Paste: [[2512.14333]] Dual Attention Guided Defense Against Malicious Edits(https://arxiv.org/abs/2512.14333)
Keywords: generation
Abstract: Recent progress in text-to-image diffusion models has transformed image editing via text prompts, yet this also introduces significant ethical challenges from potential misuse in creating deceptive or harmful content. While current defenses seek to mitigate this risk by embedding imperceptible perturbations, their effectiveness is limited against malicious tampering. To address this issue, we propose a Dual Attention-Guided Noise Perturbation (DANP) immunization method that adds imperceptible perturbations to disrupt the model's semantic understanding and generation process. DANP functions over multiple timesteps to manipulate both cross-attention maps and the noise prediction process, using a dynamic threshold to generate masks that identify text-relevant and irrelevant regions. It then reduces attention in relevant areas while increasing it in irrelevant ones, thereby misguiding the edit towards incorrect regions and preserving the intended targets. Additionally, our method maximizes the discrepancy between the injected noise and the model's predicted noise to further interfere with the generation. By targeting both attention and noise prediction mechanisms, DANP exhibits impressive immunity against malicious edits, and extensive experiments confirm that our method achieves state-of-the-art performance.
摘要：文本到图像扩散模型的最新进展已经改变了通过文本提示进行图像编辑的方式，但这也带来了重大的道德挑战，因为可能被滥用来创建欺骗性或有害内容。虽然当前的防御措施试图通过嵌入难以察觉的扰动来减轻这种风险，但它们针对恶意篡改的有效性有限。为了解决这个问题，我们提出了一种双重注意力引导噪声扰动（DANP）免疫方法，该方法添加难以察觉的扰动来破坏模型的语义理解和生成过程。 DANP 在多个时间步上发挥作用，以操纵交叉注意力图和噪声预测过程，使用动态阈值生成识别文本相关和不相关区域的掩码。然后，它会减少对相关区域的关注，同时增加对不相关区域的关注，从而将编辑误导到不正确的区域并保留预期目标。此外，我们的方法最大化了注入噪声和模型预测噪声之间的差异，以进一步干扰生成。通过针对注意力和噪声预测机制，DANP 对恶意编辑表现出令人印象深刻的免疫力，并且大量实验证实我们的方法实现了最先进的性能。

Title: Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Authors: Jooyeol Yun, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14336
Pdf URL: https://arxiv.org/pdf/2512.14336
Copy Paste: [[2512.14336]] Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure(https://arxiv.org/abs/2512.14336)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
摘要：可扩展矢量图形 (SVG) 是现代网页设计的核心，随着 Web 环境变得越来越动态，对它们进行动画处理的需求也在不断增长。然而，尽管最近在代码生成和运动规划方面取得了进展，但自动化矢量图形动画对于视觉语言模型 (VLM) 来说仍然具有挑战性。 VLM 经常错误地处理 SVG，因为视觉上连贯的部分通常被分割成低级形状，而这些形状几乎无法指导哪些元素应该一起移动。在本文中，我们介绍了一个框架，该框架可以恢复可靠的 SVG 动画所需的语义结构，并揭示当前 VLM 系统忽略的缺失层。这是通过多个薄弱部分预测的统计聚合来实现的，使系统能够从噪声预测中稳定地推断语义。通过将 SVG 重新组织为语义组，我们的方法使 VLM 能够生成具有更高连贯性的动画。我们的实验证明了相对于现有方法的巨大进步，这表明语义恢复是解锁强大的 SVG 动画并支持 VLM 和矢量图形之间更可解释的交互的关键步骤。

Title: Towards Transferable Defense Against Malicious Image Edits

Authors: Jie Zhang, Shuai Dong, Shiguang Shan, Xilin Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2512.14341
Pdf URL: https://arxiv.org/pdf/2512.14341
Copy Paste: [[2512.14341]] Towards Transferable Defense Against Malicious Image Edits(https://arxiv.org/abs/2512.14341)
Keywords: generation
Abstract: Recent approaches employing imperceptible perturbations in input images have demonstrated promising potential to counter malicious manipulations in diffusion-based image editing systems. However, existing methods suffer from limited transferability in cross-model evaluations. To address this, we propose Transferable Defense Against Malicious Image Edits (TDAE), a novel bimodal framework that enhances image immunity against malicious edits through coordinated image-text optimization. Specifically, at the visual defense level, we introduce FlatGrad Defense Mechanism (FDM), which incorporates gradient regularization into the adversarial objective. By explicitly steering the perturbations toward flat minima, FDM amplifies immune robustness against unseen editing models. For textual enhancement protection, we propose an adversarial optimization paradigm named Dynamic Prompt Defense (DPD), which periodically refines text embeddings to align the editing outcomes of immunized images with those of the original images, then updates the images under optimized embeddings. Through iterative adversarial updates to diverse embeddings, DPD enforces the generation of immunized images that seek a broader set of immunity-enhancing features, thereby achieving cross-model transferability. Extensive experimental results demonstrate that our TDAE achieves state-of-the-art performance in mitigating malicious edits under both intra- and cross-model evaluations.
摘要：最近在输入图像中采用难以察觉的扰动的方法已经证明了对抗基于扩散的图像编辑系统中的恶意操纵的巨大潜力。然而，现有方法在跨模型评估中的可移植性有限。为了解决这个问题，我们提出了可传输的恶意图像编辑防御（TDAE），这是一种新颖的双模框架，可通过协调的图像文本优化来增强图像对恶意编辑的免疫力。具体来说，在视觉防御层面，我们引入了FlatGrad防御机制（FDM），它将梯度正则化纳入对抗目标中。通过明确地将扰动转向平坦最小值，FDM 增强了针对看不见的编辑模型的免疫鲁棒性。对于文本增强保护，我们提出了一种名为动态提示防御（DPD）的对抗性优化范例，它定期细化文本嵌入，以使免疫图像的编辑结果与原始图像的编辑结果保持一致，然后更新优化嵌入下的图像。通过对不同嵌入的迭代对抗性更新，DPD 强制生成免疫图像，寻求更广泛的免疫增强功能，从而实现跨模型可转移性。大量的实验结果表明，我们的 TDAE 在模型内和跨模型评估下在减少恶意编辑方面实现了最先进的性能。

Title: Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos

Authors: Le Jiang, Shaotong Zhu, Yedi Luo, Shayda Moezzi, Sarah Ostadabbas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14406
Pdf URL: https://arxiv.org/pdf/2512.14406
Copy Paste: [[2512.14406]] Broadening View Synthesis of Dynamic Scenes from Constrained Monocular Videos(https://arxiv.org/abs/2512.14406)
Keywords: generation
Abstract: In dynamic Neural Radiance Fields (NeRF) systems, state-of-the-art novel view synthesis methods often fail under significant viewpoint deviations, producing unstable and unrealistic renderings. To address this, we introduce Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF framework that leverages Gaussian splatting priors and a pseudo-ground-truth generation strategy to enable realistic synthesis under large-angle rotations. ExpanDyNeRF optimizes density and color features to improve scene reconstruction from challenging perspectives. We also present the Synthetic Dynamic Multiview (SynDM) dataset, the first synthetic multiview dataset for dynamic scenes with explicit side-view supervision, created using a custom GTA V-based rendering pipeline. Quantitative and qualitative results on SynDM and real-world datasets demonstrate that ExpanDyNeRF significantly outperforms existing dynamic NeRF methods in rendering fidelity under extreme viewpoint shifts. Further details are provided in the supplementary materials.
摘要：在动态神经辐射场 (NeRF) 系统中，最先进的新颖视图合成方法通常会在明显的视点偏差下失败，从而产生不稳定且不切实际的渲染。为了解决这个问题，我们引入了扩展动态 NeRF (ExpanDyNeRF)，这是一种单目 NeRF 框架，利用高斯泼溅先验和伪地面实况生成策略来实现大角度旋转下的真实合成。 ExpanDyNeRF 优化密度和颜色特征，以从具有挑战性的角度改进场景重建。我们还介绍了合成动态多视图 (SynDM) 数据集，这是第一个使用基于 GTA V 的自定义渲染管道创建的具有显式侧视图监督的动态场景的合成多视图数据集。 SynDM 和现实数据集上的定量和定性结果表明，ExpanDyNeRF 在极端视点变化下的渲染保真度方面显着优于现有的动态 NeRF 方法。补充材料中提供了更多详细信息。

Title: LCMem: A Universal Model for Robust Image Memorization Detection

Authors: Mischa Dombrowski, Felix Nützel, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14421
Pdf URL: https://arxiv.org/pdf/2512.14421
Copy Paste: [[2512.14421]] LCMem: A Universal Model for Robust Image Memorization Detection(https://arxiv.org/abs/2512.14421)
Keywords: generative
Abstract: Recent advances in generative image modeling have achieved visual realism sufficient to deceive human experts, yet their potential for privacy-preserving data sharing remains insufficiently understood. A central obstacle is the absence of reliable memorization detection mechanisms, limited quantitative evaluation, and poor generalization of existing privacy auditing methods across domains. To address this, we propose to view memorization detection as a unified problem at the intersection of re-identification and copy detection, whose complementary goals cover both identity consistency and augmentation-robust duplication, and introduce Latent Contrastive Memorization Network (LCMem), a cross-domain model evaluated jointly on both tasks. LCMem achieves this through a two-stage training strategy that first learns identity consistency before incorporating augmentation-robust copy detection. Across six benchmark datasets, LCMem achieves improvements of up to 16 percentage points on re-identification and 30 percentage points on copy detection, enabling substantially more reliable memorization detection at scale. Our results show that existing privacy filters provide limited performance and robustness, highlighting the need for stronger protection mechanisms. We show that LCMem sets a new standard for cross-domain privacy auditing, offering reliable and scalable memorization detection. Code and model are publicly available at this https URL.
摘要：生成图像建模的最新进展已经实现了足以欺骗人类专家的视觉真实感，但它们在保护隐私数据共享方面的潜力仍然没有得到充分的了解。一个主要障碍是缺乏可靠的记忆检测机制、有限的定量评估以及现有跨领域隐私审计方法的泛化性较差。为了解决这个问题，我们建议将记忆检测视为重新识别和复制检测交叉点的统一问题，其互补目标涵盖身份一致性和增强鲁棒复制，并引入潜在对比记忆网络（LCMem），这是一种在两项任务上联合评估的跨域模型。 LCMem 通过两阶段训练策略实现了这一目标，该策略首先学习身份一致性，然后再结合增强稳健的副本检测。在六个基准数据集上，LCMem 在重新识别方面实现了高达 16 个百分点的改进，在复制检测方面实现了 30 个百分点的改进，从而实现了大规模更可靠的记忆检测。我们的结果表明，现有的隐私过滤器提供的性能和鲁棒性有限，突出表明需要更强大的保护机制。我们证明 LCMem 为跨域隐私审计设定了新标准，提供可靠且可扩展的记忆检测。代码和模型可通过此 https URL 公开获取。

Title: Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging

Authors: Chang Cai, Hao Jiang, Xiaojun Yuan, Ying-Jun Angela Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14435
Pdf URL: https://arxiv.org/pdf/2512.14435
Copy Paste: [[2512.14435]] Score-Based Turbo Message Passing for Plug-and-Play Compressive Imaging(https://arxiv.org/abs/2512.14435)
Keywords: generative
Abstract: Message-passing algorithms have been adapted for compressive imaging by incorporating various off-the-shelf image denoisers. However, these denoisers rely largely on generic or hand-crafted priors and often fall short in accurately capturing the complex statistical structure of natural images. As a result, traditional plug-and-play (PnP) methods often lead to suboptimal reconstruction, especially in highly underdetermined regimes. Recently, score-based generative models have emerged as a powerful framework for accurately characterizing sophisticated image distribution. Yet, their direct use for posterior sampling typically incurs prohibitive computational complexity. In this paper, by exploiting the close connection between score-based generative modeling and empirical Bayes denoising, we devise a message-passing framework that integrates a score-based minimum mean-squared error (MMSE) denoiser for compressive image recovery. The resulting algorithm, named score-based turbo message passing (STMP), combines the fast convergence of message passing with the expressive power of score-based generative priors. For practical systems with quantized measurements, we further propose quantized STMP (Q-STMP), which augments STMP with a component-wise MMSE dequantization module. We demonstrate that the asymptotic performance of STMP and Q-STMP can be accurately predicted by a set of state-evolution (SE) equations. Experiments on the FFHQ dataset demonstrate that STMP strikes a significantly better performance-complexity tradeoff compared with competing baselines, and that Q-STMP remains robust even under 1-bit quantization. Remarkably, both STMP and Q-STMP typically converge within 10 iterations.
摘要：通过结合各种现成的图像降噪器，消息传递算法已适用于压缩成像。然而，这些降噪器很大程度上依赖于通用或手工制作的先验，并且通常无法准确捕捉自然图像的复杂统计结构。因此，传统的即插即用 (PnP) 方法通常会导致次优重建，尤其是在高度欠定的情况下。最近，基于分数的生成模型已成为准确表征复杂图像分布的强大框架。然而，它们直接用于后验采样通常会导致计算复杂性过高。在本文中，通过利用基于分数的生成模型和经验贝叶斯去噪之间的紧密联系，我们设计了一种消息传递框架，该框架集成了基于分数的最小均方误差（MMSE）去噪器，用于压缩图像恢复。由此产生的算法被称为基于分数的涡轮消息传递（STMP），它将消息传递的快速收敛性与基于分数的生成先验的表达能力结合起来。对于具有量化测量的实际系统，我们进一步提出了量化 STMP (Q-STMP)，它通过组件级 MMSE 反量化模块增强了 STMP。我们证明 STMP 和 Q-STMP 的渐近性能可以通过一组状态演化（SE）方程准确预测。 FFHQ 数据集上的实验表明，与竞争基线相比，STMP 实现了显着更好的性能复杂性权衡，并且即使在 1 位量化下，Q-STMP 仍然保持稳健。值得注意的是，STMP 和 Q-STMP 通常在 10 次迭代内收敛。
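
The "close connection between score-based generative modeling and empirical Bayes denoising" mentioned above is presumably Tweedie's formula: for an observation y = x + n with Gaussian noise of variance sigma^2, the MMSE estimate is E[x | y] = y + sigma^2 * d/dy log p(y). The sketch below checks this on a scalar Gaussian prior, where the marginal score has a closed form. It is not the STMP algorithm, and the function names are assumptions made for illustration.

```python
def gaussian_marginal_score(y, prior_var, noise_var):
    # Score d/dy log p(y) when x ~ N(0, prior_var) and y = x + noise,
    # so that y ~ N(0, prior_var + noise_var).
    return -y / (prior_var + noise_var)

def tweedie_denoise(y, score, noise_var):
    # Tweedie's formula: posterior mean from the noisy value and its score.
    return y + noise_var * score

y, prior_var, noise_var = 2.0, 3.0, 1.0
score = gaussian_marginal_score(y, prior_var, noise_var)
estimate = tweedie_denoise(y, score, noise_var)
# Closed-form posterior mean for the Gaussian case, for comparison.
closed_form = y * prior_var / (prior_var + noise_var)
print(estimate, closed_form)  # → 1.5 1.5
```

In a score-based model the closed-form score would be replaced by a learned network evaluated at the current noise level; the denoising step itself keeps the same form.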

Title: A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.14442
Pdf URL: https://arxiv.org/pdf/2512.14442
Copy Paste: [[2512.14442]] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning(https://arxiv.org/abs/2512.14442)
Keywords: generative
Abstract: Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
摘要：可供性预测根据语言指令识别对象上的交互区域，对于实体人工智能至关重要。流行的端到端模型将高级推理和低级基础耦合到单个整体管道中，并依赖于带注释的数据集的训练，这导致对新对象和未见过的环境的泛化能力较差。在本文中，我们提出了 A4-Agent，这是一种无需训练的代理框架，它将可供性预测解耦为三阶段管道，从而超越了这种范式。我们的框架在测试时协调专门的基础模型：（1）一个$\textbf{Dreamer}$，它采用生成模型来可视化$\textit{how}$交互的外观； (2) 一个 $\textbf{Thinker}$，利用大型视觉语言模型来决定 $\textit{what}$ 与之交互的对象部分； (3) 一个$\textbf{Spotter}$，它协调视觉基础模型以精确定位$\textit{where}$交互区域。通过利用预训练模型的互补优势，无需任何特定于任务的微调，我们的零样本框架在多个基准测试中显着优于最先进的监督方法，并展示了对现实世界设置的强大泛化能力。

Title: Improving Slow Transfer Predictions: Generative Methods Compared

Authors: Jacob Taegon Kim, Alex Sim, Kesheng Wu, Jinoh Kim
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2512.14522
Pdf URL: https://arxiv.org/pdf/2512.14522
Copy Paste: [[2512.14522]] Improving Slow Transfer Predictions: Generative Methods Compared(https://arxiv.org/abs/2512.14522)
Keywords: generative
Abstract: Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially sluggish transfers can be identified and selectively monitored, optimizing network usage and overall performance. A key bottleneck to improving the predictive power of machine learning (ML) models in this context is the issue of class imbalance. This project focuses on addressing the class imbalance problem to enhance the accuracy of performance predictions. In this study, we analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. Additionally, we adjust the class imbalance ratios in training datasets to evaluate their impact on model performance. While augmentation may improve performance, the improvement is not significant as the imbalance ratio increases. We conclude that even the most advanced techniques, such as CTGAN, do not significantly improve over simple stratified sampling.
摘要：监控数据传输性能是科学计算网络中的一项关键任务。通过在通信阶段早期预测性能，可以识别并选择性地监视潜在的缓慢传输，从而优化网络使用和整体性能。在这种情况下，提高机器学习（ML）模型预测能力的一个关键瓶颈是类别不平衡问题。该项目重点解决类别不平衡问题，以提高性能预测的准确性。在本研究中，我们分析和比较了各种增强策略，包括传统的过采样方法和生成技术。此外，我们调整训练数据集中的类别不平衡比率，以评估它们对模型性能的影响。虽然增强可以提高性能，但随着不平衡率的增加，性能并没有显着提高。我们的结论是，即使是最先进的技术，例如 CTGAN，也不会比简单的分层采样有显着改进。
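
As a point of reference for the "traditional oversampling" baseline named above, here is a minimal sketch of random oversampling to a chosen minority-to-majority ratio. It is an illustration only: the function name, the `target_ratio` parameter, and the toy data are assumptions, and the study's actual pipeline and datasets are not described in the abstract.

```python
import math
import random

def random_oversample(rows, labels, minority_label, target_ratio, seed=0):
    # Duplicate randomly chosen minority rows until
    # (#minority / #majority) reaches at least target_ratio.
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_count = len(labels) - len(minority_idx)
    needed = max(0, math.ceil(target_ratio * majority_count) - len(minority_idx))
    if needed and not minority_idx:
        raise ValueError("no minority samples to duplicate")
    extra = [rng.choice(minority_idx) for _ in range(needed)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: one "slow transfer" (label 1) among nine normal transfers.
rows = list(range(10))
labels = [1] + [0] * 9
new_rows, new_labels = random_oversample(rows, labels, 1, target_ratio=1.0)
print(len(new_rows), sum(new_labels))  # → 18 9
```

Because the duplicated rows carry no new information, this baseline mainly reweights the minority class, which is useful context for the abstract's finding that more elaborate generators did not clearly outperform simple sampling.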

Title: Synthetic Electrogram Generation with Variational Autoencoders for ECGI

Authors: Miriam Gutiérrez Fernández, Karen López-Linares, Carlos Fambuena Santos, María S. Guillem, Andreu M. Climent, Óscar Barquero Pérez
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.14537
Pdf URL: https://arxiv.org/pdf/2512.14537
Copy Paste: [[2512.14537]] Synthetic Electrogram Generation with Variational Autoencoders for ECGI(https://arxiv.org/abs/2512.14537)
Keywords: generation, generative
Abstract: Atrial fibrillation (AF) is the most prevalent sustained cardiac arrhythmia, and its clinical assessment requires accurate characterization of atrial electrical activity. Noninvasive electrocardiographic imaging (ECGI) combined with deep learning (DL) approaches for estimating intracardiac electrograms (EGMs) from body surface potentials (BSPMs) has shown promise, but progress is hindered by the limited availability of paired BSPM-EGM datasets. To address this limitation, we investigate variational autoencoders (VAEs) for the generation of synthetic multichannel atrial EGMs. Two models are proposed: a sinus rhythm-specific VAE (VAE-S) and a class-conditioned VAE (VAE-C) trained on both sinus rhythm and AF signals. Generated EGMs are evaluated using morphological, spectral, and distributional similarity metrics. VAE-S achieves higher fidelity with respect to in silico EGMs, while VAE-C enables rhythm-specific generation at the expense of reduced sinus reconstruction quality. As a proof of concept, the generated EGMs are used for data augmentation in a downstream noninvasive EGM reconstruction task, where moderate augmentation improves estimation performance. These results demonstrate the potential of VAE-based generative modeling to alleviate data scarcity and enhance deep learning-based ECGI pipelines.
摘要：心房颤动 (AF) 是最常见的持续性心律失常，其临床评估需要准确表征心房电活动。无创心电图成像 (ECGI) 与深度学习 (DL) 方法相结合，用于根据体表电位 (BSPM) 估计心内电图 (EGM)，已显示出希望，但由于配对 BSPM-EGM 数据集的可用性有限，进展受到阻碍。为了解决这一限制，我们研究了用于生成合成多通道心房 EGM 的变分自动编码器 (VAE)。提出了两种模型：窦性心律特异性 VAE (VAE-S) 和针对窦性心律和 AF 信号进行训练的类条件 VAE (VAE-C)。使用形态、光谱和分布相似性指标评估生成的 EGM。 VAE-S 相对于计算机 EGM 实现了更高的保真度，而 VAE-C 可以实现节律特定的生成，但代价是降低了窦重建质量。作为概念证明，生成的 EGM 用于下游无创 EGM 重建任务中的数据增强，其中适度增强可提高估计性能。这些结果证明了基于 VAE 的生成模型在缓解数据稀缺性和增强基于深度学习的 ECGI 管道方面的潜力。

Title: HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

Authors: Yifang Xu, Benxiang Zhai, Yunzhuo Sun, Ming Li, Yang Li, Sidan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14542
Pdf URL: https://arxiv.org/pdf/2512.14542
Copy Paste: [[2512.14542]] HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion(https://arxiv.org/abs/2512.14542)
Keywords: generation
Abstract: Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.

Title: TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration

Authors: Zhiwen Yang, Jiaju Zhang, Yang Yi, Jian Liang, Bingzheng Wei, Yan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14550
Pdf URL: https://arxiv.org/pdf/2512.14550
Copy Paste: [[2512.14550]] TAT: Task-Adaptive Transformer for All-in-One Medical Image Restoration(https://arxiv.org/abs/2512.14550)
Keywords: restoration, super-resolution, generation
Abstract: Medical image restoration (MedIR) aims to recover high-quality medical images from their low-quality counterparts. Recent advancements in MedIR have focused on All-in-One models capable of simultaneously addressing multiple different MedIR tasks. However, due to significant differences in both modality and degradation types, using a shared model for these diverse tasks requires careful consideration of two critical inter-task relationships: task interference, which occurs when conflicting gradient update directions arise across tasks on the same parameter, and task imbalance, which refers to uneven optimization caused by varying learning difficulties inherent to each task. To address these challenges, we propose a task-adaptive Transformer (TAT), a novel framework that dynamically adapts to different tasks through two key innovations. First, a task-adaptive weight generation strategy is introduced to mitigate task interference by generating task-specific weight parameters for each task, thereby eliminating potential gradient conflicts on shared weight parameters. Second, a task-adaptive loss balancing strategy is introduced to dynamically adjust loss weights based on task-specific learning difficulties, preventing task domination or undertraining. Extensive experiments demonstrate that our proposed TAT achieves state-of-the-art performance in three MedIR tasks--PET synthesis, CT denoising, and MRI super-resolution--both in task-specific and All-in-One settings. Code is available at this https URL.

Title: FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos

Authors: Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo, Guanbin Li, Rushi Lan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14601
Pdf URL: https://arxiv.org/pdf/2512.14601
Copy Paste: [[2512.14601]] FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos(https://arxiv.org/abs/2512.14601)
Keywords: generation
Abstract: In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.

Title: gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation

Authors: Alban Puech, Matteo Mazzonelli, Celia Cintas, Tamara R. Govindasamy, Mangaliso Mngomezulu, Jonas Weiss, Matteo Baù, Anna Varbella, François Mirallès, Kibaek Kim, Le Xie, Hendrik F. Hamann, Etienne Vos, Thomas Brunschwiler
Subjects: cs.LG, cs.AI, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2512.14658
Pdf URL: https://arxiv.org/pdf/2512.14658
Copy Paste: [[2512.14658]] gridfm-datakit-v1: A Python Library for Scalable and Realistic Power Flow and Optimal Power Flow Data Generation(https://arxiv.org/abs/2512.14658)
Keywords: generation
Abstract: We introduce gridfm-datakit-v1, a Python library for generating realistic and diverse Power Flow (PF) and Optimal Power Flow (OPF) datasets for training Machine Learning (ML) solvers. Existing datasets and libraries face three main challenges: (1) lack of realistic stochastic load and topology perturbations, limiting scenario diversity; (2) PF datasets are restricted to OPF-feasible points, hindering generalization of ML solvers to cases that violate operating limits (e.g., branch overloads or voltage violations); and (3) OPF datasets use fixed generator cost functions, limiting generalization across varying costs. gridfm-datakit addresses these challenges by: (1) combining global load scaling from real-world profiles with localized noise and supporting arbitrary N-k topology perturbations to create diverse yet realistic datasets; (2) generating PF samples beyond operating limits; and (3) producing OPF data with varying generator costs. It also scales efficiently to large grids (up to 10,000 buses). Comparisons with OPFData, OPF-Learn, PGLearn, and PF$\Delta$ are provided. Available on GitHub at this https URL under Apache 2.0 and via `pip install gridfm-datakit`.
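The load perturbation idea in this abstract (a global scaling factor taken from a load profile, combined with localized noise per bus) can be illustrated with a minimal sketch. This is not the gridfm-datakit API: the function name `perturb_loads`, the multiplicative Gaussian noise model, and all numeric values are assumptions chosen only for illustration.

```python
import random

def perturb_loads(base_loads, global_scale, noise_std, rng):
    """Scale every bus load by one global factor, then apply per-bus noise.

    Hypothetical helper: the multiplicative Gaussian noise model is an
    assumption, not something stated in the abstract.
    """
    perturbed = []
    for load in base_loads:
        # Local factor centred on 1.0; clipped at 0 so loads never go negative.
        local_factor = max(1.0 + rng.gauss(0.0, noise_std), 0.0)
        perturbed.append(load * global_scale * local_factor)
    return perturbed

rng = random.Random(0)            # fixed seed for reproducibility
base = [10.0, 25.0, 5.0]          # made-up nominal loads for three buses
profile = [0.6, 0.8, 1.0]         # made-up global scaling factors over time
scenarios = [perturb_loads(base, scale, 0.05, rng) for scale in profile]
```

Each entry of `scenarios` is one load scenario. Setting `noise_std` to zero reduces the sketch to pure global scaling.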

Title: VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

Authors: Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14677
Pdf URL: https://arxiv.org/pdf/2512.14677
Copy Paste: [[2512.14677]] VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image(https://arxiv.org/abs/2512.14677)
Keywords: generation
Abstract: We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization uses various training losses that are robust to artifacts and limited pose coverage in the generated training data. Our experiments show that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.

Title: Native and Compact Structured Latents for 3D Generation

Authors: Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14692
Pdf URL: https://arxiv.org/pdf/2512.14692
Copy Paste: [[2512.14692]] Native and Compact Structured Latents for 3D Generation(https://arxiv.org/abs/2512.14692)
Keywords: generation, generative
Abstract: Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

Title: Spherical Leech Quantization for Visual Tokenization and Generation

Authors: Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, Philipp Krähenbühl
Subjects: cs.CV, cs.AI, cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.14697
Pdf URL: https://arxiv.org/pdf/2512.14697
Copy Paste: [[2512.14697]] Spherical Leech Quantization for Visual Tokenization and Generation(https://arxiv.org/abs/2512.14697)
Keywords: generation
Abstract: Non-parametric quantization has received much attention due to its parameter efficiency and scalability to large codebooks. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among these, we find that the Leech lattice-based quantization method, dubbed Spherical Leech Quantization ($\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.
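For readers unfamiliar with the BSQ baseline named in this abstract, the sketch below illustrates lookup-free quantization on the hypersphere as binary spherical quantization is commonly described: normalize the latent to unit length, then snap each coordinate to ±1/√d. It is not the paper's Leech-lattice method ($\Lambda_{24}$-SQ), and the function names are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def binary_spherical_quantize(v):
    """Return (code, bits): the nearest vertex of the scaled hypercube and its bits.

    No codebook lookup is needed: each bit is just the sign of one coordinate.
    """
    u = normalize(v)
    scale = 1.0 / math.sqrt(len(u))           # keeps the code on the unit sphere
    bits = [1 if x >= 0 else 0 for x in u]    # one bit per latent dimension
    code = [scale if b else -scale for b in bits]
    return code, bits

code, bits = binary_spherical_quantize([0.3, -1.2, 0.5, 2.0])
```

The implicit codebook here has 2^d entries for a d-dimensional latent, all of unit norm. The abstract's lattice view replaces this hypercube-vertex set with points from other lattices.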

Title: MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Authors: Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.14699
Pdf URL: https://arxiv.org/pdf/2512.14699
Copy Paste: [[2512.14699]] MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives(https://arxiv.org/abs/2512.14699)
Keywords: generation
Abstract: The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we activate only the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computational burden (7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model with a KV cache.
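The prompt-driven memory update described in this abstract can be pictured as a top-k retrieval step. The sketch below is only an illustration under stated assumptions: it presumes each historical frame and the chunk's text prompt already have embeddings in a shared space, scores them by cosine similarity, and keeps the k best. The scoring function, the names `cosine` and `retrieve_memory`, and the toy vectors are hypothetical and not taken from MemFlow.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_memory(frame_embeddings, prompt_embedding, k):
    """Return indices of the k frames most similar to the prompt, in temporal order."""
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], prompt_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])  # re-sort so the memory bank keeps frame order

# Toy 2-D embeddings for four historical frames and one chunk prompt.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
prompt = [1.0, 0.0]
memory_indices = retrieve_memory(frames, prompt, k=2)
```

Because retrieval is keyed on the upcoming chunk's prompt, a change of scene in the prompt changes which historical frames enter the memory bank, which is the behaviour the abstract attributes to the dynamic update.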