2025-12-18

Title: LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts

Authors: Krunal Jesani, Dmitry Ignatov, Radu Timofte
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.14706
Pdf URL: https://arxiv.org/pdf/2512.14706
Copy Paste: [[2512.14706]] LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts(https://arxiv.org/abs/2512.14706)
Keywords: generation
Abstract: Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR's classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.

Title: How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection

Authors: Zafaryab Haider, Md Hafizur Rahman, Shane Moeykens, Vijay Devabhaktuni, Prabuddha Chakraborty
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14715
Pdf URL: https://arxiv.org/pdf/2512.14715
Copy Paste: [[2512.14715]] How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection(https://arxiv.org/abs/2512.14715)
Keywords: generative
Abstract: Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model's own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model's semantic output.
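Illustration (not from the paper): the abstract's starting point is that a single flipped bit in a stored weight can change its value drastically. The sketch below shows only that low-level effect on an IEEE 754 float32 value using Python's standard `struct` module. The helper name `flip_bit` is hypothetical, and nothing here reproduces BLADE's gradient-based bit selection.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the new value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask toggles exactly one bit.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    weight = 1.0
    # Bit 0 is the lowest mantissa bit, bits 23-30 are the exponent, bit 31 is the sign.
    for bit in (0, 23, 30, 31):
        print(f"bit {bit:2d}: {weight} -> {flip_bit(weight, bit)}")
```

Mantissa flips barely move the weight, while exponent and sign flips change it by orders of magnitude or invert it, which is why the choice of bit matters so much for any fault analysis.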

Title: Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example

Authors: Arno Appenzeller, Nick Terzer, André Hohmeyer, Jan-Philipp Redlich, Sabine Luttmann, Friedrich Feuerhake, Nadine S. Schaadt, Timm Intemann, Sarah Teuber-Hanselmann, Stefan Nikolin, Joachim Weis, Klaus Kraywinkel, Pascal Birnstill
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14721
Pdf URL: https://arxiv.org/pdf/2512.14721
Copy Paste: [[2512.14721]] Automatic Extraction of Rules for Generating Synthetic Patient Data From Real-World Population Data Using Glioblastoma as an Example(https://arxiv.org/abs/2512.14721)
Keywords: generation
Abstract: The generation of synthetic data is a promising technology to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Since they only contain statistical information, rules usually have no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should consider the specific limitations as with any currently available approach.

Title: Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion

Authors: Francisco Giral, Álvaro Manzano, Ignacio Gómez, Petros Koumoutsakos, Soledad Le Clainche
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14725
Pdf URL: https://arxiv.org/pdf/2512.14725
Copy Paste: [[2512.14725]] Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion(https://arxiv.org/abs/2512.14725)
Keywords: generative, quality assessment
Abstract: Urban wind flow modeling and simulation play an important role in air quality assessment and sustainable city planning. A key challenge for modeling and simulation is handling the complex geometries of the urban landscape. Low order models are limited in capturing the effects of geometry, while high-fidelity Computational Fluid Dynamics (CFD) simulations are prohibitively expensive, especially across multiple geometries or wind conditions. Here, we propose a generative diffusion framework for synthesizing steady-state urban wind fields over unstructured meshes that requires only geometry information. The framework combines a hierarchical graph neural network with score-based diffusion modeling to generate accurate and diverse velocity fields without requiring temporal rollouts or dense measurements. Trained across multiple mesh slices and wind angles, the model generalizes to unseen geometries, recovers key flow structures such as wakes and recirculation zones, and offers uncertainty-aware predictions. Ablation studies confirm robustness to mesh variation and performance under different inference regimes. This work is the first step towards foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty.

Title: Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse

Authors: Jingwei Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14879
Pdf URL: https://arxiv.org/pdf/2512.14879
Copy Paste: [[2512.14879]] Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse(https://arxiv.org/abs/2512.14879)
Keywords: generation
Abstract: Self-referential learning -- training a model on data it generated itself -- promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir -- a high-entropy distribution mixed into each projection -- injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.
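Illustration (not from the paper): the collapse mechanism described in the abstract can be imitated with a toy categorical model that is repeatedly refit to its own finite samples. Everything below is an assumption made for illustration: the function name `self_train`, the uniform distribution standing in for the "entropy reservoir", the mixing weight `reservoir_weight`, and the category, sample, and generation counts. It is a sketch of the qualitative behaviour, not the ERBP framework or its rates.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability categories."""
    return -sum(q * math.log(q) for q in p if q > 0)


def self_train(k=50, n=50, generations=1000, reservoir_weight=0.0, seed=0):
    """Refit a categorical model to its own samples and return the final entropy.

    reservoir_weight is the (assumed) fraction of a uniform distribution
    mixed into every refit; 0.0 gives the fully closed loop.
    """
    rng = random.Random(seed)
    p = [1.0 / k] * k
    for _ in range(generations):
        # Draw a finite sample from the current model.
        samples = rng.choices(range(k), weights=p, k=n)
        counts = [0] * k
        for s in samples:
            counts[s] += 1
        # Refit to the empirical distribution, optionally mixed with the reservoir.
        p = [(1 - reservoir_weight) * c / n + reservoir_weight / k for c in counts]
    return entropy(p)


if __name__ == "__main__":
    print("closed loop entropy  :", self_train(reservoir_weight=0.0))
    print("with reservoir (0.2) :", self_train(reservoir_weight=0.2))
```

In the closed loop, categories that receive zero samples can never return, so the support shrinks until one category remains. Mixing in the uniform reservoir keeps every probability strictly positive, which gives the entropy a floor.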

Title: Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Authors: Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, Jeff Da
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14895
Pdf URL: https://arxiv.org/pdf/2512.14895
Copy Paste: [[2512.14895]] Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections(https://arxiv.org/abs/2512.14895)
Keywords: generation
Abstract: A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy's behavior diverges from the expert's, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common technique that combines rejection sampling (i.e., using environment reward) with supervised fine-tuning. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7B and 32B settings, respectively, on SWE-bench Verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.
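Illustration (not from the paper): the data-generation idea in the abstract, starting a rollout with the student and handing over to the expert part way through, can be sketched with plain callables. The names `rollout_with_expert_correction`, `switch_at`, and the toy integer environment are hypothetical. The abstract does not say how the switch point is chosen or which turns are kept for fine-tuning, so neither is modelled here.

```python
from typing import Callable, List, Tuple

State = int
Action = str
Policy = Callable[[State], Action]
Step = Callable[[State, Action], State]


def rollout_with_expert_correction(
    student: Policy,
    expert: Policy,
    step: Step,
    initial_state: State,
    switch_at: int,
    max_steps: int,
) -> List[Tuple[State, Action, str]]:
    """Roll out with the student for switch_at turns, then with the expert.

    Returns (state, action, actor) triples so that each turn records which
    policy produced it.
    """
    trajectory = []
    state = initial_state
    for t in range(max_steps):
        use_student = t < switch_at
        policy = student if use_student else expert
        action = policy(state)
        trajectory.append((state, action, "student" if use_student else "expert"))
        state = step(state, action)
    return trajectory


if __name__ == "__main__":
    # Toy environment: the state is an integer position; the expert walks right.
    def toy_step(s, a):
        return s + 1 if a == "right" else s - 1

    for turn in rollout_with_expert_correction(
        student=lambda s: "left",
        expert=lambda s: "right",
        step=toy_step,
        initial_state=0,
        switch_at=2,
        max_steps=5,
    ):
        print(turn)
```

The expert's turns begin from states the student actually reached, which is the property that distinguishes this data from purely off-policy expert trajectories.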

Title: TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Authors: Zhenzhi Wang, Jian Wang, Ke Ma, Dahua Lin, Bing Zhou
Subjects: cs.CV, cs.AI, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2512.14938
Pdf URL: https://arxiv.org/pdf/2512.14938
Copy Paste: [[2512.14938]] TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation(https://arxiv.org/abs/2512.14938)
Keywords: generation
Abstract: We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10× lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: this https URL

Title: Softly Constrained Denoisers for Diffusion Models

Authors: Victor M. Yeom Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.14980
Pdf URL: https://arxiv.org/pdf/2512.14980
Copy Paste: [[2512.14980]] Softly Constrained Denoisers for Diffusion Models(https://arxiv.org/abs/2512.14980)
Keywords: generative
Abstract: Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem, especially when the constraint is misspecified, a common issue when formulating constraints on scientific data. In this paper, instead of changing the loss or the sampling loop, we integrate a guidance-inspired adjustment into the denoiser itself, giving it a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, and maintain enough flexibility to deviate from it when there is misspecification with observed data.

Title: Where is the Watermark? Interpretable Watermark Detection at the Block Level

Authors: Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.14994
Pdf URL: https://arxiv.org/pdf/2512.14994
Copy Paste: [[2512.14994]] Where is the Watermark? Interpretable Watermark Detection at the Block Level(https://arxiv.org/abs/2512.14994)
Keywords: generative
Abstract: Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.

Title: DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding

Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Pengtao Xie
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.15000
Pdf URL: https://arxiv.org/pdf/2512.15000
Copy Paste: [[2512.15000]] DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding(https://arxiv.org/abs/2512.15000)
Keywords: generation
Abstract: Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9 pass@1 rate, surpassing OpenAI o4-mini.

Title: Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

Authors: Huaying Zhang, Atsushi Hashimoto, Tosho Hirasawa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15006
Pdf URL: https://arxiv.org/pdf/2512.15006
Copy Paste: [[2512.15006]] Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation(https://arxiv.org/abs/2512.15006)
Keywords: generation
Abstract: Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D's expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available at this https URL.

Title: MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance

Authors: Kaizhe Zhang, Shinan Chen, Qian Zhao, Weizhan Zhang, Caixia Yan, Yudeng Xin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15048
Pdf URL: https://arxiv.org/pdf/2512.15048
Copy Paste: [[2512.15048]] MVGSR: Multi-View Consistent 3D Gaussian Super-Resolution via Epipolar Guidance(https://arxiv.org/abs/2512.15048)
Keywords: super-resolution
Abstract: Scenes reconstructed by 3D Gaussian Splatting (3DGS) trained on low-resolution (LR) images are unsuitable for high-resolution (HR) rendering. Consequently, a 3DGS super-resolution (SR) method is needed to bridge LR inputs and HR rendering. Early 3DGS SR methods rely on single-image SR networks, which lack cross-view consistency and fail to fuse complementary information across views. More recent video-based SR approaches attempt to address this limitation but require strictly sequential frames, limiting their applicability to unstructured multi-view datasets. In this work, we introduce Multi-View Consistent 3D Gaussian Splatting Super-Resolution (MVGSR), a framework that focuses on integrating multi-view information for 3DGS rendering with high-frequency details and enhanced consistency. We first propose an Auxiliary View Selection Method based on camera poses, making our method adaptable for arbitrarily organized multi-view datasets without the need of temporal continuity or data reordering. Furthermore, we introduce, for the first time, an epipolar-constrained multi-view attention mechanism into 3DGS SR, which serves as the core of our proposed multi-view SR network. This design enables the model to selectively aggregate consistent information from auxiliary views, enhancing the geometric consistency and detail fidelity of 3DGS representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both object-centric and scene-level 3DGS SR benchmarks.

Title: EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks

Authors: Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2512.15067
Pdf URL: https://arxiv.org/pdf/2512.15067
Copy Paste: [[2512.15067]] EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks(https://arxiv.org/abs/2512.15067)
Keywords: generation
Abstract: The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
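Illustration (not from the paper): the continuous ranked probability score (CRPS) cited in the abstract can be estimated from forecast samples with the standard identity CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast and y is the observation. The function name `crps_from_samples` and the example numbers are hypothetical, and the paper's own evaluation code may use a different estimator.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observation: float) -> float:
    """Sample-based CRPS estimate for one scalar observation (lower is better)."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one forecast sample is required")
    # Mean absolute error of the samples against the observation: E|X - y|.
    accuracy = sum(abs(x - observation) for x in samples) / n
    # Mean absolute difference between all sample pairs: E|X - X'|.
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


if __name__ == "__main__":
    # Hypothetical forecast samples for a single measurement.
    forecast = [0.8, 1.1, 0.9, 1.3, 1.0]
    print(crps_from_samples(forecast, observation=1.05))
```

With a single sample the spread term vanishes and the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalisation of mean absolute error.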

Title: The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems

Authors: Debu Sinha
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.15068
Pdf URL: https://arxiv.org/pdf/2512.15068
Copy Paste: [[2512.15068]] The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems(https://arxiv.org/abs/2512.15068)
Keywords: generation
Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the "semantic illusion": semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.
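Note: the coverage and false-positive trade-off described in this abstract can be illustrated with generic split conformal calibration. The sketch below is not the paper's procedure. It assumes a similarity score where higher means "closer to the source", calibrates a threshold on known hallucinations so that a target share of them is flagged, and then measures how many faithful answers are flagged too. All names and numbers are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of the calibration scores.

    Under exchangeability, a fresh score from the same distribution is
    <= the returned threshold with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: flag everything.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def false_positive_rate(faithful_scores, threshold):
    """Share of faithful answers that fall at or below the flagging threshold."""
    flagged = sum(1 for s in faithful_scores if s <= threshold)
    return flagged / len(faithful_scores)


# Hypothetical similarity scores: hallucinations that stay close to the source.
hallucinated = [0.80, 0.85, 0.90, 0.92, 0.95]
faithful = [0.88, 0.91, 0.93, 0.97]

t = conformal_threshold(hallucinated, alpha=0.2)
print(t, false_positive_rate(faithful, t))
```

When hallucinated answers score almost as high as faithful ones, the calibrated threshold has to sit high as well, and most faithful answers end up flagged. That is the failure mode the abstract calls the "semantic illusion".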

Title: PMMD: A pose-guided multi-view multi-modal diffusion for person generation

Authors: Ziyu Shang, Haoran Liu, Rongchao Zhang, Zhiqian Wei, Tongtong Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15069
Pdf URL: https://arxiv.org/pdf/2512.15069
Copy Paste: [[2512.15069]] PMMD: A pose-guided multi-view multi-modal diffusion for person generation(https://arxiv.org/abs/2512.15069)
Keywords: generation
Abstract: Generating consistent human images with controllable pose and appearance is essential for applications in virtual try-on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross-modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross-modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at this https URL.

Title: The Semantic Architect: How FEAML Bridges Structured Data and LLMs for Multi-Label Tasks

Authors: Wanfu Gao, Zebin He, Jun Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15082
Pdf URL: https://arxiv.org/pdf/2512.15082
Copy Paste: [[2512.15082]] The Semantic Architect: How FEAML Bridges Structured Data and LLMs for Multi-Label Tasks(https://arxiv.org/abs/2512.15082)
Keywords: generation
Abstract: Existing feature engineering methods based on large language models (LLMs) have not yet been applied to multi-label learning tasks. They lack the ability to model complex label dependencies and are not specifically adapted to the characteristics of multi-label tasks. To address the above issues, we propose Feature Engineering Automation for Multi-Label Learning (FEAML), an automated feature engineering method for multi-label classification which leverages the code generation capabilities of LLMs. By utilizing metadata and label co-occurrence matrices, LLMs are guided to understand the relationships between data features and task objectives, based on which high-quality features are generated. The newly generated features are evaluated in terms of model accuracy to assess their effectiveness, while Pearson correlation coefficients are used to detect redundancy. FEAML further incorporates the evaluation results as feedback to drive LLMs to continuously optimize code generation in subsequent iterations. By integrating LLMs with a feedback mechanism, FEAML realizes an efficient, interpretable and self-improving feature engineering paradigm. Empirical results on various multi-label datasets demonstrate that our FEAML outperforms other feature engineering methods.
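Note: the FEAML abstract says Pearson correlation coefficients are used to detect redundant generated features. A minimal sketch of such a check follows. The 0.95 cut-off, the function names, and the toy columns are assumptions made for illustration and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; treat as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def is_redundant(new_feature, existing_features, threshold=0.95):
    """Flag a candidate feature that is nearly collinear with an existing one."""
    return any(abs(pearson(new_feature, f)) >= threshold for f in existing_features)


# Hypothetical columns: the candidate is an exact rescaling of an existing feature.
existing = [[2.0, 4.0, 6.0], [1.0, -1.0, 1.0]]
print(is_redundant([1.0, 2.0, 3.0], existing))
```

Taking the absolute value matters here: a feature that is the negation of an existing one is just as redundant as a rescaled copy.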

Title: Uni-Parser Technical Report

Authors: Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu, Han Lyu, Chaozheng Huang, Xinyu Li, Linfeng Zhang, Guolin Ke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15098
Pdf URL: https://arxiv.org/pdf/2512.15098
Copy Paste: [[2512.15098]] Uni-Parser Technical Report(https://arxiv.org/abs/2512.15098)
Keywords: generation
Abstract: This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.

Title: Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets

Authors: Jialong Zuo, Haoyou Deng, Hanyu Zhou, Jiaxin Zhu, Yicheng Zhang, Yiwei Zhang, Yongxin Yan, Kaixing Huang, Weisen Chen, Yongtai Deng, Rui Jin, Nong Sang, Changxin Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15110
Pdf URL: https://arxiv.org/pdf/2512.15110
Copy Paste: [[2512.15110]] Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets(https://arxiv.org/abs/2512.15110)
Keywords: generation, generative
Abstract: The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conducted a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. By utilizing simple textual prompts without fine-tuning, we benchmarked Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while Nano Banana Pro demonstrates superior subjective visual quality, often hallucinating plausible high-frequency details that surpass specialist models, it lags behind in traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency required by conventional metrics. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that achieving the high fidelity of domain specialists remains a significant hurdle.

Title: FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

Authors: Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15116
Pdf URL: https://arxiv.org/pdf/2512.15116
Copy Paste: [[2512.15116]] FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation(https://arxiv.org/abs/2512.15116)
Keywords: generative
Abstract: Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at this https URL

Title: 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding

Authors: Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15126
Pdf URL: https://arxiv.org/pdf/2512.15126
Copy Paste: [[2512.15126]] 3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding(https://arxiv.org/abs/2512.15126)
Keywords: generation, generative
Abstract: 3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.

Title: Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

Authors: Mengshi Qi, Yeteng Wu, Xianlin Zhang, Huadong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15153
Pdf URL: https://arxiv.org/pdf/2512.15153
Copy Paste: [[2512.15153]] Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning(https://arxiv.org/abs/2512.15153)
Keywords: generation, quality assessment
Abstract: Evaluating whether a human action is performed to standard, and providing reasonable feedback to improve it, is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what the action is and where it occurs, which cannot meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new, diverse dataset, CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process--from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at this https URL.

Title: Robust and Calibrated Detection of Authentic Multimedia Content

Authors: Sarim Hashmi, Abdelrahman Elsayed, Mohammed Talha Alam, Samuele Poppi, Nils Lukas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15182
Pdf URL: https://arxiv.org/pdf/2512.15182
Copy Paste: [[2512.15182]] Robust and Calibrated Detection of Authentic Multimedia Content(https://arxiv.org/abs/2512.15182)
Keywords: generative
Abstract: Generative models can synthesize highly realistic content, so-called deepfakes, that are already being misused at scale to undermine digital media authenticity. Current deepfake detection methods are unreliable for two reasons: (i) distinguishing inauthentic content post-hoc is often impossible (e.g., with memorized samples), leading to an unbounded false positive rate (FPR); and (ii) detection lacks robustness, as adversaries can adapt to known detectors with near-perfect accuracy using minimal computational resources. To address these limitations, we propose a resynthesis framework to determine if a sample is authentic or if its authenticity can be plausibly denied. We make two key contributions focusing on the high-precision, low-recall setting against efficient (i.e., compute-restricted) adversaries. First, we demonstrate that our calibrated resynthesis method is the most reliable approach for verifying authentic samples while maintaining controllable, low FPRs. Second, we show that our method achieves adversarial robustness against efficient adversaries, whereas prior methods are easily evaded under identical compute budgets. Our approach supports multiple modalities and leverages state-of-the-art inversion techniques.

Title: SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal

Authors: Xiyu Zhu, Wei Wang, Xin Yuan, Xiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15221
Pdf URL: https://arxiv.org/pdf/2512.15221
Copy Paste: [[2512.15221]] SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal(https://arxiv.org/abs/2512.15221)
Keywords: generation
Abstract: Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM) for local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.

Title: Accelerating High-Throughput Catalyst Screening by Direct Generation of Equilibrium Adsorption Structures

Authors: Songze Huo, Xiao-Ming Cao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15228
Pdf URL: https://arxiv.org/pdf/2512.15228
Copy Paste: [[2512.15228]] Accelerating High-Throughput Catalyst Screening by Direct Generation of Equilibrium Adsorption Structures(https://arxiv.org/abs/2512.15228)
Keywords: generation, generative
Abstract: The adsorption energy serves as a crucial descriptor for the large-scale screening of catalysts. Nevertheless, the limited distribution of training data for the extensively utilised machine learning interatomic potential (MLIP), predominantly sourced from near-equilibrium structures, results in unreliable adsorption structures and consequent adsorption energy predictions. In this context, we present DBCata, a deep generative model that integrates a periodic Brownian-bridge framework with an equivariant graph neural network to establish a low-dimensional transition manifold between unrelaxed and DFT-relaxed structures, without requiring explicit energy or force information. Upon training, DBCata effectively generates high-fidelity adsorption geometries, achieving an interatomic distance mean absolute error (DMAE) of 0.035 Å on the Catalysis-Hub dataset, which is nearly three times superior to that of the current state-of-the-art machine learning potential models. Moreover, the corresponding DFT accuracy can be improved within 0.1 eV in 94% of instances by identifying and refining anomalous predictions through a hybrid chemical-heuristic and self-supervised outlier detection approach. We demonstrate that the remarkable performance of DBCata facilitates accelerated high-throughput computational screening for efficient alloy catalysts in the oxygen reduction reaction, highlighting the potential of DBCata as a powerful tool for catalyst design and optimisation.

Title: MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement

Authors: Yingying Wang, Xuanhua He, Chen Wu, Jialing Huang, Suiyun Zhang, Rui Liu, Xinghao Ding, Haoxuan Che
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15261
Pdf URL: https://arxiv.org/pdf/2512.15261
Copy Paste: [[2512.15261]] MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement(https://arxiv.org/abs/2512.15261)
Keywords: super-resolution, generation
Abstract: Pan-sharpening aims to generate high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) image with its corresponding low-resolution multispectral (MS) image. To achieve effective fusion, it is crucial to fully exploit the complementary information between the two modalities. Traditional CNN-based methods typically rely on channel-wise concatenation with fixed convolutional operators, which limits their adaptability to diverse spatial and spectral variations. While cross-attention mechanisms enable global interactions, they are computationally inefficient and may dilute fine-grained correspondences, making it difficult to capture complex semantic relationships. Recent advances in the Multimodal Diffusion Transformer (MMDiT) architecture have demonstrated impressive success in image generation and editing tasks. Unlike cross-attention, MMDiT employs in-context conditioning to facilitate more direct and efficient cross-modal information exchange. In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. Built upon the Mamba architecture, our design ensures linear computational complexity while maintaining strong cross-modal interaction capacity. Furthermore, we introduce a novel multimodal interleaved (MI) scanning mechanism that facilitates effective information exchange between the PAN and MS modalities. Extensive experiments demonstrate the superior performance of our method compared to existing state-of-the-art (SOTA) techniques across multiple tasks and benchmarks.

Title: Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions

Authors: Siva Sai, Ishika Goyal, Shubham Sharma, Sri Harshita Manuri, Vinay Chamola, Rajkumar Buyya
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2512.15286
Pdf URL: https://arxiv.org/pdf/2512.15286
Copy Paste: [[2512.15286]] Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions(https://arxiv.org/abs/2512.15286)
Keywords: generative
Abstract: The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.

Title: SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation

Authors: Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15310
Pdf URL: https://arxiv.org/pdf/2512.15310
Copy Paste: [[2512.15310]] SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2512.15310)
Keywords: generation, generative
Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to produce pixel-level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real-world training samples. In this paper, we introduce a novel direction, Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg-Agents, a multi-agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg-Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt-space exploration, guided by CLIP-based similarity and nearest-neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision-Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high-quality samples, and a ViT-based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high-quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg-Agents achieves competitive performance without using real training images. This highlights the potential of LLM-driven agents in enabling cost-efficient and scalable semantic segmentation.

Title: Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

Authors: Antony Jerald, Dattesh Shanbhag, Sudhanya Chatterjee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15315
Pdf URL: https://arxiv.org/pdf/2512.15315
Copy Paste: [[2512.15315]] Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment(https://arxiv.org/abs/2512.15315)
Keywords: quality assessment
Abstract: Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image's proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.
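Note: the abstract does not say how the grade-specific affinity scores are computed. One plausible reading, sketched below purely as an assumption, is a softmax over cosine similarities between an image embedding and one prototype vector per motion grade. The prototype construction, the temperature value, and every name here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affinity_scores(embedding, prototypes, temperature=0.1):
    """Softmax over cosine similarities to each grade prototype.

    `prototypes` maps a grade label to a prototype vector (for example the
    mean embedding of training slices with that grade; an assumption).
    """
    sims = {grade: cosine(embedding, proto) for grade, proto in prototypes.items()}
    top = max(sims.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {g: math.exp((s - top) / temperature) for g, s in sims.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}


# Hypothetical 2-D embedding space with two motion grades.
prototypes = {"none": [1.0, 0.0], "severe": [0.0, 1.0]}
scores = affinity_scores([0.9, 0.1], prototypes)
print(max(scores, key=scores.get), scores)
```

Reporting the full score vector instead of only the top grade is what would make an assignment inspectable: a slice scoring 0.55 versus 0.45 is a borderline case, while 0.99 versus 0.01 is not.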

Title: A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection

Authors: Yuxin Jiang, Yunkang Can, Weiming Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15326
Pdf URL: https://arxiv.org/pdf/2512.15326
Copy Paste: [[2512.15326]] A Masked Reverse Knowledge Distillation Method Incorporating Global and Local Information for Image Anomaly Detection(https://arxiv.org/abs/2512.15326)
Keywords: restoration
Abstract: Knowledge distillation is an effective image anomaly detection and localization scheme. However, a major drawback of this scheme is its tendency to overgeneralize, primarily due to the similarities between input and supervisory signals. In order to address this issue, this paper introduces a novel technique called masked reverse knowledge distillation (MRKD). By employing image-level masking (ILM) and feature-level masking (FLM), MRKD transforms the task of image reconstruction into image restoration. Specifically, ILM helps to capture global information by differentiating input signals from supervisory signals. On the other hand, FLM incorporates synthetic feature-level anomalies to ensure that the learned representations contain sufficient local information. With these two strategies, MRKD is endowed with stronger image context capture capacity and is less prone to overgeneralization. Experiments on the widely-used MVTec anomaly detection dataset demonstrate that MRKD achieves impressive performance: image-level 98.9% AU-ROC, pixel-level 98.4% AU-ROC, and 95.3% AU-PRO. In addition, extensive ablation experiments have validated the superiority of MRKD in mitigating the overgeneralization problem.
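Sketch (illustrative, not from the paper): image-level masking makes the network input differ from the supervisory signal, so the task becomes restoration rather than reconstruction. The Python below masks random square patches of a toy grayscale image while keeping the unmasked image as the target. The zero fill value, the random patch selection, and the function name are assumptions.

```python
import random

def mask_patches(image, patch, ratio, rng):
    """Return a copy of `image` with a fraction `ratio` of patches set to 0."""
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]
    # Top-left corners of all non-overlapping patches.
    cells = [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]
    for r, c in rng.sample(cells, int(len(cells) * ratio)):
        for y in range(r, min(r + patch, height)):
            for x in range(c, min(c + patch, width)):
                masked[y][x] = 0.0
    return masked

rng = random.Random(0)
target = [[1.0] * 4 for _ in range(4)]           # supervisory signal (unmasked)
network_input = mask_patches(target, 2, 0.5, rng)  # input with 2 of 4 patches hidden
```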

Title: Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Authors: Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15340
Pdf URL: https://arxiv.org/pdf/2512.15340
Copy Paste: [[2512.15340]] Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics(https://arxiv.org/abs/2512.15340)
Keywords: generation
Abstract: Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository at this https URL.
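Sketch (illustrative, not from the paper): one way to read "turn-level causal attention" is a block-causal mask in which a token may attend to every token in its own turn and in earlier turns, but to none in later turns. Full bidirectional attention within a turn is an assumption here, and the function name is hypothetical.

```python
def turn_level_causal_mask(turn_ids):
    """mask[i][j] is True when token i may attend to token j.

    Tokens see their own turn fully and all earlier turns, never later ones.
    """
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]

# Six tokens spread over three conversational turns.
turn_ids = [0, 0, 1, 1, 1, 2]
mask = turn_level_causal_mask(turn_ids)
```

Compared with a token-level causal mask, the first token of a turn can already see the last token of the same turn, which is what allows fusion within the turn.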

Title: Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models

Authors: Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, Ran He
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.15347
Pdf URL: https://arxiv.org/pdf/2512.15347
Copy Paste: [[2512.15347]] Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models(https://arxiv.org/abs/2512.15347)
Keywords: generative
Abstract: Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF) and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
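Sketch (illustrative, not from the paper): the abstract does not define the OVF selection rule. As an assumed stand-in, the Python below keeps the size-k subset with the largest reward variance among all splits of "i lowest plus k - i highest" rewards, then computes group-relative advantages (reward minus group mean, divided by group standard deviation) on the kept subset. All names are hypothetical.

```python
import statistics

def max_variance_subset(rewards, k):
    """Indices of the best size-k subset among lowest/highest reward splits."""
    if not 2 <= k <= len(rewards):
        raise ValueError("k must be between 2 and len(rewards)")
    order = sorted(range(len(rewards)), key=lambda idx: rewards[idx])
    best_indices, best_var = None, -1.0
    for i in range(k + 1):
        # i lowest rewards plus (k - i) highest rewards.
        chosen = order[:i] + order[len(order) - (k - i):]
        var = statistics.pvariance([rewards[j] for j in chosen])
        if var > best_var:
            best_indices, best_var = chosen, var
    return best_indices

def group_relative_advantages(rewards):
    """Normalize rewards within the group, as in GRPO-style training."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Most rewards cluster near 0.5; only two trajectories are informative.
rewards = [0.50, 0.52, 0.49, 0.90, 0.10, 0.51, 0.48, 0.53]
kept = max_variance_subset(rewards, k=4)
advantages = group_relative_advantages([rewards[i] for i in kept])
```

In this toy group the clustered rewards contribute almost nothing to the variance, so the kept subset retains both outliers and yields larger-magnitude advantages than the full group would.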

Title: Robustness Evaluation of Machine Learning Models for Fault Classification and Localization In Power System Protection

Authors: Julian Oelhaf, Mehran Pashaei, Georg Kordowich, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.15385
Pdf URL: https://arxiv.org/pdf/2512.15385
Copy Paste: [[2512.15385]] Robustness Evaluation of Machine Learning Models for Fault Classification and Localization In Power System Protection(https://arxiv.org/abs/2512.15385)
Keywords: generation
Abstract: The growing penetration of renewable and distributed generation is transforming power systems and challenging conventional protection schemes that rely on fixed settings and local measurements. Machine learning (ML) offers a data-driven alternative for centralized fault classification (FC) and fault localization (FL), enabling faster and more adaptive decision-making. However, practical deployment critically depends on robustness. Protection algorithms must remain reliable even when confronted with missing, noisy, or degraded sensor data. This work introduces a unified framework for systematically evaluating the robustness of ML models in power system protection. High-fidelity EMT simulations are used to model realistic degradation scenarios, including sensor outages, reduced sampling rates, and transient communication losses. The framework provides a consistent methodology for benchmarking models, quantifying the impact of limited observability, and identifying critical measurement channels required for resilient operation. Results show that FC remains highly stable under most degradation types but drops by about 13% under single-phase loss, while FL is more sensitive overall, with voltage loss increasing localization error by over 150%. These findings offer actionable guidance for robustness-aware design of future ML-assisted protection systems.
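Sketch (illustrative, not from the paper): the three degradation scenarios named in the abstract can be mimicked on a multichannel time series as shown below. The concrete choices (zeroing a failed channel, naive decimation without an anti-alias filter, holding the last received sample during a communication loss) and all function names are assumptions; the paper applies its degradations to EMT simulation data.

```python
def sensor_outage(signals, channel):
    """Zero out one measurement channel to mimic a failed sensor."""
    return [[0.0 if c == channel else v for c, v in enumerate(sample)]
            for sample in signals]

def reduce_sampling_rate(signals, factor):
    """Keep every `factor`-th sample (naive decimation)."""
    return signals[::factor]

def communication_loss(signals, start, length):
    """Hold the last received sample for `length` steps starting at `start`."""
    out = [list(sample) for sample in signals]
    for t in range(start, min(start + length, len(out))):
        # If the loss starts at t = 0 there is nothing to hold, so use zeros.
        out[t] = list(out[t - 1]) if t > 0 else [0.0] * len(out[t])
    return out

# Six time steps of three channels (e.g. three phase measurements).
signals = [[float(t), float(10 + t), float(20 + t)] for t in range(6)]
no_phase_b = sensor_outage(signals, channel=1)
half_rate = reduce_sampling_rate(signals, factor=2)
dropped = communication_loss(signals, start=2, length=3)
```

A robustness study would then compare a model's accuracy on `signals` against each degraded copy.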

Title: FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows

Authors: Yeonwoo Cha, Semin Kim, Jinhyeon Kwon, Seunghoon Hong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15420
Pdf URL: https://arxiv.org/pdf/2512.15420
Copy Paste: [[2512.15420]] FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows(https://arxiv.org/abs/2512.15420)
Keywords: generation
Abstract: Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at this https URL.
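Sketch (illustrative, not from the paper): a flow-matching objective in its common linear-path form regresses a predicted velocity onto the straight-line displacement between a source sample x0 and a target sample x1. The Python below shows only that generic loss. FlowBind's shared latent and modality-specific invertible flows are not reproduced, and the function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t, target = flow_matching_pair(x0, x1, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

x0 = [0.0, 0.0]   # source sample (e.g. noise or a latent)
x1 = [2.0, -1.0]  # target sample (e.g. a data point)

def oracle(x_t, t):
    """A predictor that already knows the true displacement."""
    return [2.0, -1.0]

def zero_model(x_t, t):
    """An untrained predictor that always outputs zero velocity."""
    return [0.0, 0.0]

loss_oracle = flow_matching_loss(oracle, x0, x1, t=0.3)
loss_zero = flow_matching_loss(zero_model, x0, x1, t=0.3)
```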

Title: Copyright Infringement Risk Reduction via Chain-of-Thought and Task Instruction Prompting

Authors: Neeraj Sarna, Yuanyuan Li, Michael von Gablenz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15442
Pdf URL: https://arxiv.org/pdf/2512.15442
Copy Paste: [[2512.15442]] Copyright Infringement Risk Reduction via Chain-of-Thought and Task Instruction Prompting(https://arxiv.org/abs/2512.15442)
Keywords: generation
Abstract: Large-scale text-to-image generation models can memorize and reproduce their training dataset. Since the training dataset often contains copyrighted material, reproduction of the training dataset poses a copyright infringement risk, which could result in legal liabilities and financial losses for both the AI user and the developer. The current work explores the potential of chain-of-thought and task instruction prompting in reducing copyrighted content generation. To this end, we present a formulation that combines these two techniques with two other copyright mitigation strategies: a) negative prompting, and b) prompt re-writing. We study the generated images in terms of their similarity to a copyrighted image and their relevance to the user input. We present numerical experiments on a variety of models and provide insights on the effectiveness of the aforementioned techniques for varying model complexity.

Title: Multi-stage Bayesian optimisation for dynamic decision-making in self-driving labs

Authors: Luca Torresi, Pascal Friederich
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.15483
Pdf URL: https://arxiv.org/pdf/2512.15483
Copy Paste: [[2512.15483]] Multi-stage Bayesian optimisation for dynamic decision-making in self-driving labs(https://arxiv.org/abs/2512.15483)
Keywords: generation
Abstract: Self-driving laboratories (SDLs) are combining recent technological advances in robotics, automation, and machine-learning-based data analysis and decision-making to perform autonomous experimentation toward human-directed goals without requiring any direct human intervention. SDLs are successfully used in materials science, chemistry, and beyond, to optimise processes, materials, and devices in a systematic and data-efficient way. At present, the most widely used algorithm to identify the most informative next experiment is Bayesian optimisation. While relatively simple to apply to a wide range of optimisation problems, standard Bayesian optimisation relies on a fixed experimental workflow with a clear set of optimisation parameters and one or more measurable objective functions. This excludes the possibility of making on-the-fly decisions about changes in the planned sequence of operations and including intermediate measurements in the decision-making process. Therefore, many real-world experiments need to be adapted and simplified to be converted to the common setting in self-driving labs. In this paper, we introduce an extension to Bayesian optimisation that allows flexible sampling of multi-stage workflows and makes optimal decisions based on intermediate observables, which we call proxy measurements. We systematically compare the advantage of taking into account proxy measurements over conventional Bayesian optimisation, in which only the final measurement is observed. We find that over a wide range of scenarios, proxy measurements yield a substantial improvement, both in the time to find good solutions and in the overall optimality of found solutions. This paves the way not only to use more complex and thus more realistic experimental workflows in autonomous labs but also to smoothly combine simulations and experiments in the next generation of SDLs.

Title: Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting

Authors: Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15508
Pdf URL: https://arxiv.org/pdf/2512.15508
Copy Paste: [[2512.15508]] Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting(https://arxiv.org/abs/2512.15508)
Keywords: generation
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.

Title: VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics

Authors: Opeyemi Bamigbade, Mark Scanlon, John Sheppard
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2512.15512
Pdf URL: https://arxiv.org/pdf/2512.15512
Copy Paste: [[2512.15512]] VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics(https://arxiv.org/abs/2512.15512)
Keywords: generation, generative
Abstract: Recent advances in AI-driven image generation have introduced new challenges for verifying the authenticity of digital evidence in forensic investigations. Modern generative models can produce visually consistent forgeries that evade traditional detectors based on pixel or compression artefacts. Most existing approaches also lack an explicit measure of anomaly intensity, which limits their ability to quantify the severity of manipulation. This paper introduces Vision-Attention Anomaly Scoring (VAAS), a novel dual-module framework that integrates global attention-based anomaly estimation using Vision Transformers (ViT) with patch-level self-consistency scoring derived from SegFormer embeddings. The hybrid formulation provides a continuous and interpretable anomaly score that reflects both the location and degree of manipulation. Evaluations on the DF2023 and CASIA v2.0 datasets demonstrate that VAAS achieves competitive F1 and IoU performance, while enhancing visual explainability through attention-guided anomaly maps. The framework bridges quantitative detection with human-understandable reasoning, supporting transparent and reliable image integrity assessment. The source code for all experiments and corresponding materials for reproducing the results are available as open source.
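Sketch (illustrative, not from the paper): the abstract reports F1 and IoU for manipulation localization. Both follow from pixel-wise true positives, false positives, and false negatives between a predicted and a ground-truth binary mask, as in the Python below. The flattened-mask representation and the function name are assumptions, and the paper's thresholding and averaging protocol is not specified in the abstract.

```python
def f1_and_iou(predicted, truth):
    """Pixel-wise F1 and IoU for two flattened binary masks (1 = manipulated)."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    if tp + fp + fn == 0:
        # Both masks are empty: treat as perfect agreement.
        return 1.0, 1.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou

predicted_mask = [1, 1, 0, 0]
truth_mask = [1, 0, 1, 0]
f1, iou = f1_and_iou(predicted_mask, truth_mask)
```

The two metrics are monotonically related (F1 = 2·IoU / (1 + IoU)), so F1 is never lower than IoU for the same pair of masks.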

Title: DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations

Authors: Yuxiang Shi, Zhe Li, Yanwen Wang, Hao Zhu, Xun Cao, Ligang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15524
Pdf URL: https://arxiv.org/pdf/2512.15524
Copy Paste: [[2512.15524]] DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations(https://arxiv.org/abs/2512.15524)
Keywords: generation
Abstract: Portrait animation from a single source image and a driving video is a long-standing problem. Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation. However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation. To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention. Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.

Title: An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain

Authors: João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15531
Pdf URL: https://arxiv.org/pdf/2512.15531
Copy Paste: [[2512.15531]] An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain(https://arxiv.org/abs/2512.15531)
Keywords: generation
Abstract: The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, a significant focus has been given to the collection of large amounts of training data that cover multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of using and training LVLMs is high, due to the large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that can effectively address multi-task learning while remaining compact in terms of the number of parameters. In particular, our model tackles combinations of tasks that are not typically explored in a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model - named from Multi-task Efficient Learning Transformer - in established benchmarks confirm the efficacy and efficiency of the proposed approach.

Title: GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Authors: Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15560
Pdf URL: https://arxiv.org/pdf/2512.15560
Copy Paste: [[2512.15560]] GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models(https://arxiv.org/abs/2512.15560)
Keywords: generation
Abstract: The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: this https URL.
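Sketch (illustrative, not from the paper): the abstract mentions a layer-wise weighting method for extracting text features but gives no formula. One common form, assumed here, is a softmax-normalized weighted sum of per-layer hidden states with one learnable logit per layer. The function names and the toy values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_features(layer_features, layer_logits):
    """Combine per-layer feature vectors using softmax(layer_logits) as weights."""
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [sum(w * feats[i] for w, feats in zip(weights, layer_features))
            for i in range(dim)]

# Two layers, each giving a 2-D feature for the same token.
layer_features = [[1.0, 2.0], [3.0, 4.0]]
uniform = weighted_layer_features(layer_features, [0.0, 0.0])
mostly_last = weighted_layer_features(layer_features, [0.0, 10.0])
```

With equal logits the result is the plain average of the layers; training the logits would let the encoder emphasize whichever layers carry the most useful features.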

Title: Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Authors: Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, Lionel M. Ni, Jingren Zhou, Junyang Lin, Chenfei Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15603
Pdf URL: https://arxiv.org/pdf/2512.15603
Copy Paste: [[2512.15603]] Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition(https://arxiv.org/abs/2512.15603)
Keywords: generation, generative
Abstract: Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at this https URL.
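Sketch (illustrative, not from the paper): once an image is held as RGBA layers, a flat RGB image can be rebuilt with the standard "over" alpha-compositing rule, and editing one layer leaves the stored data of the others untouched. The Python below shows this for a single pixel. Compositing onto an opaque white background and the function name are assumptions, since the abstract does not describe how layers are flattened.

```python
def composite_over(layers, background=(1.0, 1.0, 1.0)):
    """Flatten RGBA pixels (bottom layer first) onto an opaque background."""
    r, g, b = background
    for layer_r, layer_g, layer_b, alpha in layers:
        # "Over" operator: the layer covers what is below in proportion to alpha.
        r = layer_r * alpha + r * (1.0 - alpha)
        g = layer_g * alpha + g * (1.0 - alpha)
        b = layer_b * alpha + b * (1.0 - alpha)
    return (r, g, b)

bottom = (1.0, 0.0, 0.0, 1.0)  # opaque red
top = (0.0, 0.0, 1.0, 0.5)     # half-transparent blue
original = composite_over([bottom, top])

# Edit only the top layer (recolor it green); the bottom layer is reused as-is.
edited_top = (0.0, 1.0, 0.0, 0.5)
edited = composite_over([bottom, edited_top])
```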

Title: InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization

Authors: Qirui Li, Yizhe Tang, Ran Yi, Guangben Lu, Fangyuan Zou, Peng Shu, Huan Yu, Jie Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15644
Pdf URL: https://arxiv.org/pdf/2512.15644
Copy Paste: [[2512.15644]] InpaintDPO: Mitigating Spatial Relationship Hallucinations in Foreground-conditioned Inpainting via Diverse Preference Optimization(https://arxiv.org/abs/2512.15644)
Keywords: generation
Abstract: Foreground-conditioned inpainting, which aims at generating a harmonious background for a given foreground subject based on the text prompt, is an important subfield in controllable image generation. A common challenge in current methods, however, is the occurrence of Spatial Relationship Hallucinations between the foreground subject and the generated background, including inappropriate scale, positional relationships, and viewpoints. Critically, the subjective nature of spatial rationality makes it challenging to quantify, hindering the use of traditional reward-based RLHF methods. To address this issue, we propose InpaintDPO, the first Direct Preference Optimization (DPO) based framework dedicated to spatial rationality in foreground-conditioned inpainting, ensuring plausible spatial relationships between foreground and background elements. To resolve the gradient conflicts in standard DPO caused by identical foreground in win-lose pairs, we propose MaskDPO, which confines preference optimization exclusively to the background to enhance background spatial relationships, while retaining the inpainting loss in the foreground region for robust foreground preservation. To enhance coherence at the foreground-background boundary, we propose Conditional Asymmetric Preference Optimization, which samples pairs with differentiated cropping operations and applies global preference optimization to promote contextual awareness and enhance boundary coherence. Finally, based on the observation that winning samples share a commonality in plausible spatial relationships, we propose Shared Commonality Preference Optimization to enhance the model's understanding of spatial commonality across high-quality winning samples, further promoting shared spatial rationality.
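Sketch (background, not from the paper): for reference, the standard Direct Preference Optimization objective for a policy $\pi_\theta$, a frozen reference $\pi_{\mathrm{ref}}$, a prompt $x$, and a preferred/rejected pair $(y_w, y_l)$ is given below. This is the generic language-model form. The abstract gives no formulas, so how MaskDPO restricts this objective to the background region, and which diffusion-specific variant is used, are not reproduced here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $\sigma$ is the logistic function and $\beta > 0$ controls how far the policy may drift from the reference.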

Title: SoFlow: Solution Flow Models for One-Step Generative Modeling

Authors: Tianze Luo, Haotian Yuan, Zhuang Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.15657
Pdf URL: https://arxiv.org/pdf/2512.15657
Copy Paste: [[2512.15657]] SoFlow: Solution Flow Models for One-Step Generative Modeling(https://arxiv.org/abs/2512.15657)
Keywords: generation, generative
Abstract: The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.
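The distinction between a velocity function and a solution function can be made concrete with a toy ODE that has a closed-form solution. The sketch below is only an illustration: the velocity field `dx/dt = -x`, the function names, and the time convention are assumptions chosen for clarity and are not taken from the paper. It shows that the solution function maps a state at time `t` directly to the state at time `s` in one evaluation, whereas the velocity function must be integrated over many steps.

```python
import math

def velocity(x, t):
    # Toy velocity field dx/dt = -x (assumed, chosen for its closed-form solution).
    return -x

def solution(x_t, t, s):
    # Exact solution function of the toy ODE: the state at time s,
    # given the state x_t at time t.
    return x_t * math.exp(-(s - t))

def euler(x_t, t, s, steps):
    # Multi-step baseline: integrate the velocity ODE with explicit Euler.
    h = (s - t) / steps
    x = x_t
    for i in range(steps):
        x = x + h * velocity(x, t + i * h)
    return x

x0 = 2.0
one_step = solution(x0, 0.0, 1.0)                          # a single evaluation
two_hops = solution(solution(x0, 0.0, 0.4), 0.4, 1.0)      # via an intermediate time
many_steps = euler(x0, 0.0, 1.0, steps=1000)               # 1000 velocity evaluations

print(one_step, two_hops, many_steps)
```

The agreement between `one_step` and `two_hops` is the kind of consistency property a learned solution function is expected to satisfy; a one-step generator aims to learn such a map so that sampling needs a single network evaluation.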

Title: Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15693
Pdf URL: https://arxiv.org/pdf/2512.15693
Copy Paste: [[2512.15693]] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning(https://arxiv.org/abs/2512.15693)
Keywords: generation
Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

Title: End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Authors: Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15702
Pdf URL: https://arxiv.org/pdf/2512.15702
Copy Paste: [[2512.15702]] End-to-End Training for Autoregressive Video Diffusion via Self-Resampling(https://arxiv.org/abs/2512.15702)
Keywords: generation
Abstract: Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
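The history-routing step, which retrieves the top-k most relevant history frames for each query without adding parameters, can be sketched as a similarity search. The relevance score used here (dot product between the query and a mean-pooled frame descriptor) and all names are assumptions for illustration; the abstract does not specify how relevance is computed.

```python
import numpy as np

def route_history(query, history_keys, k):
    """Pick the k history frames most relevant to a query (hypothetical sketch).

    query:        (d,) feature vector of the current query.
    history_keys: (T, n, d) key features for T history frames of n tokens each.
    Returns the selected frame indices in temporal order.
    """
    # Parameter-free frame descriptor: mean over the tokens of each frame.
    frame_desc = history_keys.mean(axis=1)            # (T, d)
    scores = frame_desc @ query                       # (T,)
    k = min(k, scores.shape[0])
    # Indices of the k highest scores; stable sort keeps earlier frames on ties.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)

# Four history frames with two tokens each; frames 0 and 2 align with the query.
keys = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 0.0], [2.0, 0.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
])
selected = route_history(np.array([1.0, 0.0]), keys, k=2)
print(selected)
```

Attention for the query would then be restricted to the selected frames, which keeps the cost per query bounded as the history grows.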

Title: DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.15713
Pdf URL: https://arxiv.org/pdf/2512.15713
Copy Paste: [[2512.15713]] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models(https://arxiv.org/abs/2512.15713)
Keywords: generation
Abstract: In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at this https URL.
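The abstract does not describe the block-decoding design in detail. One common way to make block-wise decoding compatible with KV cache reuse is a block-causal attention pattern: tokens attend freely within their own block and to all earlier blocks, so finished blocks never change and their keys and values can be cached. The sketch below illustrates that pattern only; the function name and the mask layout are assumptions, not the paper's method.

```python
import numpy as np

def block_attention_mask(seq_len, block_size):
    """Boolean mask where entry [i, j] is True if position i may attend to j.

    Assumed pattern: full attention inside a block, causal attention
    across blocks (earlier blocks are visible, later blocks are not).
    """
    block_id = np.arange(seq_len) // block_size
    return block_id[:, None] >= block_id[None, :]

mask = block_attention_mask(seq_len=6, block_size=2)
print(mask.astype(int))
```

Under this pattern a sequence can be extended one block at a time to any length, and only the current block needs to be recomputed at each step.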

Title: Spatia: Video Generation with Updatable Spatial Memory

Authors: Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.15716
Pdf URL: https://arxiv.org/pdf/2512.15716
Copy Paste: [[2512.15716]] Spatia: Video Generation with Updatable Spatial Memory(https://arxiv.org/abs/2512.15716)
Keywords: generation
Abstract: Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.