2025-12-23

Title: SuperFlow: Training Flow Matching Models with RL on the Fly

Authors: Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao, Lifu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.17951
Pdf URL: https://arxiv.org/pdf/2512.17951
Copy Paste: [[2512.17951]] SuperFlow: Training Flow Matching Models with RL on the Fly(https://arxiv.org/abs/2512.17951)
Keywords: generation, generative
Abstract: Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
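Note: the variance-aware group sizing mentioned in the abstract can be pictured with a minimal sketch. It assumes a fixed total sampling budget that is split across prompts in proportion to the standard deviation of each prompt's recent rewards. The function name `allocate_group_sizes`, the proportional rule, and the `min_size` floor are illustrative assumptions, not the procedure used in SuperFlow.

```python
import statistics


def allocate_group_sizes(reward_history, total_budget, min_size=2):
    """Split total_budget samples across prompts in proportion to reward std.

    reward_history: dict mapping prompt -> list of recent rewards
    (a hypothetical input format chosen for this sketch).
    """
    prompts = list(reward_history)
    if total_budget < min_size * len(prompts):
        raise ValueError("budget too small for the minimum group size")

    # Population standard deviation of rewards, per prompt.
    stds = {p: statistics.pstdev(reward_history[p]) for p in prompts}
    total_std = sum(stds.values())

    # Budget left after every prompt receives its minimum group.
    spare = total_budget - min_size * len(prompts)
    if total_std == 0:
        # No variance signal at all: fall back to a uniform split.
        shares = {p: spare / len(prompts) for p in prompts}
    else:
        shares = {p: spare * stds[p] / total_std for p in prompts}

    sizes = {p: min_size + int(shares[p]) for p in prompts}

    # Flooring can leave a few samples unassigned; give them to the
    # prompts with the largest fractional shares.
    leftover = total_budget - sum(sizes.values())
    by_fraction = sorted(
        prompts, key=lambda p: shares[p] - int(shares[p]), reverse=True
    )
    for p in by_fraction[:leftover]:
        sizes[p] += 1
    return sizes
```

Under this rule a prompt whose rewards never vary keeps only the minimum group, while prompts with noisy rewards receive most of the budget.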

Title: FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Authors: Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18073
Pdf URL: https://arxiv.org/pdf/2512.18073
Copy Paste: [[2512.18073]] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis(https://arxiv.org/abs/2512.18073)
Keywords: generation
Abstract: Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding remain unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance and explainability, and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs, paving the path for foundation models for fingerprints.

Title: SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping

Authors: Thomas Boudras, Martin Schwartz, Rasmus Fensholt, Martin Brandt, Ibrahim Fayad, Jean-Pierre Wigneron, Gabriel Belouze, Fajwel Fogel, Philippe Ciais
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18128
Pdf URL: https://arxiv.org/pdf/2512.18128
Copy Paste: [[2512.18128]] SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping(https://arxiv.org/abs/2512.18128)
Keywords: super-resolution
Abstract: High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors' native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at this https URL
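Note: the two scores quoted in the abstract, mean absolute error (MAE) and the coefficient of determination (R²), follow standard definitions. The sketch below computes both for paired reference and predicted heights; the function names are illustrative and the numbers in the usage line are toy values, not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy canopy heights in metres (made up for illustration only).
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(mean_absolute_error(reference, predicted), r_squared(reference, predicted))
```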

Title: Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection

Authors: Jie Yang, Rui Zhang, Ziyang Cheng, Dawei Cheng, Guang Yang, Bo Wang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2512.18133
Pdf URL: https://arxiv.org/pdf/2512.18133
Copy Paste: [[2512.18133]] Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection(https://arxiv.org/abs/2512.18133)
Keywords: generation
Abstract: Nowadays, Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic to protect online payment security. However, as organized crime groups are becoming more professional in real-world scenarios, fraudsters are employing more sophisticated camouflage strategies. Specifically, fraudsters disguise themselves by mimicking the behavioral data collected by platforms, ensuring that their key characteristics are consistent with those of benign users to a high degree, which we call Adaptive Camouflage. Consequently, this narrows the differences in behavioral traits between them and benign users within the platform's database, thereby making current GFD models lose efficiency. To address this problem, we propose a relation diffusion-based graph augmentation model, Grad. In detail, Grad leverages a supervised graph contrastive learning module to enhance the fraud-benign difference and employs a guided relation diffusion generator to generate auxiliary homophilic relations from scratch. Based on these, weak fraudulent signals would be enhanced during the aggregation process, thus being obvious enough to be captured. Extensive experiments have been conducted on two real-world datasets provided by WeChat Pay, one of the largest online payment platforms with billions of users, and three public datasets. The results show that our proposed model Grad outperforms SOTA methods in various scenarios, achieving at most 11.10% and 43.95% increases in AUC and AP, respectively. Our code is released at this https URL and this https URL.

Title: Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction

Authors: Taewon Yang, Jason Hu, Jeffrey A. Fessler, Liyue Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18161
Pdf URL: https://arxiv.org/pdf/2512.18161
Copy Paste: [[2512.18161]] Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction(https://arxiv.org/abs/2512.18161)
Keywords: generation, generative
Abstract: Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of 512 × 512 × 256 (about 20 minutes).

Title: MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Authors: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18181
Pdf URL: https://arxiv.org/pdf/2512.18181
Copy Paste: [[2512.18181]] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation(https://arxiv.org/abs/2512.18181)
Keywords: generation
Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: this https URL

Title: Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Authors: Junho Lee, Kwanseok Kim, Joonseok Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18184
Pdf URL: https://arxiv.org/pdf/2512.18184
Copy Paste: [[2512.18184]] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching(https://arxiv.org/abs/2512.18184)
Keywords: generation, generative
Abstract: Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian's omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at this https URL.

Title: Loom: Diffusion-Transformer for Interleaved Generation

Authors: Mingcheng Ye, Jiaming Liu, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18254
Pdf URL: https://arxiv.org/pdf/2512.18254
Copy Paste: [[2512.18254]] Loom: Diffusion-Transformer for Interleaved Generation(https://arxiv.org/abs/2512.18254)
Keywords: generation
Abstract: Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.

Title: NOVA: Discovering Well-Conditioned Winograd Transforms through Numerical Optimization of Vandermonde Arithmetic

Authors: Jayant Lohia
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.18453
Pdf URL: https://arxiv.org/pdf/2512.18453
Copy Paste: [[2512.18453]] NOVA: Discovering Well-Conditioned Winograd Transforms through Numerical Optimization of Vandermonde Arithmetic(https://arxiv.org/abs/2512.18453)
Keywords: generation
Abstract: Winograd convolution is the standard algorithm for efficient inference, reducing arithmetic complexity by 2.25x for 3x3 kernels. However, it faces a critical barrier in the modern era of low-precision computing: numerical instability. As tiles scale to maximize efficiency (e.g., F(6,3), F(8,3)), the condition numbers of standard integer-based transforms explode, reaching kappa = 2 x 10^5 for F(8,3), rendering them unusable in FP16 or Int8. We introduce NOVA (Numerical Optimization of Vandermonde Arithmetic), a discovery framework that breaks the decades-old convention of integer interpolation. Treating Winograd point selection as a continuous optimization problem, NOVA searches the manifold R^(n-1) via Evolution Strategy, snaps candidates to simple rationals, and guarantees correctness via symbolic verification. This process uncovers a hidden landscape of stable, fractional configurations such as {±5/6, ±7/6, ±3/5} that defy traditional vocabulary constraints. The impact is transformative: NOVA improves the conditioning of F(8,3) by 415x in 1D, which squares to a 172,484x improvement for 2D convolution. In real-world FP16 ImageNet inference, where standard transforms collapse to random chance (e.g., 4.7 percent accuracy on VGG16), NOVA's points restore full accuracy (75 to 78 percent), recovering over 70 percentage points without retraining, calibration, or learned parameters. These discovered transforms act as drop-in replacements, effectively unlocking the efficiency of large-tile Winograd convolution for next-generation hardware.
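Note: the 2.25x figure in the abstract can be reproduced from the standard multiplication counts for 2D Winograd convolution. The sketch below assumes the usual accounting, in which F(m x m, r x r) uses (m + r - 1)^2 element-wise multiplications against (m * r)^2 for direct convolution, and assumes the 2.25x figure refers to F(2,3). The function name is illustrative.

```python
from fractions import Fraction


def winograd_reduction_2d(m, r):
    """Multiplication reduction of 2D Winograd F(m x m, r x r) over direct convolution."""
    direct = (m * r) ** 2        # m*m outputs, each needing r*r multiplications
    winograd = (m + r - 1) ** 2  # one multiplication per element of the transformed tile
    return Fraction(direct, winograd)


for m in (2, 6, 8):
    ratio = winograd_reduction_2d(m, 3)
    print(f"F({m},3): {ratio} = {float(ratio)}x fewer multiplications")

# A separable 2D transform is a Kronecker product of two 1D transforms, and
# 2-norm condition numbers multiply under Kronecker products, so a 1D
# conditioning gain is squared in 2D.
print(415 ** 2)
```

Squaring the rounded 1D factor of 415 gives 172,225; the 172,484 quoted in the abstract presumably comes from the unrounded 1D factor, which would be about 415.3.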

Title: Plasticine: A Traceable Diffusion Model for Medical Image Translation

Authors: Tianyang Zhanng, Xinxing Cheng, Jun Cheng, Shaoming Zheng, He Zhao, Huazhu Fu, Alejandro F Frangi, Jiang Liu, Jinming Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18455
Pdf URL: https://arxiv.org/pdf/2512.18455
Copy Paste: [[2512.18455]] Plasticine: A Traceable Diffusion Model for Medical Image Translation(https://arxiv.org/abs/2512.18455)
Keywords: generation
Abstract: Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.

Title: Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review

Authors: Oraib Almegdadi, João Marcelino, Sarah Fakhreddine, João Manso, Nuno C. Marques
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18466
Pdf URL: https://arxiv.org/pdf/2512.18466
Copy Paste: [[2512.18466]] Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review(https://arxiv.org/abs/2512.18466)
Keywords: quality assessment
Abstract: Sustainable water quality underpins ecological balance and water security. Assessing and managing lakes and reservoirs is difficult due to data sparsity, heterogeneity, and nonlinear relationships among parameters. This review examines how the Self-Organizing Map (SOM), an unsupervised AI technique, is applied to water quality assessment. It synthesizes research on parameter selection, spatial and temporal sampling strategies, and clustering approaches. Emphasis is placed on how SOM handles multidimensional data and uncovers hidden patterns to support effective water management. The growing availability of environmental data from in-situ sensors, remote sensing imagery, IoT technologies, and historical records has significantly expanded analytical opportunities in environmental monitoring. SOM has proven effective in analysing complex datasets, particularly when labelled data are limited or unavailable. It enables high-dimensional data visualization, facilitates the detection of hidden ecological patterns, and identifies critical correlations among diverse water quality indicators. This review highlights SOM's versatility in ecological assessments, trophic state classification, algal bloom monitoring, and catchment area impact evaluations. The findings offer comprehensive insights into existing methodologies, supporting future research and practical applications aimed at improving the monitoring and sustainable management of lake and reservoir ecosystems.

Title: Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study

Authors: Janek Dyer, Jagdeep Ahluwalia, Javad Zarrin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18524
Pdf URL: https://arxiv.org/pdf/2512.18524
Copy Paste: [[2512.18524]] Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study(https://arxiv.org/abs/2512.18524)
Keywords: generative
Abstract: The ability to discriminate between generative graph models is critical to understanding complex structural patterns in both synthetic graphs and the real-world structures that they emulate. While Graph Neural Networks (GNNs) have seen increasing use to great effect in graph classification tasks, few studies explore their integration with interpretable graph-theoretic features. This paper investigates the classification of synthetic graph families using a hybrid approach that combines GNNs with engineered graph-theoretic features. We generate a large and structurally diverse synthetic dataset comprising graphs from five representative generative families: Erdős-Rényi, Watts-Strogatz, Barabási-Albert, Holme-Kim, and Stochastic Block Model. These graphs range in size up to 1x10^4 nodes, containing up to 1.1x10^5 edges. A comprehensive range of node and graph level features is extracted for each graph and pruned using a Random Forest based feature selection pipeline. The features are integrated into six GNN architectures: GCN, GAT, GATv2, GIN, GraphSAGE and GTN. Each architecture is optimised for hyperparameter selection using Optuna. Finally, models were compared against a baseline Support Vector Machine (SVM) trained solely on the handcrafted features. Our evaluation demonstrates that GraphSAGE and GTN achieve the highest classification performance, with 98.5% accuracy, and strong class separation evidenced by t-SNE and UMAP visualisations. GCN and GIN also performed well, while GAT-based models lagged due to limitations in their ability to capture global structures. The SVM baseline confirmed the importance of the message passing functionality for performance gains and meaningful class separation.
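Note: a minimal sketch of the pipeline's first two stages, generating graphs from two of the named families and extracting a few graph-level features. It uses only the Python standard library; the generators are textbook constructions, and the feature set, function names, and parameters are illustrative assumptions rather than the ones used in the study.

```python
import itertools
import random
import statistics


def erdos_renyi(n, p, seed=0):
    """G(n, p): every possible edge is included independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def barabasi_albert(n, m, seed=0):
    """Preferential attachment: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    targets = list(range(m))  # the first new node connects to the m seed nodes
    repeated = []             # each node appears once per incident edge
    for new in range(m, n):
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return adj


def graph_features(adj):
    """A few graph-level features of the kind a classifier could consume."""
    degrees = [len(neighbours) for neighbours in adj.values()]
    n = len(adj)
    edges = sum(degrees) // 2
    return {
        "nodes": n,
        "edges": edges,
        "density": 2 * edges / (n * (n - 1)),
        "mean_degree": statistics.mean(degrees),
        "degree_variance": statistics.pvariance(degrees),
        "max_degree": max(degrees),
    }


print(graph_features(erdos_renyi(200, 0.03, seed=1)))
print(graph_features(barabasi_albert(200, 3, seed=1)))
```

Features such as degree variance and maximum degree tend to separate preferential-attachment graphs from Erdős-Rényi graphs of similar density, which is the kind of signal handcrafted features can contribute.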

Title: Enhancing Medical Large Vision-Language Models via Alignment Distillation

Authors: Aofei Chang, Ting Wang, Fenglong Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18554
Pdf URL: https://arxiv.org/pdf/2512.18554
Copy Paste: [[2512.18554]] Enhancing Medical Large Vision-Language Models via Alignment Distillation(https://arxiv.org/abs/2512.18554)
Keywords: generation
Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.

Title: Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment

Authors: Ruiqi Chen (1), Giacomo Vedovati (2), Todd Braver (3), ShiNung Ching (2) ((1) Division of Biology and Biomedical Sciences, Washington University in St. Louis, (2) Department of Electrical and Systems Engineering, Washington University in St. Louis, (3) Department of Psychological and Brain Sciences, Washington University in St. Louis)
Subjects: cs.LG, eess.SY, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.18566
Pdf URL: https://arxiv.org/pdf/2512.18566
Copy Paste: [[2512.18566]] Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment(https://arxiv.org/abs/2512.18566)
Keywords: generation, generative
Abstract: Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM's ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.
摘要：循环神经网络 (RNN) 等动态系统模型在理论神经科学中越来越受欢迎，用于假设生成和数据分析。评估此类模型的动态是理解其学习生成机制的关键。然而，这种评估受到两个主要挑战的阻碍：首先，跨模型学习动力学的比较很困难，因为它们的坐标系没有强制等效。其次，在 RNN 等高维非线性模型中，识别机械上重要的低维主题（例如极限集）是很困难的。在这里，我们提出了一个解决这两个问题的综合框架，称为学习模型的微同胚向量场对齐（DFORM）。 DFORM 学习两个动态系统的状态空间之间的非线性坐标变换，以最大限度地一对一的方式对齐它们的轨迹。通过这样做，DFORM 可以评估两个模型是否表现出拓扑等效性，即尽管坐标系不同但机制相似。该方法的副产品是一种在嵌入高维系统的低维流形上定位动态图案的方法。我们使用规范拓扑等效系统、RNN 和非线性流相关系统验证了 DFORM 识别线性和非线性坐标变换的能力。 DFORM 还被证明可以提供拓扑不同系统之间相似性的量化。然后我们证明了 DFORM 可以在高维模型中定位重要的动态基序，包括不变流形和鞍极限集。最后，使用一组在人类功能 MRI (fMRI) 记录上训练的 RNN 模型，我们证明 DFORM 可以从高维数据驱动模型中识别极限环，这与之前的数值分析非常吻合。
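
The coordinate-alignment idea in this abstract can be illustrated with the standard smooth-conjugacy relation between two vector fields. This is a textbook identity, not necessarily the objective DFORM optimizes; the symbols $f$, $g$, $\varphi$ and the residual $\mathcal{L}$ are assumptions introduced here for illustration.

```latex
% Two systems \dot{x} = f(x) and \dot{y} = g(y), related by a diffeomorphism y = \varphi(x).
% Chain rule along a trajectory x(t):
\dot{y} = \frac{d}{dt}\,\varphi\bigl(x(t)\bigr) = D\varphi(x)\,\dot{x} = D\varphi(x)\,f(x)

% Trajectories of f are mapped onto trajectories of g exactly when
g\bigl(\varphi(x)\bigr) = D\varphi(x)\,f(x) \qquad \text{for all } x

% A hypothetical alignment residual that a learned \varphi could minimize:
\mathcal{L}(\varphi) = \mathbb{E}_{x}\,\bigl\| D\varphi(x)\,f(x) - g\bigl(\varphi(x)\bigr) \bigr\|^{2}

% Example: f(x) = -x and \varphi(x) = 2x give D\varphi = 2, so g(y) = -y and \mathcal{L} = 0.
```

Smooth conjugacy of this kind implies topological equivalence; the converse does not hold in general, since topological equivalence only requires a homeomorphism and allows time reparametrization.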

Title: SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Authors: Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.18583
Pdf URL: https://arxiv.org/pdf/2512.18583
Copy Paste: [[2512.18583]] SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models(https://arxiv.org/abs/2512.18583)
Keywords: generation
Abstract: Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at this https URL.
摘要：对抗性模仿学习（AIL）是模仿学习中的主导框架，它从专家演示中推断奖励以指导策略优化。尽管提供更多专家演示通常会提高性能和稳定性，但在某些情况下收集此类演示可能具有挑战性。受到扩散模型在数据生成方面的成功的启发，我们提出了 SD2AIL，它通过扩散模型利用综合演示。我们首先在鉴别器中采用扩散模型来生成合成演示作为伪专家数据，以增强专家演示。为了从大量（伪）专家演示中选择性地重放最有价值的演示，我们进一步引入了优先专家演示重放策略（PEDR）。模拟任务的实验结果证明了我们方法的有效性和鲁棒性。特别是，在 Hopper 任务中，我们的方法实现了 3441 的平均回报，比最先进的方法高出 89。我们的代码将在此 https URL 中提供。
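
The abstract does not say how PEDR assigns priorities, so the following is only a minimal sketch of priority-proportional sampling in the style of standard prioritized experience replay. The function names, the `alpha` exponent, and the example priorities are all hypothetical.

```python
import random


def replay_probabilities(priorities, alpha=1.0):
    """Turn non-negative priorities into sampling probabilities.

    alpha = 0 gives uniform sampling; larger alpha favors high-priority items.
    (Assumed scheme, borrowed from prioritized experience replay.)
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    if total <= 0:
        raise ValueError("at least one priority must be positive")
    return [s / total for s in scaled]


def sample_demonstrations(demos, priorities, batch_size, alpha=1.0, rng=None):
    """Draw a batch (with replacement) from real and synthetic demonstrations."""
    rng = rng or random.Random()
    probs = replay_probabilities(priorities, alpha)
    return rng.choices(demos, weights=probs, k=batch_size)


if __name__ == "__main__":
    pool = ["expert_a", "expert_b", "synthetic_c"]
    print(replay_probabilities([1.0, 1.0, 2.0]))
    print(sample_demonstrations(pool, [1.0, 1.0, 2.0], 5, rng=random.Random(0)))
```

In a setup like this the priority would typically come from some per-demonstration usefulness signal; what that signal is in SD2AIL is not stated in the abstract.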

Title: Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

Authors: Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18595
Pdf URL: https://arxiv.org/pdf/2512.18595
Copy Paste: [[2512.18595]] Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows(https://arxiv.org/abs/2512.18595)
Keywords: generation
Abstract: Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an "illusion of mastery", as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models' inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.
摘要：由于多尺度、异构物理过程的严重耦合，预测多物理场动力学的计算成本高昂且具有挑战性。虽然神经代理有望带来范式转变，但该领域目前正遭受着“掌握的幻觉”，正如顶级评论中反复强调的那样：现有的评估过度依赖于简化的、低维的代理，这无法暴露模型在现实制度中固有的脆弱性。为了弥补这一关键差距，我们提出了 REALM（多物理场真实人工智能学习），这是一个严格的基准测试框架，旨在测试具有挑战性的、应用驱动的反应流的神经代理。 REALM 具有 11 个高保真数据集，涵盖从规范的多物理场问题到复杂的推进和消防安全场景，以及标准化的端到端培训和评估协议，其中包含多物理场感知预处理和强大的推出策略。使用这个框架，我们系统地对十几个代表性代理模型系列进行基准测试，包括谱算子、卷积模型、Transformers、逐点算子和图/网格网络，并确定了三个稳健的趋势：（i）由维数、刚度和网格不规则性共同控制的缩放障碍，导致快速增长的推出误差； (ii) 性能主要由架构归纳偏差而不是参数计数控制； (iii) 名义精度指标和物理上值得信赖的行为之间持续存在差距，其中具有高相关性的模型仍然错过关键的瞬态结构和积分量。总而言之，REALM 揭示了当前神经代理在现实多物理场流上的局限性，并提供了严格的测试平台来推动下一代物理感知架构的开发。

Title: SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback

Authors: Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18599
Pdf URL: https://arxiv.org/pdf/2512.18599
Copy Paste: [[2512.18599]] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback(https://arxiv.org/abs/2512.18599)
Keywords: restoration
Abstract: Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns a lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluators and provide perceptual feedback for policy improvement. Once trained, our agent executes deterministic restoration plans without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.
摘要：复杂图像恢复旨在从受模糊、噪声、雨和压缩伪影等多种退化影响的输入中恢复高质量图像。最近的恢复代理由视觉语言模型和大型语言模型提供支持，提供了有前景的恢复功能，但由于反射、回滚和迭代工具搜索而遭受严重的效率瓶颈。此外，它们的性能在很大程度上取决于退化识别模型，需要大量注释进行训练，限制了它们在无标签环境中的适用性。为了解决这些限制，我们提出了一种基于策略优化的恢复框架，该框架学习轻量级代理来确定工具调用序列。该代理在顺序决策过程中运行，在每一步选择最合适的恢复操作以最大化最终图像质量。为了在无标签环境中进行训练，我们引入了一种由多模式大语言模型驱动的新颖奖励机制，该机制充当与人类一致的评估器，并为政策改进提供感知反馈。经过训练后，我们的代理将执行确定性恢复计划，无需多余的工具调用，从而显着加快推理速度，同时保持高恢复质量。大量实验表明，尽管不使用监督，我们的方法仍能在全参考指标上匹配 SOTA 性能，并在不同的退化场景中超越无参考指标的现有方法。

Title: PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

Authors: Ruiqi Chen, Kaitong Cai, Yijia Fan, Keze Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18614
Pdf URL: https://arxiv.org/pdf/2512.18614
Copy Paste: [[2512.18614]] PTTA: A Pure Text-to-Animation Framework for High-Quality Creation(https://arxiv.org/abs/2512.18614)
Keywords: generation
Abstract: Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.
摘要：传统动画制作流程复杂，人工成本高。虽然最近的视频生成模型（例如 Sora、Kling 和 CogVideoX）在自然视频合成方面取得了令人印象深刻的结果，但它们在应用于动画生成时表现出明显的局限性。最近的工作，例如 AniSora，通过针对动画风格微调图像到视频模型，展示了有希望的性能，但在文本到视频设置中的类似探索仍然有限。在这项工作中，我们提出了 PTTA，一个用于高质量动画创作的纯文本到动画框架。我们首先构建一个小规模但高质量的动画视频和文本描述配对数据集。基于预训练的文本到视频模型 HunyuanVideo，我们进行微调以使其适应动画风格的生成。跨多个维度的广泛视觉评估表明，所提出的方法始终优于动画视频合成中的可比基线。

Title: Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers

Authors: Xiyue Bai, Ronghao Yu, Jia Xiu, Pengfei Zhou, Jie Xia, Peng Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18635
Pdf URL: https://arxiv.org/pdf/2512.18635
Copy Paste: [[2512.18635]] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers(https://arxiv.org/abs/2512.18635)
Keywords: generation
Abstract: Generating or editing images directly from neural signals has immense potential at the intersection of neuroscience, vision, and brain-computer interaction. In this paper, we present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism to accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.
摘要：直接从神经信号生成或编辑图像在神经科学、视觉和脑机交互的交叉领域具有巨大的潜力。在本文中，我们提出了 Uni-Neur2Img，一个用于神经信号驱动图像生成和编辑的统一框架。该框架引入了基于 LoRA 的参数高效神经信号注入模块，该模块将每个调节信号作为可插拔组件独立处理，从而在不改变基本模型参数的情况下促进灵活的多模态调节。此外，我们采用因果注意机制来适应条件生成任务的长序列建模需求。现有的神经驱动生成研究主要关注文本模态作为条件或中间表示，导致对视觉模态作为直接调节信号的探索有限。为了弥补这一研究差距，我们引入了 EEG-Style 数据集。我们对公共基准和自行收集的神经信号数据集进行了综合评估：（1）在公共 CVPR40 数据集上进行脑电图驱动的图像生成； (2) 在公共 Loongx 数据集上进行神经信号引导图像编辑，以进行语义感知的局部修改； (3) 在我们自行收集的 EEG-Style 数据集上进行 EEG 驱动的风格迁移。广泛的实验结果表明，生成保真度、编辑一致性和风格转移质量显着提高，同时保持较低的计算开销和对其他模式的强大可扩展性。因此，Uni-Neur2Img 为桥接神经信号和视觉内容生成提供了统一、高效且可扩展的解决方案。

Title: Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Authors: Linwei Qiu, Gongzhe Li, Xiaozhe Zhang, Qinlin Sun, Fengying Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18718
Pdf URL: https://arxiv.org/pdf/2512.18718
Copy Paste: [[2512.18718]] Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts(https://arxiv.org/abs/2512.18718)
Keywords: restoration
Abstract: Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a Deformation Module, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.
摘要：图像校正和矩形调整是智能手机等实际摄影系统中很有价值的任务。不可否认，深度学习最近取得的显着进步给这些领域带来了显着的性能提升。然而，现有的方法主要依赖于特定于任务的架构。这极大地限制了它们在各种不同任务中的泛化能力和有效应用。在本文中，我们介绍了统一校正框架（UniRect），这是一种从一致的失真校正角度解决这些实际任务的综合方法。我们的方法通过模拟不同类型的镜头，将各种特定于任务的逆问题合并到通用畸变模型中。为了处理各种变形，UniRect采用了一个具有双组件结构的任务无关的校正框架：一个{变形模块}，它利用一种新颖的残余渐进薄板样条（RP-TPS）模型来解决复杂的几何变形，以及随后的恢复模块，它采用残余曼巴块（RMB）来抵消变形过程引起的退化并增强输出图像的保真度。此外，稀疏专家混合（SMoE）结构旨在避免多任务学习中由于不同的扭曲而导致的繁重任务竞争。大量的实验表明，与其他最新方法相比，我们的模型已经实现了最先进的性能。

Title: Generating Risky Samples with Conformity Constraints via Diffusion Models

Authors: Han Yu, Hao Zou, Xingxuan Zhang, Zhengyi Wang, Yue He, Kehan Li, Peng Cui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18722
Pdf URL: https://arxiv.org/pdf/2512.18722
Copy Paste: [[2512.18722]] Generating Risky Samples with Conformity Constraints via Diffusion Models(https://arxiv.org/abs/2512.18722)
Keywords: generation
Abstract: Although neural networks achieve promising performance in many tasks, they may still fail when encountering some examples and bring about risks to applications. To discover risky samples, previous literature attempts to search for patterns of risky samples within existing datasets or inject perturbation into them. Yet in this way the diversity of risky samples is limited by the coverage of existing datasets. To overcome this limitation, recent works adopt diffusion models to produce new risky samples beyond the coverage of existing datasets. However, these methods struggle with conformity between generated samples and expected categories, which could introduce label noise and severely limit their effectiveness in applications. To address this issue, we propose RiskyDiff, which incorporates the embeddings of both texts and images as implicit constraints of category conformity. We also design a conformity score to further explicitly strengthen the category conformity, and introduce the mechanisms of embedding screening and risky gradient guidance to boost the risk of generated samples. Extensive experiments reveal that RiskyDiff greatly outperforms existing methods in terms of the degree of risk, generation quality, and conformity with conditioned categories. We also empirically show that the generalization ability of the models can be enhanced by augmenting training data with generated samples of high conformity.
摘要：尽管神经网络在许多任务中取得了可喜的性能，但在遇到某些示例时仍然可能会失败，并给应用带来风险。为了发现风险样本，以前的文献尝试在现有数据集中搜索风险样本的模式或向其中注入扰动。然而，通过这种方式，风险样本的多样性受到现有数据集覆盖范围的限制。为了克服这一限制，最近的工作采用扩散模型来产生超出现有数据集覆盖范围的新风险样本。然而，这些方法在生成的样本和预期类别之间的一致性方面存在困难，这可能会引入标签噪声并严重限制其在应用中的有效性。为了解决这个问题，我们提出了 RiskyDiff，它将文本和图像的嵌入作为类别一致性的隐式约束。我们还设计了一个一致性评分来进一步明确地强化类别一致性，并引入嵌入筛选和风险梯度引导机制来提高生成样本的风险。大量实验表明，RiskyDiff 在风险程度、生成质量和条件类别符合性方面大大优于现有方法。我们还凭经验表明，可以通过使用生成的高一致性样本来增强训练数据来增强模型的泛化能力。

Title: $M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Authors: Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18735
Pdf URL: https://arxiv.org/pdf/2512.18735
Copy Paste: [[2512.18735]] $M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models(https://arxiv.org/abs/2512.18735)
Keywords: generation
Abstract: Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from this https URL and full benchmark data from this https URL.
摘要：现代大型多模态模型（LMM）在静态图像和单态时空理解方面表现出了非凡的能力。然而，它们理解两个不同视频观察之间共享空间上下文中物体动态变化的能力在很大程度上仍未被探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤其重要。在本文中，我们引入了 $M^3-Verse$，一个多模式、多状态、多维度的基准，来正式评估这种能力。它建立在配对视频的基础上，提供状态变化前后室内场景的多视角观察。该基准测试总共包含 270 个场景和 2,932 个问题，分为 50 多个子任务，探讨 4 个核心能力。我们评估了 16 个最先进的 LMM，并观察了它们在跟踪状态转换方面的局限性。为了应对这些挑战，我们进一步提出了一个简单而有效的基线，可以在多状态感知方面实现显着的性能改进。因此，$M^3-Verse$ 提供了一个具有挑战性的新测试平台，通过更全面地了解我们的动态视觉世界来促进下一代模型的开发。您可以从此 https URL 获取构建管道，并从此 https URL 获取完整的基准数据。

Title: Is Your Conditional Diffusion Model Actually Denoising?

Authors: Daniel Pfrommer, Zehao Dou, Christopher Scarvelis, Max Simchowitz, Ali Jadbabaie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18736
Pdf URL: https://arxiv.org/pdf/2512.18736
Copy Paste: [[2512.18736]] Is Your Conditional Diffusion Model Actually Denoising?(https://arxiv.org/abs/2512.18736)
Keywords: generation, generative
Abstract: We study the inductive biases of diffusion models with a conditioning variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized "denoising" process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g., DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.
摘要：我们研究带有条件变量的扩散模型的归纳偏差，该模型已广泛应用于文本条件生成图像模型和观察条件连续控制策略。我们观察到，当有条件地查询这些模型时，它们的生成始终偏离制定扩散模型的理想化“去噪”过程，从而导致流行采样算法（例如 DDPM、DDIM）之间存在分歧。我们引入了进度偏差，这是一种严格的测量方法，可以捕获标准去噪过程的偏差率，并提供计算它的方法。至关重要的是，我们证明了与理想化去噪过程的偏差与模型容量或训练数据量无关。我们假设这种现象的发生是由于难以在调节空间的不同部分桥接不同的降噪流，并从理论上表明这种现象是如何通过对平滑度的归纳偏差而产生的。

Title: Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Authors: Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18741
Pdf URL: https://arxiv.org/pdf/2512.18741
Copy Paste: [[2512.18741]] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation(https://arxiv.org/abs/2512.18741)
Keywords: generation
Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
摘要：帧级自回归（frame-AR）模型取得了重大进展，实现了与双向扩散模型相当的实时视频生成，并成为交互式世界模型和游戏引擎的基础。然而，目前长视频生成的方法通常依赖于窗口注意力，它天真地丢弃窗口外的历史上下文，导致灾难性的遗忘和场景不一致；相反，保留完整的历史记录会产生高昂的内存成本。为了解决这种权衡问题，我们提出了 \textbf{Memorize-and-Generate (MAG)}，这是一个将内存压缩和帧生成解耦为不同任务的框架。具体来说，我们训练一个内存模型来将历史信息压缩到一个紧凑的 KV 缓存中，并训练一个单独的生成器模型来利用这种压缩表示来合成后续帧。此外，我们引入 \textbf{MAG-Bench} 来严格评估历史内存保留。大量实验表明，MAG 实现了卓越的历史场景一致性，同时在标准视频生成基准上保持了具有竞争力的性能。

Title: MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18766
Pdf URL: https://arxiv.org/pdf/2512.18766
Copy Paste: [[2512.18766]] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation(https://arxiv.org/abs/2512.18766)
Keywords: generation, generative
Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core difficulty is that policy optimization requires accounting for the likelihood of each step, because generation is a multi-step, iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas naively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
摘要：强化学习 (RL) 已展现出在训练后语言模型和自回归视觉生成模型方面的巨大潜力，但使 RL 适应屏蔽生成模型仍然具有挑战性。核心因素是策略优化需要考虑每个步骤的概率可能性，因为其多步骤和迭代细化过程。这种对整个采样轨迹的依赖会带来很高的计算成本，而本地优化随机步骤通常会产生次优结果。在本文中，我们提出了 MaskFocus，这是一种新颖的 RL 框架，它通过关注关键步骤来实现屏蔽生成模型的有效策略优化。具体来说，我们通过测量每个采样步骤的中间图像与最终生成的图像之间的相似度来确定步骤级信息增益。至关重要的是，我们利用这一点来确定最关键和最有价值的步骤，并对它们执行有针对性的策略优化。此外，我们设计了一种基于熵的动态路由采样机制，以鼓励模型对低熵样本探索更有价值的掩蔽策略。对多个文本到图像基准的广泛实验验证了我们方法的有效性。
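
The step-selection idea described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: images are flattened to plain lists of floats, similarity is cosine similarity, and the "information gain" of a step is taken to be the increase in similarity to the final image relative to the previous step. All function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def step_information_gain(intermediates, final):
    """Return (similarities, gains), one entry per sampling step.

    gains[t] is the similarity increase at step t; the first step is
    measured against a baseline similarity of 0 (an assumption).
    """
    sims = [cosine_similarity(x, final) for x in intermediates]
    gains = [sims[0]] + [sims[t] - sims[t - 1] for t in range(1, len(sims))]
    return sims, gains


def select_critical_steps(gains, k):
    """Indices of the k steps with the largest gain, in chronological order."""
    ranked = sorted(range(len(gains)), key=lambda t: -gains[t])
    return sorted(ranked[:k])


if __name__ == "__main__":
    final_image = [1.0, 1.0]
    trajectory = [[1.0, 0.0], [1.0, 0.1], [1.0, 1.0]]
    sims, gains = step_information_gain(trajectory, final_image)
    print(select_critical_steps(gains, k=2))
```

Policy-gradient updates would then be computed only on the selected steps, which is where the saving over full-trajectory optimization would come from.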

Title: In-Context Audio Control of Video Diffusion Transformers

Authors: Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18772
Pdf URL: https://arxiv.org/pdf/2512.18772
Copy Paste: [[2512.18772]] In-Context Audio Control of Video Diffusion Transformers(https://arxiv.org/abs/2512.18772)
Keywords: generation
Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
摘要：视频生成领域的最新进展已经转向统一的、基于变压器的基础模型，该模型可以处理上下文中的多个条件输入。然而，这些模型主要关注文本、图像和深度图等模态，而对音频等严格时间同步信号的探索还不够。本文介绍了视频扩散变压器 (ICAC) 的上下文音频控制，这是一个框架，研究音频信号的集成，以在统一的全注意力架构（类似于 FullDiT）中生成语音驱动的视频。我们系统地探索了注入音频条件的三种不同机制：标准交叉注意力、2D 自注意力和统一 3D 自注意力。我们的研究结果表明，虽然 3D 注意力在捕捉时空视听相关性方面具有最大潜力，但它也带来了巨大的训练挑战。为了克服这个问题，我们提出了一种 Masked 3D Attention 机制，该机制限制注意力模式以强制时间对齐，从而实现稳定的训练和卓越的性能。我们的实验表明，这种方法以音频流和参考图像为条件，实现了强大的唇形同步和视频质量。
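
A mask that enforces temporal alignment between audio and video tokens could look like the sketch below. The token layout (all video tokens first, then all audio tokens), the `window` parameter, and the rule that same-modality tokens attend freely are assumptions made for illustration; the abstract does not specify the exact pattern used by ICAC.

```python
def build_audio_video_mask(num_frames, video_per_frame, audio_per_frame, window=0):
    """Boolean attention mask; mask[i][j] is True if token i may attend to token j.

    Same-modality pairs are always allowed. Cross-modal pairs are allowed
    only when their frame indices differ by at most `window`.
    """
    kinds, frames = [], []
    for t in range(num_frames):
        kinds += ["video"] * video_per_frame
        frames += [t] * video_per_frame
    for t in range(num_frames):
        kinds += ["audio"] * audio_per_frame
        frames += [t] * audio_per_frame

    n = len(kinds)
    return [
        [kinds[i] == kinds[j] or abs(frames[i] - frames[j]) <= window for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    mask = build_audio_video_mask(num_frames=3, video_per_frame=2, audio_per_frame=1)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Restricting cross-modal attention this way shrinks the set of audio-video pairs the model has to sort out, which is one plausible reading of why a constrained pattern trains more stably than unconstrained 3D attention.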

Title: Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Authors: Guangtao Lyu, Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2512.18804
Pdf URL: https://arxiv.org/pdf/2512.18804
Copy Paste: [[2512.18804]] Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation(https://arxiv.org/abs/2512.18804)
Keywords: generation
Abstract: Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
摘要：音乐到 3D 舞蹈生成旨在从音乐中合成逼真且节奏同步的人类舞蹈。虽然现有的方法通常依赖于额外的流派标签来进一步改进舞蹈生成，但这些标签通常是嘈杂的、粗糙的、不可用的或不足以捕捉现实世界音乐的多样性，这可能导致节奏错位或风格漂移。相比之下，我们观察到节奏（反映音乐节奏和节奏的核心属性）在数据集和流派中保持相对一致，通常范围为 60 到 200 BPM。基于这一发现，我们提出了 TempoMoE，一种分层节奏感知混合专家模块，可增强扩散模型及其节奏感知。 TempoMoE 将运动专家组织成不同节奏范围的节奏结构组，多尺度节拍专家捕捉精细和长范围的节奏动态。分层节奏自适应路由可动态选择和融合音乐特征中的专家，从而无需手动进行流派标签即可实现灵活、节奏一致的生成。大量实验表明 TempoMoE 在舞蹈质量和节奏对齐方面取得了最先进的结果。
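
To make the tempo-structured grouping concrete, here is a minimal sketch that maps a tempo in the 60 to 200 BPM range onto one of several equal-width expert groups. Hard bucketing, the number of groups, and the function name are assumptions; the paper describes a learned Hierarchical Rhythm-Adaptive Routing, which this does not reproduce.

```python
def tempo_group(bpm, num_groups=4, low=60.0, high=200.0):
    """Index of the tempo group for a given BPM.

    Tempos outside [low, high] are clamped to the nearest edge, and the
    upper edge itself falls into the last group.
    """
    clamped = min(max(bpm, low), high)
    width = (high - low) / num_groups
    return min(int((clamped - low) / width), num_groups - 1)


if __name__ == "__main__":
    for bpm in (60, 94.9, 95, 130, 200):
        print(bpm, tempo_group(bpm))
```

With four groups each bucket spans 35 BPM, so a 130 BPM track lands in the third group (index 2).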

Title: Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction

Authors: Guangtao Lyu, Xinyi Cheng, Chenghao Xu, Qi Liu, Muli Yang, Fen Fang, Huilin Chen, Jiexi Yan, Xu Yang, Cheng Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18813
Pdf URL: https://arxiv.org/pdf/2512.18813
Copy Paste: [[2512.18813]] Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction(https://arxiv.org/abs/2512.18813)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
摘要：大视觉语言模型（LVLM）已显示出非凡的能力，但幻觉仍然是一个持续存在的挑战。这项工作对 LVLM 中视觉感知和令牌生成的内部演化进行了系统分析，揭示了两个关键模式。首先，感知遵循三阶段的 GATE 过程：早期层执行全局扫描，中间层接近并收紧核心内容，后面的层探索补充区域。其次，生成呈现出 SAD（从属积累到主导）模式，其中幻觉令牌是由于缺乏注意力（视觉感知）或前馈网络（内部知识）支持的次主导令牌的重复积累而产生的。在这些发现的指导下，我们设计了 VDC（验证主导校正）策略，该策略检测不受支持的令牌并用经过验证的主导令牌替换它们，以提高输出可靠性。跨多个模型和基准的大量实验证实 VDC 可以显着减轻幻觉。

Title: EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Authors: Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, Jieping Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18814
Pdf URL: https://arxiv.org/pdf/2512.18814
Copy Paste: [[2512.18814]] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer(https://arxiv.org/abs/2512.18814)
Keywords: generation
Abstract: Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.

Title: Generative Modeling through Spectral Analysis of Koopman Operator

Authors: Yuanchao Xu, Fengyi Li, Masahiro Fujisawa, Youssef Marzouk, Isao Ishikawa
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2512.18837
Pdf URL: https://arxiv.org/pdf/2512.18837
Copy Paste: [[2512.18837]] Generative Modeling through Spectral Analysis of Koopman Operator(https://arxiv.org/abs/2512.18837)
Keywords: generation, generative
Abstract: We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a generative modeling framework that combines operator-theoretic spectral analysis with optimal transport. The novel insight is that the spectral structure required for accelerated Wasserstein gradient descent can be estimated directly from trajectory data via Koopman operator approximation, which can eliminate the need for explicit knowledge of the target potential or for neural network training. We provide a rigorous convergence analysis and establish a connection to Feynman-Kac theory that clarifies the method's probabilistic foundation. Experiments across diverse settings, including compact manifold sampling, metastable multi-well systems, image generation, and high-dimensional stochastic partial differential equations, demonstrate that KSWGD consistently achieves faster convergence than existing methods while maintaining high sample quality.

Title: The Ensemble Schrödinger Bridge filter for Nonlinear Data Assimilation

Authors: Feng Bao, Hui Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18928
Pdf URL: https://arxiv.org/pdf/2512.18928
Copy Paste: [[2512.18928]] The Ensemble Schrödinger Bridge filter for Nonlinear Data Assimilation(https://arxiv.org/abs/2512.18928)
Keywords: generative
Abstract: This work puts forward a novel nonlinear optimal filter, namely the Ensemble Schrödinger Bridge nonlinear filter. The proposed filter combines the standard prediction procedure with diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach has no structural model error, and it is derivative-free, training-free, and highly parallelizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimensions, up to 40 or above, under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the particle filter in numerous tests given different levels of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.

Title: LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Authors: Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.18930
Pdf URL: https://arxiv.org/pdf/2512.18930
Copy Paste: [[2512.18930]] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer(https://arxiv.org/abs/2512.18930)
Keywords: generative
Abstract: Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining to brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, matching or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.

Title: When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Authors: Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18934
Pdf URL: https://arxiv.org/pdf/2512.18934
Copy Paste: [[2512.18934]] When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models(https://arxiv.org/abs/2512.18934)
Keywords: generation
Abstract: Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at this https URL.
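
The "quantization-induced noise" that the abstract hypothesizes about can be made concrete with a minimal sketch. It assumes symmetric, per-tensor, round-to-nearest quantization; the abstract does not say which quantization scheme the authors used, and the function names and weight values here are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def max_error(weights: List[float], num_bits: int) -> float:
    """Largest absolute rounding error introduced by a quantize/dequantize round trip."""
    q, scale = quantize_symmetric(weights, num_bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 0.9]
print("INT8 max error:", max_error(weights, 8))
print("INT4 max error:", max_error(weights, 4))
```

Under this scheme the per-weight error is bounded by half the scale, so lower bit widths perturb the weights more. Whether that perturbation acts as a regularizer is the paper's hypothesis, not something this sketch shows.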

Title: Symmetrization of 3D Generative Models

Authors: Nicolas Caytuiro, Ivan Sipiran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18953
Pdf URL: https://arxiv.org/pdf/2512.18953
Copy Paste: [[2512.18953]] Symmetrization of 3D Generative Models(https://arxiv.org/abs/2512.18953)
Keywords: generation, generative
Abstract: We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are more symmetrical and consistent than both the objects generated by the original model and the objects in the original dataset.
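
The half-object idea can be sketched on a point cloud: keep one side of the x=0 plane, then mirror it back to obtain a shape that is symmetric by construction. This is one reading of the abstract, not the authors' pipeline; the point-cloud representation, the choice of the x >= 0 side, and the function names are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def make_half_object(points: List[Point]) -> List[Point]:
    """Keep only the points on the non-negative side of the x=0 plane."""
    return [p for p in points if p[0] >= 0.0]


def reflect_to_full(half: List[Point]) -> List[Point]:
    """Mirror a half-object across x=0 and merge it with the original half."""
    # Points lying exactly on the plane are their own mirror image, so skip them.
    mirrored = [(-x, y, z) for (x, y, z) in half if x > 0.0]
    return half + mirrored


# Hypothetical asymmetric shape with four points.
shape = [(0.5, 0.1, 0.0), (-0.4, 0.2, 0.3), (0.0, 1.0, 1.0), (0.2, -0.3, 0.7)]
half = make_half_object(shape)   # training-time view: one half only
full = reflect_to_full(half)     # generation-time view: completed by reflection
print(len(half), len(full))
```

In this reading a generative model would be trained on outputs of `make_half_object`, with `reflect_to_full` applied to its samples.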

Title: Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18957
Pdf URL: https://arxiv.org/pdf/2512.18957
Copy Paste: [[2512.18957]] Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation(https://arxiv.org/abs/2512.18957)
Keywords: generative
Abstract: The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge, such as access to a generative model or a large offline dataset, and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.

Title: DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation

Authors: Guandong Li, Yijun Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18964
Pdf URL: https://arxiv.org/pdf/2512.18964
Copy Paste: [[2512.18964]] DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation(https://arxiv.org/abs/2512.18964)
Keywords: generation
Abstract: Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to "Semantic-Visual Dissonance," where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural "sticker-like" effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's "visual soul" without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.

Title: CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

Authors: Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang, Suhui Wu, Yongxin Chen, Hao Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19020
Pdf URL: https://arxiv.org/pdf/2512.19020
Copy Paste: [[2512.19020]] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization(https://arxiv.org/abs/2512.19020)
Keywords: generation
Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at this https URL.

Title: Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation

Authors: Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao, Jianyang Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19026
Pdf URL: https://arxiv.org/pdf/2512.19026
Copy Paste: [[2512.19026]] Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation(https://arxiv.org/abs/2512.19026)
Keywords: generation, generative
Abstract: The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
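
The ranking view described in the abstract can be sketched as follows: each generated image is a query, the gallery is ranked by similarity, and mean average precision rewards queries whose own identity appears near the top. The cosine similarity, the 2-D embeddings, and the function names are assumptions made for illustration; the abstract does not specify the feature extractor or the similarity measure.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_precision(query: Sequence[float], query_id: str,
                      gallery: List[Sequence[float]], gallery_ids: List[str]) -> float:
    """Average precision of one query against an identity-labeled gallery."""
    order = sorted(range(len(gallery)),
                   key=lambda i: cosine(query, gallery[i]), reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if gallery_ids[i] == query_id:
            hits += 1
            precision_sum += hits / rank   # precision at each correct match
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, query_ids, gallery, gallery_ids) -> float:
    """Mean of per-query average precision."""
    aps = [average_precision(q, qid, gallery, gallery_ids)
           for q, qid in zip(queries, query_ids)]
    return sum(aps) / len(aps)


# Hypothetical embeddings: identities "a" and "b" with two gallery images each.
gallery = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
gallery_ids = ["a", "a", "b", "b"]
faithful = (1.0, 0.05)   # generated image that stays close to identity "a"
drifted = (0.2, 1.0)     # generated image of "a" that drifted toward "b"
print(average_precision(faithful, "a", gallery, gallery_ids))
print(average_precision(drifted, "a", gallery, gallery_ids))
```

The drifted query scores lower because images of the wrong identity outrank the correct ones. This is the kind of identity drift that a gallery-based protocol is meant to expose.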

Title: WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Authors: Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19048
Pdf URL: https://arxiv.org/pdf/2512.19048
Copy Paste: [[2512.19048]] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency(https://arxiv.org/abs/2512.19048)
Keywords: generation, generative
Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.

Title: Decoupled Generative Modeling for Human-Object Interaction Synthesis

Authors: Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19049
Pdf URL: https://arxiv.org/pdf/2512.19049
Copy Paste: [[2512.19049]] Decoupled Generative Modeling for Human-Object Interaction Synthesis(https://arxiv.org/abs/2512.19049)
Keywords: generative
Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

Title: Efficient Personalization of Generative Models via Optimal Experimental Design

Authors: Guy Schacht, Ziyad Sheebaelhamd, Riccardo De Santi, Mojmír Mutný, Andreas Krause
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2512.19057
Pdf URL: https://arxiv.org/pdf/2512.19057
Copy Paste: [[2512.19057]] Efficient Personalization of Generative Models via Optimal Experimental Design(https://arxiv.org/abs/2512.19057)
Keywords: generative
Abstract: Preference learning from human feedback can align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can efficiently elucidate the latent reward function modeling user preferences. We formulate the problem of preference query selection as one of maximizing the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm, ED-PBRL, that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically evaluate the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires fewer preference queries than random query selection.

Title: Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding

Authors: Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu, Zhihong Jin, Zheng Hu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.19070
Pdf URL: https://arxiv.org/pdf/2512.19070
Copy Paste: [[2512.19070]] Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding(https://arxiv.org/abs/2512.19070)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD), a method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: this https URL)

Title: Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Authors: Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19115
Pdf URL: https://arxiv.org/pdf/2512.19115
Copy Paste: [[2512.19115]] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?(https://arxiv.org/abs/2512.19115)
Keywords: generation, generative
Abstract: Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.

Title: OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Authors: Wendong Bu, Kaihang Pan, Yuze Lin, Jiacheng Li, Kai Shen, Wenqiao Zhang, Juncheng Li, Jun Xiao, Siliang Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19159
Pdf URL: https://arxiv.org/pdf/2512.19159
Copy Paste: [[2512.19159]] OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions(https://arxiv.org/abs/2512.19159)
Keywords: generation
Abstract: Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: this https URL.

Title: HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry

Authors: Na Gao, Chenfei Ye, Yanwu Yang, Anqi Li, Zhengbo He, Li Liang, Zhiyuan Liu, Xingyu Hao, Ting Ma, Tengfei Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19214
Pdf URL: https://arxiv.org/pdf/2512.19214
Copy Paste: [[2512.19214]] HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry(https://arxiv.org/abs/2512.19214)
Keywords: generative
Abstract: Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, the high inter-subject variability and complex folding pattern of the human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus' conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality, and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability than existing shape models.
摘要：海马亚结构的准确表征对于检测细微的结构变化和识别早期神经退行性生物标志物至关重要。然而，人类海马体的高受试者间变异性和复杂的折叠模式阻碍了跨受试者和纵向分析的一致性。大多数现有方法依赖于特定于主题的建模，并且缺乏稳定的内在坐标系来适应解剖变异性，这限制了它们建立可靠的个体间和个体内对应关系的能力。为了解决这个问题，我们提出了 HippMetric，这是一种基于骨骼表示（s-rep）的框架，用于海马亚结构形态测量以及个体和扫描之间的逐点对应。 HippMetric 以轴参考形态模型 (ARMM) 为基础，采用与海马解剖结构和功能一致的可变形骨骼坐标系，为对应提供生物基础参考。我们的框架包括两个核心模块：基于骨骼的坐标系统，尊重海马体保守的纵向层状结构，其中功能单元（层状）垂直于长轴堆叠，从而实现跨受试者和时间的解剖学一致定位；以及通过表面重建、变形和几何约束辐条细化生成的个性化 s-reps，强制边界遵守、正交性和不相交，以生成数学上有效的骨架几何形状。对两个国际队列进行的广泛实验表明，与现有形状模型相比，HippMetric 具有更高的准确性、可靠性和对应稳定性。

Title: Regression generation adversarial network based on dual data evaluation strategy for industrial application

Authors: Zesen Wang, Yonggang Li, Lijuan Lan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19232
Pdf URL: https://arxiv.org/pdf/2512.19232
Copy Paste: [[2512.19232]] Regression generation adversarial network based on dual data evaluation strategy for industrial application(https://arxiv.org/abs/2512.19232)
Keywords: generation, generative
Abstract: Soft sensing infers hard-to-measure data through a large number of easily obtainable variables. However, in complex industrial scenarios, the issue of insufficient data volume persists, which diminishes the reliability of soft sensing. Generative Adversarial Networks (GANs) are one of the effective solutions for addressing insufficient samples. Nevertheless, traditional GANs fail to account for the mapping relationship between labels and features, which limits further performance improvement. Although some studies have proposed solutions, none have considered both performance and efficiency simultaneously. To address these problems, this paper proposes a multi-task learning-based regression GAN framework that integrates regression information into both the discriminator and generator, and implements a shallow sharing mechanism between the discriminator and regressor. This approach significantly enhances the quality of generated samples while improving the algorithm's operational efficiency. Moreover, considering the importance of training samples and generated samples, a dual data evaluation strategy is designed to make the GAN generate more diverse samples, thereby increasing the generalization of subsequent modeling. The superiority of the proposed method is validated through four classic industrial soft sensing cases: wastewater treatment plants, surface water, $CO_2$ absorption towers, and industrial gas turbines.
摘要：软测量通过大量容易获得的变量来推断难以测量的数据。然而，在复杂的工业场景中，数据量不足的问题仍然存在，这降低了软测量的可靠性。生成对抗网络（GAN）是解决样本不足的有效解决方案之一。然而，传统的GAN未能考虑标签和特征之间的映射关系，这限制了性能的进一步提升。尽管一些研究提出了解决方案，但没有一个研究同时考虑性能和效率。为了解决这些问题，本文提出了基于多任务学习的回归 GAN 框架，将回归信息集成到判别器和生成器中，并在判别器和回归器之间实现浅层共享机制。这种方法显着提高了生成样本的质量，同时提高了算法的运行效率。而且，考虑到训练样本和生成样本的重要性，设计了双重数据评估策略，使GAN生成更加多样化的样本，从而增加后续建模的泛化性。通过废水处理厂、地表水、$CO_2$吸收塔和工业燃气轮机四个经典工业软测量案例验证了该方法的优越性。

Title: VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Authors: Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19243
Pdf URL: https://arxiv.org/pdf/2512.19243
Copy Paste: [[2512.19243]] VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis(https://arxiv.org/abs/2512.19243)
Keywords: generation, generative
Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
摘要：生成模型现在可以生成逼真的图像，但它们仍然难以应对专业设计师发出的长的、多目标的提示。为了揭示这一差距并更好地评估模型在现实环境中的性能，我们引入了 Long Goal Bench (LGBench)，这是一个包含 2,000 个任务的套件（1,000 个 T2I 和 1,000 个 I2I），其平均指令包含 18 到 22 个紧密耦合的目标，涵盖全局布局、局部对象放置、版式和徽标保真度。我们发现，即使是最先进的模型也只能满足不到 72% 的目标，并且经常会错过本地化编辑，这证实了当前流程的脆弱性。为了解决这个问题，我们推出了 VisionDirector，一种免训练的视觉语言监控器，它 (i) 从长指令中提取结构化目标，(ii) 在一次性生成和分阶段编辑之间动态做出决定，(iii) 在每次编辑后运行微网格采样并进行语义验证和回滚，以及 (iv) 记录目标级别奖励。我们通过组相对策略优化进一步微调规划器，产生更短的编辑轨迹（3.1 与 4.2 步）和更强的对齐。 VisionDirector 在 GenEval（整体提高 7%）和 ImgEdit（绝对提高 0.07）上实现了新的技术水平，同时在排版、多对象场景和姿势编辑方面产生了一致的质量改进。
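
The VisionDirector abstract mentions fine-tuning the planner with Group Relative Policy Optimization. As a minimal sketch of the group-relative idea only, the snippet below normalizes each reward against the other samples drawn for the same instruction. The function name, the reward values, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical goal-level rewards for four edit trajectories of one instruction.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Trajectories that score above their group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.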

Title: 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

Authors: Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19271
Pdf URL: https://arxiv.org/pdf/2512.19271
Copy Paste: [[2512.19271]] 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory(https://arxiv.org/abs/2512.19271)
Keywords: generation
Abstract: Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
摘要：最近的图像生成方法通常单独处理主题、风格和结构驱动的调节，导致特征纠缠和有限的任务可转移性。在本文中，我们介绍了 3SGen，这是一个任务感知统一框架，可在单个模型中执行所有三种调节模式。 3SGen 采用配备可学习语义查询的 MLLM 来对齐文本图像语义，并辅以保留细粒度视觉细节的 VAE 分支。其核心是，自适应任务特定内存 (ATM) 模块通过轻量级门控机制和多个可扩展内存项，动态解开、存储和检索特定于条件的先验信息，例如主题的身份、样式的纹理和结构的空间布局。这种设计减轻了任务间干扰，并自然地扩展到组合输入。此外，我们提出了 3SGen-Bench，这是一个统一的图像驱动生成基准，具有用于评估跨任务保真度和可控性的标准化指标。对我们提出的 3SGen-Bench 和其他公共基准进行的广泛实验证明了我们在各种图像驱动生成任务中的卓越性能。

Title: Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation

Authors: Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19275
Pdf URL: https://arxiv.org/pdf/2512.19275
Copy Paste: [[2512.19275]] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation(https://arxiv.org/abs/2512.19275)
Keywords: generative
Abstract: Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
摘要：生成式人工智能 (GenAI) 模型彻底改变了动画，能够以卓越的视觉保真度合成人类和运动模式。然而，生成真正逼真的人类动画仍然是一个艰巨的挑战，即使是微小的不一致也会使主题显得不自然。当对人工智能生成的视频进行行为生物识别评估时，这种限制尤其重要，其中定义身份的微妙运动线索很容易丢失或扭曲。本研究调查最先进的 GenAI 人类动画模型是否可以保留通过步态生物识别进行人员识别所需的微妙时空细节。具体来说，我们在两个主要评估任务中评估了四种不同的 GenAI 模型，以评估它们的能力：i）在不同的复杂性条件下从参考视频中恢复步态模式，以及 ii）将这些步态模式转移到不同的视觉身份。我们的结果表明，虽然视觉质量大多很高，但在专注于识别的任务中，生物识别保真度仍然很低，这表明当前的 GenAI 模型很难将身份与运动分开。此外，通过身份转移任务，我们暴露了基于外观的步态识别的一个基本缺陷：当纹理与运动分离时，识别崩溃，证明当前的 GenAI 模型依赖于视觉属性而不是时间动态。

Title: RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning

Authors: Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19300
Pdf URL: https://arxiv.org/pdf/2512.19300
Copy Paste: [[2512.19300]] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning(https://arxiv.org/abs/2512.19300)
Keywords: generation
Abstract: Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs, manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
摘要：通过集成来自不同类别的不同文本概念来合成新颖的对象仍然是文本到图像（T2I）生成中的重大挑战。现有的方法常常存在概念混合不充分、缺乏严格的评估以及输出不理想的问题——表现为概念不平衡、肤浅的组合或仅仅是并置。为了解决这些限制，我们提出了强化混合学习（RMLer），这是一个将跨类别概念融合制定为强化学习问题的框架：混合特征作为状态，混合策略作为动作，视觉结果作为奖励。具体来说，我们设计了一个 MLP 策略网络来预测混合跨类别文本嵌入的动态系数。我们进一步引入基于（1）语义相似性和（2）融合对象及其组成概念之间的组合平衡的视觉奖励，通过近端策略优化来优化策略。在推理时，选择策略利用这些奖励来策划最高质量的融合对象。大量的实验证明了 RMLer 在合成不同类别的连贯、高保真对象方面的优越性，优于现有方法。我们的工作为生成新颖的视觉概念提供了一个强大的框架，在电影、游戏和设计中具有广阔的应用前景。
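
The RMLer abstract describes a policy that predicts coefficients for blending cross-category text embeddings. The sketch below shows only the blending step, assuming a single scalar coefficient and a convex combination. The abstract does not say whether the coefficients are scalar or per-dimension, so `blend_embeddings` and its inputs are hypothetical.

```python
def blend_embeddings(emb_a, emb_b, coeff):
    """Convex combination of two equal-length embeddings.

    coeff is the weight on emb_a. In RMLer it would come from the policy
    network; here it is passed in by hand for illustration.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("coeff must lie in [0, 1]")
    return [coeff * a + (1.0 - coeff) * b for a, b in zip(emb_a, emb_b)]


# Toy 2-dimensional embeddings standing in for two category concepts.
mixed = blend_embeddings([1.0, 0.0], [0.0, 1.0], 0.25)
```

With `coeff = 0.25` the result leans toward the second concept. A learned policy would pick this value per input rather than fixing it.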

Title: MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Authors: Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19311
Pdf URL: https://arxiv.org/pdf/2512.19311
Copy Paste: [[2512.19311]] MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture(https://arxiv.org/abs/2512.19311)
Keywords: generation
Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in order to improve diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to the RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
摘要：本文研究了训练-测试差异（又名暴露偏差）问题，以改进扩散模型。在训练期间，预测网络在一个训练时间步的输入是相应的真实噪声数据，它是噪声和数据的插值，而在测试期间，输入是生成的噪声数据。我们提出了一种新颖的训练方法，称为 MixFlow，用于提高性能。我们的方法受到慢流现象的启发：观察到在给定采样时间步长最接近生成的噪声数据的地面实况插值对应于较高噪声的时间步长（称为慢速时间步长），即相应的地面实况时间步长慢于采样时间步长。 MixFlow 利用减慢时间步长的插值（称为减慢插值混合）来对每个训练时间步长的预测网络进行后训练。类条件图像生成（包括 SiT、REPA 和 RAE）和文本到图像生成的实验验证了我们方法的有效性。我们的 RAE 模型上的 MixFlow 方法在 ImageNet 上取得了强劲的生成结果：在 256 x 256 下为 1.43 FID（无指导）和 1.10（有指导），在 512 x 512 下为 1.55 FID（无指导）和 1.10（有指导）。

Title: GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis

Authors: Siyuan Mei, Yan Xia, Fuxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19336
Pdf URL: https://arxiv.org/pdf/2512.19336
Copy Paste: [[2512.19336]] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis(https://arxiv.org/abs/2512.19336)
Keywords: generative
Abstract: The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss, together with a combination of Dice loss and cross-entropy for the multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.
摘要：磁共振成像 (MRI) 和锥束 CT (CBCT) 综合计算断层扫描 (CT) 通过在自适应放射治疗中实现准确的解剖学表征，在临床治疗计划中发挥着关键作用。在这项工作中，我们提出了 GANeXt，这是一种基于 3D 补丁、完全由 ConvNeXt 驱动的生成对抗网络，用于跨不同模式和解剖区域的统一 CT 合成。具体来说，GANeXt 采用由具有紧凑卷积核的堆叠 3D ConvNeXt 块构成的高效 U 形生成器，而鉴别器则采用条件 PatchGAN。为了提高合成质量，我们结合了损失函数的组合，包括平均绝对误差（MAE）、感知损失、基于分割的屏蔽 MAE 和对抗性损失，以及用于多头分割鉴别器的 Dice 损失和交叉熵的组合。对于这两个任务，训练都是使用两个单独的 AdamW 优化器（用于生成器和判别器）以 8 的批量大小进行训练，每个优化器都配备了预热和余弦衰减调度器，学习率分别为 $5\times10^{-4}$ 和 $1\times10^{-3}$。数据预处理包括可变形配准、前景裁剪、输入模态的百分位标准化以及 CT 线性标准化至 $[-1024, 1000]$ 范围。数据增强涉及 $(0.8, 1.3)$ 范围内的随机缩放（仅适用于 MRI-to-CT）、固定尺寸裁剪至 MRI-to-CT 的 $32\times160\times192$ 和 CBCT-to-CT 的 $32\times128\times128$ 以及随机翻转。在推理过程中，我们应用具有 $0.8$ 重叠和平均折叠的滑动窗口方法来重建全尺寸 sCT，然后对 CT 归一化进行反演。在没有任何微调的情况下对所有区域进行联合训练后，使用完整的训练数据集在 MRI-to-CT 的 3000 个 epoch 和 CBCT-to-CT 的 1000 个 epoch 结束时选择最终模型。
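
The GANeXt abstract describes inference with a sliding window at $0.8$ overlap and average folding. The one-dimensional sketch below illustrates that idea under stated assumptions: the stride is taken as `round(patch * (1 - overlap))`, a final window is clamped to the end of the signal, and "average folding" is read as averaging overlapping predictions. The helper names and the identity `predict` function are hypothetical, and the real pipeline operates on 3D patches.

```python
def window_starts(length, patch, overlap=0.8):
    """Start indices of overlapping windows that cover [0, length)."""
    if length <= patch:
        return [0]
    stride = max(1, round(patch * (1.0 - overlap)))
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # clamp the last window to the end
    return starts


def sliding_window_average(signal, patch, overlap, predict):
    """Run predict on each window and average where windows overlap."""
    n = len(signal)
    total = [0.0] * n
    count = [0] * n
    for s in window_starts(n, patch, overlap):
        out = predict(signal[s:s + patch])
        for i, v in enumerate(out):
            total[s + i] += v
            count[s + i] += 1
    return [t / c for t, c in zip(total, count)]


# An identity "model" should reproduce the input after folding.
restored = sliding_window_average(list(range(100)), 32, 0.8, lambda p: p)
```

A high overlap such as 0.8 means most positions are predicted several times, which costs extra compute but smooths seams between patches.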

Title: Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation

Authors: Isshaan Singh, Divyansh Chawla, Anshu Garg, Shivin Mangal, Pallavi Gupta, Khushi Agarwal, Nimrat Singh Khalsa, Nandan Patel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19361
Pdf URL: https://arxiv.org/pdf/2512.19361
Copy Paste: [[2512.19361]] Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation(https://arxiv.org/abs/2512.19361)
Keywords: generation
Abstract: The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agent's decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM and RNN based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.
摘要：在现代物联网驱动的食品供应链中，对智能实时腐败预测系统的需求变得至关重要，因为易腐烂的商品极易受到环境条件的影响。现有方法往往缺乏对动态条件的适应性，无法实时优化决策。为了应对这些挑战，我们提出了一种混合强化学习框架，集成了长短期记忆（LSTM）和循环神经网络（RNN），以增强腐败预测。这种混合架构捕获传感器数据内的时间依赖性，从而实现稳健和自适应的决策。与可解释的人工智能原理保持一致，采用基于规则的分类器环境，根据特定领域的阈值提供腐败程度的透明地面实况标签。这种结构化设计允许代理在明确定义的语义边界内运行，支持可跟踪和可解释的决策。使用可解释性驱动的指标来监控模型行为，包括破坏准确性、奖励步数比、损失减少率和探索衰减。这些指标提供了定量的绩效评估和对学习动态的洞察。按类别的损坏分布可视化用于分析代理的决策概况和策略行为。对模拟和实时硬件数据的广泛评估表明，基于 LSTM 和 RNN 的代理在预测准确性和决策效率方面优于替代强化学习方法，同时保持可解释性。结果凸显了混合深度强化学习与集成可解释性对于可扩展的基于物联网的食品监测系统的潜力。

Title: dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Authors: Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19433
Pdf URL: https://arxiv.org/pdf/2512.19433
Copy Paste: [[2512.19433]] dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models(https://arxiv.org/abs/2512.19433)
Keywords: generation, generative
Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for an external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (Lumina-DiMOO, MMaDA, and Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search.
摘要：扩散多模态大语言模型（dMLLM）最近作为一种统一图像生成和理解的新颖架构而出现。然而，开发有效且高效的测试时间缩放（TTS）方法来释放其全部生成潜力仍然是一个尚未充分探索的挑战。为了解决这个问题，我们提出了 dMLLM-TTS，这是一种在两个互补的缩放轴上运行的新颖框架：（1）轨迹探索缩放以增强生成假设的多样性，以及（2）迭代细化缩放以实现稳定生成。传统的 TTS 方法通常在这两个维度上执行线性搜索，从而产生 O(NT) 的大量计算成本，并且需要外部验证器进行 N 中最佳选择。为了克服这些限制，我们提出了两项创新。首先，我们设计了一种复杂度为 O(N+T) 的高效分层搜索算法，可以自适应扩展和修剪采样轨迹。其次，我们引入了一种自我验证的反馈机制，该机制利用 dMLLM 的固有图像理解能力来评估文本图像对齐，从而无需外部验证器。对三个代表性 dMLLM（例如 Lumina-DiMOO、MMaDA、Muddit）的 GenEval 基准进行的广泛实验表明，我们的框架显着提高了生成质量，同时实现了比线性搜索高出 6 倍的效率。
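
The dMLLM-TTS abstract contrasts an O(NT) linear search with an O(N+T) hierarchical search but does not spell out the algorithm. The sketch below is therefore not the paper's method. It only shows one way such a bound can arise, assuming the number of live trajectories is halved after every refinement step (a successive-halving schedule chosen purely for illustration).

```python
def linear_search_cost(n, t):
    """Model calls when all n trajectories run for all t steps."""
    return n * t


def halving_search_cost(n, t):
    """Model calls when half of the live trajectories are pruned each step.

    The geometric series n + n/2 + n/4 + ... stays below 2n, and at most t
    further calls are spent once a single trajectory remains, so the total
    is bounded by 2n + t.
    """
    live, cost = n, 0
    for _ in range(t):
        cost += live
        live = max(1, live // 2)
    return cost


# Hypothetical budget: 16 candidate trajectories, 10 refinement steps.
linear = linear_search_cost(16, 10)
pruned = halving_search_cost(16, 10)
```

Under this toy schedule the pruned search uses 36 model calls against 160 for the linear search. The actual savings reported in the paper depend on its own expansion and pruning rules.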

Title: Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation

Authors: Guoli Jia, Junyao Hu, Xinwei Long, Kai Tian, Kaiyan Zhang, KaiKai Zhao, Ning Ding, Bowen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19479
Pdf URL: https://arxiv.org/pdf/2512.19479
Copy Paste: [[2512.19479]] Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation(https://arxiv.org/abs/2512.19479)
Keywords: generation
Abstract: Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.
摘要：基于扩散模型的图像生成已展现出令人印象深刻的能力，激发了对多样化和专业应用的探索。由于情感在广告中的重要性，以情感为导向的图像生成越来越受到人们的关注。然而，当前的面向情感的方法存在情感捷径，即情感近似于语义。二十年的研究证明，情感并不等同于语义。为此，我们提出了 Emotion-Director，一个由两个模块组成的跨模式协作框架。首先，我们提出了一种跨模态协作扩散模型，缩写为 MC-Diffusion。 MC-Diffusion 将视觉提示与文本提示相结合以进行指导，从而能够生成超越语义的以情感为导向的图像。此外，我们通过负面视觉提示改进了 DPO 优化，增强了模型对相同语义下不同情绪的敏感度。其次，我们提出了 MC-Agent，一种跨模态协作代理系统，可以重写文本提示来表达预期的情绪。为了避免类似模板的重写，MC-Agent采用多代理来模拟人类对情感的主观性，并采用概念链工作流程来提高重写提示的视觉表现力。大量的定性和定量实验证明了Emotion-Director在面向情感的图像生成方面的优越性。

Title: DK-STN: A Domain Knowledge Embedded Spatio-Temporal Network Model for MJO Forecast

Authors: Hongliang Li, Nong Zhang, Zhewen Xu, Xiang Li, Changzheng Liu, Chongbo Zhao, Jie Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19506
Pdf URL: https://arxiv.org/pdf/2512.19506
Copy Paste: [[2512.19506]] DK-STN: A Domain Knowledge Embedded Spatio-Temporal Network Model for MJO Forecast(https://arxiv.org/abs/2512.19506)
Keywords: generation
Abstract: Understanding and predicting the Madden-Julian Oscillation (MJO) is fundamental for precipitation forecasting and disaster prevention. To date, long-term and accurate MJO prediction has remained a challenge for researchers. Conventional MJO prediction methods using Numerical Weather Prediction (NWP) are resource-intensive, time-consuming, and highly unstable (most NWP methods are sensitive to seasons, with better MJO forecast results in winter). While existing Artificial Neural Network (ANN) methods save resources and speed up forecasting, their accuracy never reaches the 28 days predicted by the state-of-the-art NWP method, i.e., the operational forecasts from ECMWF, since neural networks cannot handle climate data effectively. In this paper, we present a Domain Knowledge Embedded Spatio-Temporal Network (DK-STN), a stable neural network model for accurate and efficient MJO forecasting. It combines the benefits of NWP and ANN methods and successfully improves the forecast accuracy of ANN methods while maintaining a high level of efficiency and stability. We begin with a spatial-temporal network (STN) and embed domain knowledge in it using two key methods: (i) applying a domain knowledge enhancement method and (ii) integrating a domain knowledge processing method into network training. We evaluated DK-STN with the 5th generation of ECMWF reanalysis (ERA5) data and compared it with ECMWF. Given 7 days of climate data as input, DK-STN can generate reliable forecasts for the following 28 days in 1-2 seconds, with an error of only 2-3 days in different seasons. DK-STN matches ECMWF in forecast accuracy while being significantly more efficient and stable.
摘要：了解和预测马登-朱利安振荡 (MJO) 对于降水预报和灾害预防至关重要。迄今为止，长期准确的 MJO 预测仍然是研究人员面临的挑战。传统的数值天气预报（NWP）MJO预报方法资源密集、耗时长、不稳定（大多数NWP方法对季节敏感，MJO预报结果在冬季较好）。虽然现有的人工神经网络（ANN）方法节省了资源并加快了预测速度，但由于神经网络无法有效处理气候数据，其准确性永远无法达到最先进的 NWP 方法（即 ECMWF 的业务预测）预测的 28 天。在本文中，我们提出了一种领域知识嵌入时空网络（DK-STN），这是一种用于准确高效 MJO 预测的稳定神经网络模型。它结合了 NWP 和 ANN 方法的优点，成功提高了 ANN 方法的预测精度，同时保持了高水平的效率和稳定性。我们从时空网络（STN）开始，并使用两种关键方法将领域知识嵌入其中：（i）应用领域知识增强方法和（ii）将领域知识处理方法集成到网络训练中。我们用第五代ECMWF再分析（ERA5）数据评估了DK-STN，并与ECMWF进行了比较。输入 7 天的气候数据，DK-STN 可以在 1-2 秒内生成接下来 28 天的可靠预报，不同季节的误差仅为 2-3 天。 DK-STN 显着优于 ECMWF，预测精度与 ECMWF 相当，而效率和稳定性明显优于 ECMWF。

Title: StoryMem: Multi-shot Long Video Storytelling with Memory

Authors: Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19539
Pdf URL: https://arxiv.org/pdf/2512.19539
Copy Paste: [[2512.19539]] StoryMem: Multi-shot Long Video Storytelling with Memory(https://arxiv.org/abs/2512.19539)
Keywords: generation
Abstract: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
摘要：视觉叙事需要生成具有电影质量和远程一致性的多镜头视频。受人类记忆的启发，我们提出了 StoryMem，这是一种将长视频叙事重新表述为以显式视觉记忆为条件的迭代镜头合成的范例，将预先训练的单镜头视频扩散模型转变为多镜头叙事者。这是通过新颖的内存到视频 (M2V) 设计实现的，该设计维护了历史生成镜头中关键帧的紧凑且动态更新的内存库。然后，仅通过 LoRA 微调，通过潜在串联和负 RoPE 移位将存储的内存注入单次视频扩散模型。语义关键帧选择策略与审美偏好过滤一起，进一步确保了整个世代的信息丰富且稳定的记忆。此外，所提出的框架自然地适应平滑的镜头过渡和定制的故事生成应用程序。为了便于评估，我们引入了 ST-Bench，这是一个用于多镜头视频叙事的多样化基准。大量实验表明，StoryMem 比以前的方法实现了卓越的交叉镜头一致性，同时保持了高美感质量和及时的一致性，标志着向连贯的一分钟长视频叙事迈出了重要一步。

Title: ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

Authors: Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, Zhengguang Zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, Qinglin Lu, Jun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19546
Pdf URL: https://arxiv.org/pdf/2512.19546
Copy Paste: [[2512.19546]] ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars(https://arxiv.org/abs/2512.19546)
Keywords: generation
Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
摘要：尽管在会说话的化身生成方面取得了重大进展，但现有方法面临着严峻的挑战：对于不同动作的文本跟随能力不足，动作和音频内容之间缺乏时间对齐，以及对姿势骨架等额外控制信号的依赖。我们提出了 ActAvatar，这是一个框架，通过捕获动作语义和时间上下文，通过文本指导实现动作控制的阶段级精度。我们的方法引入了三个核心创新：（1）阶段感知交叉注意（PACA），它将提示分解为全局基本块和时间锚定阶段块，使模型能够专注于阶段相关标记以实现精确的时间语义对齐；（2）渐进式视听对齐，将模态影响与分层特征学习过程相结合——早期层优先考虑文本以建立动作结构，而较深层则强调音频以细化嘴唇运动，防止模态干扰；（3）两阶段训练策略，首先在不同的数据上建立强大的视听对应，然后通过对结构化注释的微调注入动作控制，保持视听对齐和模型的文本跟踪能力。大量实验表明，ActAvatar 在动作控制和视觉质量方面均明显优于最先进的方法。

Title: BabyFlow: 3D modeling of realistic and expressive infant faces

Authors: Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19560
Pdf URL: https://arxiv.org/pdf/2512.19560
Copy Paste: [[2512.19560]] BabyFlow: 3D modeling of realistic and expressive infant faces(https://arxiv.org/abs/2512.19560)
Keywords: generative
Abstract: Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.
摘要：通过分析婴儿颅面形态可以帮助早期发现发育障碍，但由于数据有限和频繁的自发表情，对婴儿面部进行建模具有挑战性。我们推出 BabyFlow，这是一种生成式人工智能模型，可以分解面部身份和表情，从而实现对两者的独立控制。通过使用归一化流，BabyFlow 可以学习灵活的概率表示，无需限制性线性假设即可捕捉富有表现力的婴儿面部复杂的非线性变化。为了解决表达数据稀缺和不受控制的问题，我们进行跨年龄表达迁移，调整成人 3D 扫描的表达，以通过真实且系统的表达变体来丰富婴儿数据集。因此，BabyFlow 提高了 3D 重建精度，特别是在嘴、眼、鼻等高表达区域，并支持婴儿表情的合成和修改，同时保留身份。此外，通过与扩散模型集成，BabyFlow 生成具有一致 3D 几何形状的高保真 2D 婴儿图像，为数据增强和早期面部分析提供了强大的工具。

Title: MapTrace: Scalable Data Generation for Route Tracing on Maps

Authors: Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19609
Pdf URL: https://arxiv.org/pdf/2512.19609
Copy Paste: [[2512.19609]] MapTrace: Scalable Data Generation for Route Tracing on Maps(https://arxiv.org/abs/2512.19609)
Keywords: generation
Abstract: While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
摘要：虽然多模态大语言模型在许多视觉和文本推理任务上实现了类似人类的性能，但它们在细粒度空间理解（例如地图上的路线追踪）方面的熟练程度仍然有限。与可以快速学习解析和导航地图的人类不同，当前的模型通常无法遵守基本的路径约束，部分原因是收集大规模、像素精确的路径注释的成本高昂且困难。为了解决这个问题，我们引入了一个可扩展的合成数据生成管道，它利用合成地图图像和像素级解析来自动为这项具有挑战性的任务生成精确的注释。使用此管道，我们构建了跨 4k 地图的 23k 路径样本的微调数据集，使模型能够获得更多类似于人类的空间能力。使用此数据集，我们对开源和专有 MLLM 进行微调。 MapBench 上的结果表明，微调显着提高了鲁棒性，将成功率提高了 6.4 个百分点，同时还减少了路径跟踪误差 (NDTW)。这些成果凸显了预训练模型中不存在的细粒度空间推理可以通过综合监督进行明确的教授。

Title: Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment

Authors: Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19632
Pdf URL: https://arxiv.org/pdf/2512.19632
Copy Paste: [[2512.19632]] Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment(https://arxiv.org/abs/2512.19632)
Keywords: generation, generative
Abstract: The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor-intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert-preference-aligned fine-tuning. First, a Stable Diffusion model is fine-tuned on captioned indoor and outdoor plant imagery to generate realistic, text-conditioned images of canola and soybean. Evaluation using Inception Score, Fréchet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high-resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image-guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference-guided fine-tuning framework trains a reward model on expert scores and applies reward-weighted updates to produce more stable and expert-aligned outputs. Together, these components demonstrate a practical pathway toward data-efficient generative pipelines for agricultural AI.
摘要：农业人工智能的成功在很大程度上取决于大型、多样化和高质量的植物图像数据集，但在真实田间条件下收集此类数据成本高昂、劳动力密集且受季节限制。本文研究了基于扩散的生成模型，通过植物图像合成、室内到室外转换和专家偏好对齐微调来应对这些挑战。首先，稳定扩散模型对带标题的室内和室外植物图像进行微调，以生成油菜和大豆的逼真的文本调节图像。使用初始分数、Frechet 初始距离和下游表型分类进行的评估表明，合成图像有效地增强了训练数据并提高了准确性。其次，我们使用基于 DreamBooth 的文本反演和图像引导扩散弥合了高分辨率室内数据集和有限的室外图像之间的差距，生成翻译后的图像，以增强 YOLOv8 的杂草检测和分类。最后，偏好引导的微调框架根据专家分数训练奖励模型，并应用奖励加权更新以产生更稳定且与专家一致的输出。这些组件共同展示了一条通往农业人工智能数据高效生成管道的实用途径。

Title: Over++: Generative Video Compositing for Layer Interaction Effects

Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19661
Pdf URL: https://arxiv.org/pdf/2512.19661
Copy Paste: [[2512.19661]] Over++: Generative Video Compositing for Layer Interaction Effects(https://arxiv.org/abs/2512.19661)
Keywords: generation, generative
Abstract: In professional video compositing workflows, artists must manually create environmental interactions, such as shadows, reflections, dust, and splashes, between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
摘要：在专业视频合成工作流程中，艺术家必须在前景主体和背景图层之间手动创建环境交互，例如阴影、反射、灰尘和飞溅。现有的视频生成模型在添加此类效果的同时很难保留输入视频，而当前的视频修复方法要么需要昂贵的每帧掩模，要么会产生令人难以置信的结果。我们引入了增强合成，这是一项新任务，可以根据文本提示和输入视频层合成逼真的半透明环境效果，同时保留原始场景。为了解决这个任务，我们提出了 Over++，一个视频效果生成框架，它不对相机姿势、场景平稳性或深度监督做出任何假设。我们构建了一个专为该任务定制的配对效果数据集，并引入了一种不配对的增强策略，以保留文本驱动的可编辑性。我们的方法还支持可选的蒙版控制和关键帧指导，而不需要密集的注释。尽管训练数据有限，但 Over++ 仍能产生多样化且真实的环境效果，并且在效果生成和场景保存方面均优于现有基线。

Title: Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Authors: Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L.J Qiu, Pretesh Patel, Ashesh B. Jania, Hui Mao, Zhen Tian, Xiaofeng Yang
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2512.19676
Pdf URL: https://arxiv.org/pdf/2512.19676
Copy Paste: [[2512.19676]] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning(https://arxiv.org/abs/2512.19676)
Keywords: super-resolution
Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
摘要：背景：高分辨率 MRI 对于诊断至关重要，但较长的采集时间限制了临床应用。超分辨率（SR）可以增强扫描后的分辨率，但现有的深度学习方法面临保真度与效率的权衡。目的：为 MRI SR 开发计算高效且准确的深度学习框架，保留解剖细节以进行临床整合。材料和方法：我们提出了一种新颖的 SR 框架，将多头选择性状态空间模型 (MHSSM) 与轻量级通道 MLP 相结合。该模型使用 2D 补丁提取和混合扫描来捕获远程依赖性。每个 MambaFormer 模块都集成了 MHSSM、深度卷积和门控通道混合。评估使用 7T 脑 T1 MP2RAGE 图 (n=142) 和 1.5T 前列腺 T2w MRI (n=334)。比较包括双三次插值、GAN（CycleGAN、Pix2pix、SPSR）、Transformer（SwinIR）、Mamba（MambaIR）和扩散模型（I2SB、Res-SRDiff）。结果：我们的模型以卓越的效率实现了卓越的性能。对于7T大脑数据：SSIM=0.951±0.021，PSNR=26.90±1.41 dB，LPIPS=0.076±0.022，GMSD=0.083±0.017，显着优于所有基线（p<0.001）。对于前列腺数据：SSIM=0.770±0.049，PSNR=27.15±2.19 dB，LPIPS=0.190±0.095，GMSD=0.087±0.013。该框架仅使用了 0.9M 参数和 57 GFLOP，与 Res-SRDiff 相比，参数减少了 99.8%，计算量减少了 97.5%，同时在准确性和效率上优于 SwinIR 和 MambaIR。结论：所提出的框架提供了一种高效、准确的 MRI SR 解决方案，可跨数据集提供增强的解剖细节。其低计算需求和最先进的性能显示出临床转化的强大潜力。

Title: WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Authors: Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19678
Pdf URL: https://arxiv.org/pdf/2512.19678
Copy Paste: [[2512.19678]] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion(https://arxiv.org/abs/2512.19678)
Keywords: generation, generative
Abstract: Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: this https URL.
摘要：生成长距离、几何一致的视频提出了一个基本的困境：虽然一致性要求严格遵守像素空间中的 3D 几何，但最先进的生成模型在相机调节的潜在空间中运行最有效。这种脱节导致当前的方法难以应对遮挡区域和复杂的相机轨迹。为了弥补这一差距，我们提出了 WorldWarp，这是一个将 3D 结构锚与 2D 生成细化器结合起来的框架。为了建立几何基础，WorldWarp 维护了一个通过高斯溅射 (3DGS) 构建的在线 3D 几何缓存。通过明确地将历史内容扭曲成新的视图，该缓存充当结构脚手架，确保每个新框架尊重先前的几何形状。然而，静态扭曲不可避免地会因遮挡而留下孔洞和伪影。我们使用专为“填充和修改”目标而设计的时空扩散（ST-Diff）模型来解决这个问题。我们的关键创新是时空变化的噪声计划：空白区域接收全部噪声以触发生成，而扭曲区域接收部分噪声以实现细化。通过在每一步动态更新 3D 缓存，WorldWarp 可以保持视频块之间的一致性。因此，它通过确保 3D 逻辑引导结构而扩散逻辑完善纹理来实现最先进的保真度。项目页面：此 https URL。

Title: VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Authors: Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19680
Pdf URL: https://arxiv.org/pdf/2512.19680
Copy Paste: [[2512.19680]] VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation(https://arxiv.org/abs/2512.19680)
Keywords: generation
Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$\pi$ enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at this https URL.
摘要：自回归 (AR) 视觉生成依赖于分词器将图像映射到离散序列或从离散序列映射图像。然而，标记生成器经过训练，可以根据真实标记重建干净的图像，而 AR 生成器仅针对标记可能性进行优化。这种未对准导致生成的令牌序列可以解码成低质量图像，而无需来自像素空间的直接监督。我们提出了 VA-$\pi$，这是一个轻量级的后训练框架，可以通过原则性的像素空间目标直接优化 AR 模型。 VA-$\pi$ 将生成器-分词器对齐公式化为变分优化，导出统一像素重建和自回归建模的证据下界 (ELBO)。为了在离散令牌空间下进行优化，VA-$\pi$ 引入了基于强化的对齐策略，将 AR 生成器视为策略，使用像素空间重建质量作为其内在奖励。奖励是通过预测的标记序列在教师强制下重建原始图像的能力来衡量的，从而为模型提供直接的像素级指导，而无需昂贵的自由运行采样。 ELBO 的正则化项充当自然正则化器，保持令牌的分布一致性。 VA-$\pi$ 能够快速适应现有的 AR 生成器，无需标记器重新训练，也无需外部奖励模型。只需 1% ImageNet-1K 数据和 25 分钟的调整，它就可以将 LlamaGen-XXL 上的 FID 从 14.36 降低到 7.65，将 IS 从 86.55 提高到 116.70，同时在 GenEval 上的视觉生成模型（LlamaGen：从 0.306 到 0.339）和统一多模态模型的文本到图像任务中也取得了显着的成果（Janus-Pro：从 0.725 到 0.744）。代码可从此 https URL 获取。

Title: From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19683
Pdf URL: https://arxiv.org/pdf/2512.19683
Copy Paste: [[2512.19683]] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs(https://arxiv.org/abs/2512.19683)
Keywords: generation
Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
摘要：虽然多模态大型语言模型（MLLM）在语义任务上取得了令人印象深刻的性能，但它们的空间智能（对于强大且基础的人工智能系统至关重要）仍然不发达。现有的基准无法诊断这一限制：它们要么专注于过于简化的定性推理，要么依赖特定领域的室内数据，而受到缺乏具有可验证度量标准事实的室外数据集的限制。为了弥补这一差距，我们引入了一个大规模基准测试，该基准测试是根据同步立体摄像机、LiDAR 和 IMU/GPS 传感器捕获的行人视角视频构建的。该数据集提供了度量精确的 3D 信息，能够自动生成跨越层次结构的空间推理问题 - 从定性关系推理到定量度量和运动学理解。评估表明，在结构化室内基准测试中观察到的性能提升在开放世界环境中消失了。使用合成异常场景和致盲测试的进一步分析证实，当前的 MLLM 严重依赖于语言先验，而不是基于视觉推理。因此，我们的基准提供了一个原则平台，用于诊断这些限制并推进基于物理的空间智能。

Title: Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Authors: Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19686
Pdf URL: https://arxiv.org/pdf/2512.19686
Copy Paste: [[2512.19686]] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models(https://arxiv.org/abs/2512.19686)
Keywords: generation
Abstract: Recently, the introduction of Chain-of-Thought (CoT) has substantially improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on text consistency with the text prompt, ignoring visual context consistency with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in a failure to maintain key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be preserved, and 2) Iterative Visual Correction: performing self-reflection under the guidance of the checklists and refining the generated result iteratively. To achieve this, we use supervised fine-tuning to teach the model how to plan the visual checking and conduct self-reflection and self-refinement, and use Flow-GRPO to further enhance visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.
摘要：最近，思想链（CoT）的引入极大地提高了统一模型的生成能力。然而，据观察，当前生成过程中的思维过程主要关注文本与文本提示的一致性，而忽略了多模态生成（例如多参考生成）期间与视觉参考图像的视觉上下文一致性。缺乏这种一致性会导致无法维护关键视觉特征（如人物 ID、对象属性、风格）。为此，我们将视觉上下文一致性融入到统一模型的推理中，通过以下方式明确激励模型维持这种一致性：1）自适应视觉规划：生成结构化视觉检查列表以找出所需保持一致性的视觉元素；2）迭代视觉校正：在检查列表的指导下进行自我反思，并以迭代方式细化生成的结果。为了实现这一目标，我们使用监督微调来教导模型如何规划视觉检查、进行自我反思和自我细化，并使用 flow-GRPO 通过定制的视觉检查奖励进一步增强视觉一致性。实验表明，我们的方法在多模态生成中优于零样本统一模型和具有文本 CoT 的模型，表现出更高的视觉上下文一致性。

Title: Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Authors: Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, Rolandos Alexandros Potamias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19692
Pdf URL: https://arxiv.org/pdf/2512.19692
Copy Paste: [[2512.19692]] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models(https://arxiv.org/abs/2512.19692)
Keywords: generation
Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.
摘要：生成真实的人与人交互是一项具有挑战性的任务，不仅需要高质量的个人身体和手部动作，还需要所有交互者之间的连贯协调。由于可用数据的限制和学习复杂性的增加，以前的方法往往忽略手部动作，限制了交互的真实性和表现力。此外，当前基于扩散的方法同时生成整个运动序列，限制了它们捕捉人类交互的反应性和适应性本质的能力。为了解决这些限制，我们引入了 Interact2Ar，这是第一个用于生成全身、人与人交互的端到端文本条件自回归扩散模型。 Interact2Ar 通过专用并行分支整合详细的手部运动学，从而实现高保真全身生成。此外，我们引入了一种自回归管道与一种新颖的记忆技术相结合，该技术有助于使用高效的大上下文窗口来适应人类交互的固有可变性。我们模型的适应性使得一系列下游应用成为可能，包括时间运动合成、对干扰的实时适应以及从二元场景扩展到多人场景。为了验证生成的运动，我们引入了一组专门用于评估全身交互的强大评估器和扩展指标。通过定量和定性实验，我们展示了 Interact2Ar 最先进的性能。