2025-12-30

Title: Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

Authors: Hans Jarett J. Ong, Brian Godwin S. Lim, Dominic Dayta, Renzo Roel P. Tan, Kazushi Ikeda
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.22150
Pdf URL: https://arxiv.org/pdf/2512.22150
Copy Paste: [[2512.22150]] Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders(https://arxiv.org/abs/2512.22150)
Keywords: generative
Abstract: Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.
摘要：无监督表示学习旨在恢复潜在的生成因素，但依赖于统计独立性的标准方法通常无法捕获因果依赖性。一个核心挑战是可识别性：正如解开表示学习和非线性 ICA 文献中所确立的那样，如果没有监督、辅助信号或强归纳偏差，就不可能从观测数据中解开因果变量。在这项工作中，我们提出了潜在加性噪声模型因果自动编码器（LANCA），将加性噪声模型（ANM）操作为无监督发现的强归纳偏差。从理论上讲，我们证明虽然 ANM 约束不能保证一般混合情况下的唯一可识别性，但它通过限制从任意微分同胚到仿射类的可接受变换来解决组件方面的不确定性。从方法上来说，LANCA 认为 VAE 固有的随机编码掩盖了潜在因果发现所需的结构残差，因此采用了确定性 Wasserstein 自动编码器 (WAE) 与可微分的 ANM 层相结合。该架构将剩余独立性从被动假设转变为明确的优化目标。根据经验，LANCA 在合成物理基准（Pendulum、Flow）和真实感环境（CANDLE）上的表现优于最先进的基线，它对复杂背景场景产生的虚假相关性表现出卓越的鲁棒性。

Title: SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

Authors: Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Yixue Hao, Yuan Zhou, Qinglin Lu, Long Hu, Junchi Yan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22170
Pdf URL: https://arxiv.org/pdf/2512.22170
Copy Paste: [[2512.22170]] SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models(https://arxiv.org/abs/2512.22170)
Keywords: generation
Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RMs are susceptible to reward hacking during post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
摘要：视频生成模型与人类偏好的训练后对齐是一个关键目标。为此过程开发有效的奖励模型 (RM) 面临着重大的方法障碍。当前的数据收集范例依赖于即时的成对注释，因此受到标签噪声的影响。同时，基于 VLM 的 RM 的架构设计，特别是其输出机制，仍未得到充分探索。此外，RM 在训练后很容易受到奖励黑客攻击。为了缓解这些限制，我们提出了 SoliReward，一个用于视频 RM 训练的系统框架。我们的框架首先通过单项二进制注释获取高质量、经济高效的数据，然后使用交叉提示配对策略构建偏好对。在架构上，我们采用分层渐进查询注意机制来增强特征聚合。最后，我们引入了修改后的 BT 损失，明确适应平局获胜的情况。这种方法规范了正样本的 RM 分数分布，提供更细致的偏好信号，以减轻对少数得分最高样本的过度关注。我们的方法在评估物理合理性、受试者畸形和语义对齐的基准上得到了验证，证明了直接 RM 评估指标和视频生成模型后训练有效性的改进。代码和基准将公开。
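
The abstract does not spell out the modified BT loss. As a rough illustration only, the sketch below reads "BT" as Bradley-Terry (an assumption) and handles ties with a soft target of 0.5, which is one common generic choice and not necessarily the SoliReward formulation. The function name `bt_loss` and its `target` parameter are hypothetical.

```python
import math

def bt_loss(score_a: float, score_b: float, target: float = 1.0) -> float:
    """Cross-entropy between `target` = desired P(a beats b) and
    sigmoid(score_a - score_b), the Bradley-Terry win probability.

    target=1.0 is the usual win/lose case; target=0.5 is a hypothetical
    way to encode a tie (assumption, not taken from the paper).
    """
    diff = score_a - score_b
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        log_p = -math.log1p(math.exp(-diff))
    else:
        log_p = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(-diff)) = log_p - diff.
    log_q = log_p - diff
    return -(target * log_p + (1.0 - target) * log_q)

win_loss = bt_loss(2.0, 0.0, target=1.0)    # a is preferred and scored higher
tie_equal = bt_loss(1.0, 1.0, target=0.5)   # tie, equal scores
tie_apart = bt_loss(2.0, 0.0, target=0.5)   # tie, but scores far apart
```

Under this soft-target reading, a tie is penalized least when the two scores coincide, so tied positives are pulled toward similar scores instead of being forced into a strict ranking.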

Title: Wireless Traffic Prediction with Large Language Model

Authors: Chuanting Zhang, Haixia Zhang, Jingping Qiao, Zongzhang Li, Mohamed-Slim Alouini
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22178
Pdf URL: https://arxiv.org/pdf/2512.22178
Copy Paste: [[2512.22178]] Wireless Traffic Prediction with Large Language Model(https://arxiv.org/abs/2512.22178)
Keywords: generation
Abstract: The growing demand for intelligent, adaptive resource management in next-generation wireless networks has underscored the importance of accurate and scalable wireless traffic prediction. While recent advancements in deep learning and foundation models such as large language models (LLMs) have demonstrated promising forecasting capabilities, they largely overlook the spatial dependencies inherent in city-scale traffic dynamics. In this paper, we propose TIDES (Traffic Intelligence with DeepSeek-Enhanced Spatial-temporal prediction), a novel LLM-based framework that captures spatial-temporal correlations for urban wireless traffic prediction. TIDES first identifies heterogeneous traffic patterns across regions through a clustering mechanism and trains personalized models for each region to balance generalization and specialization. To bridge the domain gap between numerical traffic data and language-based models, we introduce a prompt engineering scheme that embeds statistical traffic features as structured inputs. Furthermore, we design a DeepSeek module that enables spatial alignment via cross-domain attention, allowing the LLM to leverage information from spatially related regions. By fine-tuning only lightweight components while freezing core LLM layers, TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Extensive experiments on real-world cellular traffic datasets demonstrate that TIDES significantly outperforms state-of-the-art baselines in both prediction accuracy and robustness. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.
摘要：下一代无线网络对智能、自适应资源管理日益增长的需求凸显了准确且可扩展的无线流量预测的重要性。虽然深度学习和大型语言模型 (LLM) 等基础模型的最新进展已经证明了有前景的预测能力，但它们在很大程度上忽视了城市规模交通动态中固有的空间依赖性。在本文中，我们提出了 TIDES（具有 DeepSeek 增强型时空预测的交通智能），这是一种基于 LLM 的新型框架，可捕获城市无线交通预测的时空相关性。 TIDES首先通过聚类机制识别跨区域的异构流量模式，并为每个区域训练个性化模型以平衡泛化和专业化。为了弥合数字流量数据和基于语言的模型之间的领域差距，我们引入了一种即时工程方案，该方案将统计流量特征嵌入为结构化输入。此外，我们设计了一个 DeepSeek 模块，该模块可以通过跨域注意力实现空间对齐，从而允许法学硕士利用来自空间相关区域的信息。通过仅微调轻量级组件，同时冻结核心 LLM 层，TIDES 实现了对特定领域模式的有效适应，而不会产生过多的训练开销。对真实世界蜂窝流量数据集的大量实验表明，TIDES 在预测准确性和鲁棒性方面均显着优于最先进的基线。我们的结果表明，将空间感知集成到基于 LLM 的预测器中是在未来 6G 系统中解锁可扩展和智能网络管理的关键。

Title: ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis

Authors: Shaghayegh Shajarian, Kennedy Marsh, James Benson, Sajad Khorsandroo, Mahmoud Abdelsalam
Subjects: cs.LG, cs.AI, cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2512.22223
Pdf URL: https://arxiv.org/pdf/2512.22223
Copy Paste: [[2512.22223]] ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis(https://arxiv.org/abs/2512.22223)
Keywords: generation
Abstract: Modern networks generate vast, heterogeneous traffic that must be continuously analyzed for security and performance. Traditional network traffic analysis systems, whether rule-based or machine learning-driven, often suffer from high false positives and lack interpretability, limiting analyst trust. In this paper, we present ReGAIN, a multi-stage framework that combines traffic summarization, retrieval-augmented generation (RAG), and Large Language Model (LLM) reasoning for transparent and accurate network traffic analysis. ReGAIN creates natural-language summaries from network traffic, embeds them into a multi-collection vector database, and utilizes a hierarchical retrieval pipeline to ground LLM responses with evidence citations. The pipeline features metadata-based filtering, MMR sampling, a two-stage cross-encoder reranking mechanism, and an abstention mechanism to reduce hallucinations and ensure grounded reasoning. Evaluated on ICMP ping flood and TCP SYN flood traces from the real-world traffic dataset, it demonstrates robust performance, achieving accuracy between 95.95% and 98.82% across different attack types and evaluation benchmarks. These results are validated against two complementary sources: dataset ground truth and human expert assessments. ReGAIN also outperforms rule-based, classical ML, and deep learning baselines while providing unique explainability through trustworthy, verifiable responses.
摘要：现代网络产生大量异构流量，必须对其安全性和性能进行持续分析。传统的网络流量分析系统，无论是基于规则的还是机器学习驱动的，往往误报率较高且缺乏可解释性，从而限制了分析师的信任。在本文中，我们提出了 ReGAIN，这是一个多阶段框架，结合了流量汇总、检索增强生成 (RAG) 和大型语言模型 (LLM) 推理，以实现透明且准确的网络流量分析。 ReGAIN 根据网络流量创建自然语言摘要，将其嵌入到多集合向量数据库中，并利用分层检索管道将 LLM 响应与证据引用结合起来。该管道具有基于元数据的过滤、MMR 采样、两级跨编码器重新排序机制以及弃权机制，以减少幻觉并确保推理有根据。根据真实流量数据集的 ICMP ping 泛洪和 TCP SYN 泛洪跟踪进行评估，它表现出强大的性能，在不同的攻击类型和评估基准上实现了 95.95% 到 98.82% 的准确度。这些结果根据两个互补来源进行了验证：数据集基本事实和人类专家评估。 ReGAIN 的性能还优于基于规则的经典 ML 和深度学习基线，同时通过值得信赖、可验证的响应提供独特的可解释性。
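
The MMR sampling step mentioned in the abstract can be illustrated with a generic Maximal Marginal Relevance selector. This is a minimal sketch of the standard technique, not ReGAIN's pipeline: the toy vectors, the `lam` trade-off parameter, and the function names are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(query, docs, k, lam=0.5):
    """Greedy MMR: pick k document indices, trading relevance to the query
    against redundancy with documents that were already picked."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.8, 0.6]]  # doc 1 nearly duplicates doc 0
diverse = mmr_select(query, docs, k=2, lam=0.3)   # favors diversity
relevant = mmr_select(query, docs, k=2, lam=1.0)  # pure relevance ranking
```

With a low `lam` the near-duplicate is skipped in favor of a less similar document; with `lam=1.0` the selector reduces to plain relevance ranking.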

Title: Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation

Authors: Bhaktipriya Radharapu, Eshika Saxena, Kenneth Li, Chenxi Whitehouse, Adina Williams, Nicola Cancedda
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22245
Pdf URL: https://arxiv.org/pdf/2512.22245
Copy Paste: [[2512.22245]] Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation(https://arxiv.org/abs/2512.22245)
Keywords: generation
Abstract: As LLM-based judges become integral to industry applications, obtaining well-calibrated uncertainty estimates efficiently has become critical for production deployment. However, existing techniques, such as verbalized confidence and multi-generation methods, are often either poorly calibrated or computationally expensive. We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges' hidden states, requiring no additional model training. We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments. Our results demonstrate that probes achieve superior calibration compared to existing methods with roughly 10x computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, probes produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates. Overall, our work demonstrates that interpretability-based uncertainty estimation provides a practical and scalable plug-and-play solution for LLM judges in production.
摘要：随着基于法学硕士的法官成为行业应用不可或缺的一部分，有效地获得经过良好校准的不确定性估计对于生产部署变得至关重要。然而，现有的技术，例如言语置信度和多代方法，通常要么校准不佳，要么计算成本昂贵。我们引入了用基于 Brier 分数的损失训练的线性探针，以根据推理法官的隐藏状态提供校准的不确定性估计，不需要额外的模型训练。我们在客观任务（推理、数学、事实性、编码）和主观人类偏好判断上评估我们的方法。我们的结果表明，与现有方法相比，探针实现了卓越的校准，节省了约 10 美元的计算量，可以稳健地推广到未见过的评估领域，并在高置信度预测中提供更高的准确性。然而，探测器产生的保守估计在更简单的数据集上表现不佳，但可能有利于优先考虑低误报率的安全关键型部署。总的来说，我们的工作表明，基于可解释性的不确定性估计为生产中的法学硕士法官提供了实用且可扩展的即插即用解决方案。
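
To make the idea of a linear probe trained with a Brier score-based loss concrete, here is a minimal sketch on toy data. The one-dimensional "hidden states", the learning rate, and the plain gradient-descent loop are illustrative assumptions; the paper's probe architecture and training setup are not described in the abstract.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, h):
    """Probe output: probability that the judge's verdict is correct."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit w, b by gradient descent on the Brier score."""
    dim, n = len(features[0]), len(labels)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for h, y in zip(features, labels):
            p = predict(w, b, h)
            # d/dz of (p - y)^2 with p = sigmoid(z) is 2 (p - y) p (1 - p).
            g = 2.0 * (p - y) * p * (1.0 - p) / n
            gw = [gwi + g * hi for gwi, hi in zip(gw, h)]
            gb += g
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy stand-in for hidden-state features (hypothetical data).
features = [[-1.5], [-0.5], [0.5], [1.5]]
labels = [0, 0, 1, 1]
w, b = train_probe(features, labels)
probs = [predict(w, b, h) for h in features]
```

Only the probe parameters `w` and `b` are updated here, which mirrors the abstract's point that the judge model itself needs no additional training.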

Title: The Physics Constraint Paradox: When Removing Explicit Constraints Improves Physics-Informed Data for Machine Learning

Authors: Rahul D Ray
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22261
Pdf URL: https://arxiv.org/pdf/2512.22261
Copy Paste: [[2512.22261]] The Physics Constraint Paradox: When Removing Explicit Constraints Improves Physics-Informed Data for Machine Learning(https://arxiv.org/abs/2512.22261)
Keywords: generation
Abstract: Physics-constrained data generation is essential for machine learning in scientific domains where real data are scarce; however, existing approaches often over-constrain models without identifying which physical components are necessary. We present a systematic ablation study of a physics-informed grating coupler spectrum generator that maps five geometric parameters to 100-point spectral responses. By selectively removing explicit energy conservation enforcement, Fabry-Perot oscillations, bandwidth variation, and noise, we uncover a physics constraint paradox: explicit energy conservation enforcement is mathematically redundant when the underlying equations are physically consistent, with constrained and unconstrained variants achieving identical conservation accuracy (mean error approximately 7 x 10^-9). In contrast, Fabry-Perot oscillations dominate threshold-based bandwidth variability, accounting for a 72 percent reduction in half-maximum bandwidth spread when removed (with bandwidth spread reduced from 132.3 nm to 37.4 nm). We further identify a subtle pitfall: standard noise-addition-plus-renormalization pipelines introduce 0.5 percent unphysical negative absorption values. The generator operates at 200 samples per second, enabling high-throughput data generation and remaining orders of magnitude faster than typical full-wave solvers reported in the literature. Finally, downstream machine learning evaluation reveals a clear physics-learnability trade-off: while central wavelength prediction remains unaffected, removing Fabry-Perot oscillations improves bandwidth prediction accuracy by 31.3 percent in R-squared and reduces RMSE by 73.8 percent. These findings provide actionable guidance for physics-informed dataset design and highlight machine learning performance as a diagnostic tool for assessing constraint relevance.
摘要：物理约束的数据生成对于真实数据稀缺的科学领域的机器学习至关重要；然而，现有的方法常常过度约束模型，而没有确定哪些物理组件是必要的。我们对基于物理的光栅耦合器光谱发生器进行了系统的烧蚀研究，该发生器将五个几何参数映射到 100 点光谱响应。通过有选择地消除显式能量守恒执行、法布里-珀罗振荡、带宽变化和噪声，我们发现了一个物理约束悖论：当底层方程物理上一致时，显式能量守恒执行在数学上是多余的，并且受约束和无约束变体实现相同的守恒精度（平均误差约为 7 x 10^-9）。相比之下，法布里-珀罗振荡主导基于阈值的带宽变化，当去除法布里-珀罗振荡时，半最大带宽扩展减少了 72%（带宽扩展从 132.3 nm 减少到 37.4 nm）。我们进一步发现了一个微妙的陷阱：标准的噪声添加加重整化管道引入了 0.5% 的非物理负吸收值。该发生器以每秒 200 个样本的速度运行，可实现高吞吐量数据生成，并且比文献中报道的典型全波求解器快几个数量级。最后，下游机器学习评估揭示了明显的物理-可学习性权衡：虽然中心波长预测不受影响，但消除法布里-珀罗振荡将 R 平方的带宽预测精度提高了 31.3%，并将 RMSE 降低了 73.8%。这些发现为基于物理的数据集设计提供了可行的指导，并强调了机器学习性能作为评估约束相关性的诊断工具。
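
The 72 percent figure can be checked directly from the two bandwidth spreads quoted in the abstract. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Half-maximum bandwidth spread (nm) with and without Fabry-Perot
# oscillations, as quoted in the abstract.
spread_with_fp_nm = 132.3
spread_without_fp_nm = 37.4

# Relative reduction when the oscillations are removed.
reduction_pct = 100.0 * (spread_with_fp_nm - spread_without_fp_nm) / spread_with_fp_nm
```

The result is about 71.7 percent, which rounds to the 72 percent stated.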

Title: Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Authors: Antara Titikhsha, Om Kulkarni, Dharun Muthaiah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22272
Pdf URL: https://arxiv.org/pdf/2512.22272
Copy Paste: [[2512.22272]] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models(https://arxiv.org/abs/2512.22272)
Keywords: generation, generative
Abstract: Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.
摘要：文本到图像扩散模型生成高度详细的纹理，但它们通常依赖于表面外观并且无法遵循严格的几何约束，特别是当这些约束与文本提示所暗示的样式冲突时。这反映了人类感知和当前生成模型之间更广泛的语义差距。我们研究是否可以通过使用轻量级、现成的鉴别器作为外部引导信号来引入几何理解，而无需专门训练。我们提出了一位在 THINGS 三元组数据集上接受过训练的人类感知嵌入（HPE）教师，该数据集捕获了人类对物体形状的敏感度。通过将这位老师的梯度注入到潜在扩散过程中，我们表明几何和风格可以以可控的方式分离。我们通过三种架构评估这种方法：具有 U-Net 主干的稳定扩散 v1.5、流匹配模型 SiT-XL/2 和扩散变压器 PixArt-{\Sigma}。我们的实验表明，在没有连续引导的情况下，流动模型往往会漂移回默认轨迹，并且我们演示了复杂的三维形状（例如伊姆斯椅）到粉红色金属等冲突材料上的零镜头转移。与无引导的基线相比，这种引导生成将语义对齐提高了约 80%。总的来说，我们的结果表明，小型教师模型可以可靠地指导大型生成系统，从而实现更强的几何控制并拓宽文本到图像合成的创意范围。

Title: GeCo: A Differentiable Geometric Consistency Metric for Video Generation

Authors: Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22274
Pdf URL: https://arxiv.org/pdf/2512.22274
Copy Paste: [[2512.22274]] GeCo: A Differentiable Geometric Consistency Metric for Video Generation(https://arxiv.org/abs/2512.22274)
Keywords: generation
Abstract: We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
摘要：我们引入了 GeCo，一种基于几何的度量，用于联合检测静态场景中的几何变形和遮挡不一致伪影。通过融合残余运动和深度先验，GeCo 生成可解释的、密集的一致性图来揭示这些伪影。我们使用 GeCo 系统地对最新的视频生成模型进行基准测试，发现常见的故障模式，并进一步将其用作免训练引导损失，以减少视频生成过程中的变形伪影。

Title: The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Authors: Dingyu Wang, Zimu Yuan, Jiajun Liu, Shanggui Liu, Nan Zhou, Tianxing Xu, Di Huang, Dong Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22275
Pdf URL: https://arxiv.org/pdf/2512.22275
Copy Paste: [[2512.22275]] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency(https://arxiv.org/abs/2512.22275)
Keywords: generation
Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.
摘要：背景：基础模型快速融入临床实践和公共卫生需要对其真正的临床推理能力进行严格评估，而不仅仅是狭隘的检查成功。目前的基准通常基于医疗执照考试或策划的小插曲，无法捕捉现实世界患者护理所必需的综合、多模式推理。方法：我们开发了骨骼与关节 (B&J) 基准，这是一个综合评估框架，包含来自骨科和运动医学领域真实患者案例的 1,245 个问题。该基准评估了反映临床推理路径的 7 个任务的模型，包括知识回忆、文本和图像解释、诊断生成、治疗计划和理由提供。我们评估了 11 个视觉语言模型 (VLM) 和 6 个大型语言模型 (LLM)，将它们的性能与专家得出的基本事实进行了比较。结果：我们的结果表明任务类型之间存在明显的性能差距。虽然最先进的模型在结构化多项选择问题上实现了超过 90% 的高精度，但在需要多模态集成的开放式任务上，其性能明显下降，准确率几乎没有达到 60%。 VLM 在解释医学图像方面表现出很大的局限性，并且经常表现出严重的文本驱动的幻觉，常常忽略相互矛盾的视觉证据。值得注意的是，专门针对医疗应用进行微调的模型与通用模型相比并没有表现出一致的优势。结论：当前的人工智能模型尚不能在临床上胜任复杂的多模式推理。目前，他们的安全部署应仅限于支持性、基于文本的角色。核心临床任务的未来进步等待多模式集成和视觉理解方面的根本性突破。

Title: Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

Authors: Zikun Guo, Adeyinka P. Adedigba, Rammohan Mallipeddi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22287
Pdf URL: https://arxiv.org/pdf/2512.22287
Copy Paste: [[2512.22287]] Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation(https://arxiv.org/abs/2512.22287)
Keywords: generation, generative
Abstract: Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.
摘要：综合设备数据对于开发非侵入式负载监控算法和实现隐私保护能源研究至关重要，但标记数据集的稀缺仍然是一个重大障碍。最近基于 GAN 的方法已经证明了合成负载模式的可行性，但大多数现有方法在单个模型中统一处理所有设备，忽略间歇性和连续设备之间的行为差异，导致训练不稳定和输出保真度有限。为了解决这些限制，我们提出了集群聚合 GAN 框架，这是一种混合生成方法，可根据每个设备的行为特征将其路由到专门的分支。对于间歇性电器，集群模块对相似的激活模式进行分组，并为每个集群分配专用发电机，确保常见和罕见的操作模式都获得足够的建模能力。连续设备遵循一个单独的分支，该分支采用基于 LSTM 的生成器来捕获逐渐的时间演化，同时通过序列压缩保持训练稳定性。对 UVIC 智能插头数据集的大量实验表明，所提出的框架在衡量真实性、多样性和训练稳定性的指标上始终优于基线方法，并且将集群集成为主动生成组件大大提高了可解释性和可扩展性。这些发现将所提出的框架确立为非侵入式负载监控研究中合成负载生成的有效方法。

Title: Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model

Authors: Renping Zhou, Zanlin Ni, Tianyi Chen, Zeyu Liu, Yang Yue, Yulin Wang, Yuxuan Wang, Jingshu Liu, Gao Huang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22288
Pdf URL: https://arxiv.org/pdf/2512.22288
Copy Paste: [[2512.22288]] Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model(https://arxiv.org/abs/2512.22288)
Keywords: generation
Abstract: Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, so the inference schedules are never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks (ImageReward, HPS, GenEval, and DPG-Bench) demonstrate the effectiveness of our approach. For more details, please refer to our project page: this https URL.
摘要：最近，掩模扩散模型（MDM）在视觉、语言和跨模态生成方面显示出了巨大的潜力。然而，他们的训练和推理程序之间存在显着的差异。特别是，MDM 推理是一个多步骤的迭代过程，不仅受模型本身的控制，还受指示令牌解码轨迹的各种时间表的控制（例如，每一步要解码多少个令牌）。相比之下，MDM 通常使用简化的单步 BERT 式目标进行训练，该目标屏蔽一部分标记并同时预测所有标记。这种步骤级简化从根本上将训练范式与推理的轨迹级性质脱节，使得推理时间表在训练期间永远不会优化。在本文中，我们介绍了 Co-GRPO，它将 MDM 生成重新表述为统一的马尔可夫决策过程 (MDP)，联合结合了模型和推理时间表。通过在轨迹级别应用组相对策略优化，Co-GRPO 在共享奖励下协同优化模型参数和调度参数，而不需要通过多步生成过程进行昂贵的反向传播。这种整体优化使训练与推理更加彻底地结合起来，并显着提高了生成质量。四个基准（ImageReward、HPS、GenEval 和 DPG-Bench）的实证结果证明了我们方法的有效性。有关更多详细信息，请参阅我们的项目页面：此 https URL 。
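
The "group relative" part of Group Relative Policy Optimization can be illustrated by the usual advantage computation: each sampled trajectory's reward is standardized against the other samples for the same prompt. This is a minimal sketch of that generic step only. The use of the population standard deviation and the `eps` constant are assumptions, and nothing here reflects how Co-GRPO treats schedule parameters.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled trajectories.

    Trajectories scoring above the group mean get positive advantages,
    those below get negative ones; no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption; variants exist)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for three trajectories sampled from the same prompt (toy values).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat = group_relative_advantages([0.5, 0.5])  # identical rewards
```

When every trajectory in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.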

Title: A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

Authors: Philip Xu, David Elizondo, Raouf Hamzaoui
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22294
Pdf URL: https://arxiv.org/pdf/2512.22294
Copy Paste: [[2512.22294]] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation(https://arxiv.org/abs/2512.22294)
Keywords: generation
Abstract: We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
摘要：我们介绍了 Uni4D，这是一个用于大规模开放词汇 3D 检索和受控 4D 生成的统一框架，基于跨文本、3D 模型和图像模态的结构化三级对齐。 Uni4D 基于 Align3D 130 数据集，采用 3D 文本多头注意力和搜索模型，通过改进的语义对齐来优化文本到 3D 检索。该框架通过三个组件进一步增强了跨模式对齐：精确文本到 3D 检索、多视图 3D 到图像对齐以及用于生成时间一致的 4D 资产的图像到文本对齐。实验结果表明，Uni4D 实现了高质量的 3D 检索和可控的 4D 生成，推进了动态多模态理解和实际应用。

Title: Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware

Authors: Vesal Ahsani, Babak Hossein Khalaj
Subjects: cs.CV, cs.HC, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2512.22298
Pdf URL: https://arxiv.org/pdf/2512.22298
Copy Paste: [[2512.22298]] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware(https://arxiv.org/abs/2512.22298)
Keywords: generation
Abstract: In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
摘要：车内驾驶员监控系统 (DMS) 必须在计算、功耗和成本的严格限制下以低延迟识别与分心和困倦相关的行为。我们推出了一款单摄像头车内驾驶员行为识别系统，专为部署在两个低成本边缘平台上而设计：Raspberry Pi 5（仅 CPU）和 Google Coral Edge TPU。所提出的管道结合了（i）紧凑的每帧视觉模型，（ii）混杂因素感知标签设计，以减少视觉上相似的误报，以及（iii）时间决策头，仅当预测既可信又持续时才触发警报。该系统涵盖 17 个行为类别，包括多种手机使用模式、饮食、吸烟、向后伸手、凝视/注意力转移、乘客互动、仪容仪表、控制面板互动、打哈欠和闭眼睡眠。训练和评估使用涵盖不同驾驶员、车辆和照明条件的许可数据集（详细信息请参阅第 6 节），并且我们进一步在实际车载测试中验证运行时行为。经过优化的部署在具有 INT8 推理功能的 Raspberry Pi 5 上实现了约 16 FPS（每帧延迟低于 60 毫秒），在 Coral Edge TPU 上实现了约 25 FPS，从而能够在廉价的硬件上实现实时监控和稳定的警报生成。最后，我们讨论了可靠的车内人类状态感知如何作为以人为中心的车辆智能的上游输入，包括新兴的代理车辆概念。
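
A temporal decision head that fires only on confident and sustained predictions can be sketched as a simple run-length filter over per-frame outputs. The threshold, the minimum run length, and the `(label, confidence)` input format are hypothetical; the abstract does not give the actual rule or its parameters.

```python
def sustained_alerts(frames, threshold=0.8, min_frames=3):
    """Return (frame_index, label) pairs where an alert fires.

    frames: iterable of (label, confidence) per-frame predictions.
    An alert fires once per run, at the frame where the same label has
    been predicted with confidence >= threshold for min_frames in a row.
    """
    alerts = []
    run_label, run_length = None, 0
    for idx, (label, conf) in enumerate(frames):
        if conf >= threshold and label == run_label:
            run_length += 1          # confident and same class: extend run
        elif conf >= threshold:
            run_label, run_length = label, 1   # confident, new class
        else:
            run_label, run_length = None, 0    # low confidence: reset
        if run_length == min_frames:
            alerts.append((idx, label))
    return alerts

frames = [
    ("phone", 0.90), ("phone", 0.85), ("phone", 0.50),  # run broken at frame 2
    ("phone", 0.90), ("phone", 0.95), ("phone", 0.92),  # sustained run
    ("yawn", 0.99),                                     # single frame only
]
alerts = sustained_alerts(frames)
```

A single confident frame, such as the final "yawn" prediction, never triggers an alert on its own. At the roughly 16 FPS reported for Raspberry Pi 5, a three-frame window would correspond to a little under 0.2 seconds.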

Title: MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Authors: Run Ling, Ke Cao, Jian Lu, Ao Ma, Haowei Liu, Runze He, Changwei Wang, Rongtao Xu, Yihua Shao, Zhanjie Zhang, Peng Wu, Guibing Guo, Wei Feng, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Xingwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22310
Pdf URL: https://arxiv.org/pdf/2512.22310
Copy Paste: [[2512.22310]] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation(https://arxiv.org/abs/2512.22310)
Keywords: generation
Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

Title: LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Authors: Xudong Ling, Tianxi Huang, Qian Dong, Tao He, Chaorong Li, Guiduo Duan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22317
Pdf URL: https://arxiv.org/pdf/2512.22317
Copy Paste: [[2512.22317]] LangPrecip: Language-Aware Multimodal Precipitation Nowcasting(https://arxiv.org/abs/2512.22317)
Keywords: generation, generative
Abstract: Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.

Title: DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22324
Pdf URL: https://arxiv.org/pdf/2512.22324
Copy Paste: [[2512.22324]] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models(https://arxiv.org/abs/2512.22324)
Keywords: generation
Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

Title: Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Authors: Xin Yu, Xiaojuan Qi, Zhengqi Li, Kai Zhang, Richard Zhang, Zhe Lin, Eli Shechtman, Tianyu Wang, Yotam Nitzan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22374
Pdf URL: https://arxiv.org/pdf/2512.22374
Copy Paste: [[2512.22374]] Self-Evaluation Unlocks Any-Step Text-to-Image Generation(https://arxiv.org/abs/2512.22374)
Keywords: generation
Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

Title: iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI

Authors: Himanshu Naidu, Yuxiang Zhang, Sachin Mehta, Anat Caspi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22392
Pdf URL: https://arxiv.org/pdf/2512.22392
Copy Paste: [[2512.22392]] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI(https://arxiv.org/abs/2512.22392)
Keywords: generation
Abstract: Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian

Title: DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

Authors: Hansang Lee, Chaelin Lee, Nieun Seo, Joon Seok Lim, Helen Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22406
Pdf URL: https://arxiv.org/pdf/2512.22406
Copy Paste: [[2512.22406]] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization(https://arxiv.org/abs/2512.22406)
Keywords: generative
Abstract: We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn's Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32\% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet's maximum converged performance ($31.03\% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.
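The few-step inference described above can be sketched as fixed-step Euler integration of dx/dt = v(x, t) from t = 0 to t = 1. This is not the DeFloMat model: the velocity field below is a hand-written stand-in for a trained network, and `euler_sample`, `make_straight_velocity`, and the box values are hypothetical. The stand-in follows a perfectly straight path toward a fixed target box, which is the idealized case a rectified flow approximates.

```python
def euler_sample(velocity, x0, num_steps=3):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 at every evaluation
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_velocity(target):
    """Stand-in for a learned field: straight-line flow toward `target`.

    On the path x_t = (1 - t) * x0 + t * target, the velocity is
    (target - x_t) / (1 - t), which equals the constant target - x0.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target_box = [0.4, 0.5, 0.2, 0.3]  # hypothetical (cx, cy, w, h)
noise_box = [0.9, 0.1, 0.7, 0.6]   # hypothetical starting sample
box = euler_sample(make_straight_velocity(target_box), noise_box, num_steps=3)
```

Because the path is straight, Euler integration reaches the target (up to floating-point error) regardless of the step count. A learned field is only approximately straight, so the number of steps still matters in practice.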

Title: EmoCtrl: Controllable Emotional Image Content Generation

Authors: Jingyuan Yang, Weibin Luo, Hui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22437
Pdf URL: https://arxiv.org/pdf/2512.22437
Copy Paste: [[2512.22437]] EmoCtrl: Controllable Emotional Image Content Generation(https://arxiv.org/abs/2512.22437)
Keywords: generation
Abstract: An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantitative and qualitative experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl's strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.

Title: Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework

Authors: Zhicheng Zhao, Yuancheng Xu, Andong Lu, Chenglong Li, Jin Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22447
Pdf URL: https://arxiv.org/pdf/2512.22447
Copy Paste: [[2512.22447]] Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework(https://arxiv.org/abs/2512.22447)
Keywords: quality assessment
Abstract: Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.

Title: Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Authors: Sukhyun Jeong, Yong-Hoon Choi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.22464
Pdf URL: https://arxiv.org/pdf/2512.22464
Copy Paste: [[2512.22464]] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing(https://arxiv.org/abs/2512.22464)
Keywords: generation
Abstract: Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
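Residual vector quantization, which the abstract uses to add residual codes on top of pose codes, can be sketched generically as follows. This is plain RVQ rather than the pose-guided tokenizer itself: the codebooks are tiny hand-picked lists, and `nearest`, `rvq_encode`, and `rvq_decode` are hypothetical helper names. Each stage quantizes whatever the previous stages left unexplained.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize `vec` stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Stage 0 is coarse, stage 1 refines (toy values, not learned codebooks).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes, leftover = rvq_encode([1.2, 0.8], codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

In the abstract's setup the coarse stage corresponds to interpretable pose codes and the later stages to residual codes. Dropping later stages at decode time gives a coarser reconstruction, which is roughly what residual dropout trains the model to tolerate.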

Title: Decomposing Task Vectors for Refined Model Editing

Authors: Hamed Damirchi, Ehsan Abbasnejad, Zhen Zhang, Javen Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22511
Pdf URL: https://arxiv.org/pdf/2512.22511
Copy Paste: [[2512.22511]] Decomposing Task Vectors for Refined Model Editing(https://arxiv.org/abs/2512.22511)
Keywords: generation
Abstract: Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.
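The task-vector definition and arithmetic the abstract builds on can be sketched with toy parameters. This covers only the basic operations (difference, addition, negation), not the paper's shared/unique decomposition. The parameter dictionaries and the helper names `task_vector` and `apply_task_vectors` are hypothetical.

```python
def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned parameters minus pre-trained parameters."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def apply_task_vectors(pretrained, vectors, scales):
    """Return pretrained + sum_i scales[i] * vectors[i]; a negative scale negates."""
    edited = {name: list(values) for name, values in pretrained.items()}
    for tau, scale in zip(vectors, scales):
        for name in edited:
            edited[name] = [e + scale * t for e, t in zip(edited[name], tau[name])]
    return edited


# Toy "models" with a single two-element parameter tensor.
pre = {"w": [1.0, 2.0]}
ft_a = {"w": [1.5, 2.0]}  # fine-tuned on hypothetical task A
ft_b = {"w": [1.0, 1.0]}  # fine-tuned on hypothetical task B

tau_a = task_vector(pre, ft_a)
tau_b = task_vector(pre, ft_b)
merged = apply_task_vectors(pre, [tau_a, tau_b], [1.0, 1.0])  # combine behaviors
negated = apply_task_vectors(pre, [tau_b], [-1.0])            # remove behavior B
```

Here the two vectors touch different coordinates, so adding them is clean. When vectors overlap, the interference the abstract describes appears, which is the motivation for separating shared from task-specific components.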

Title: DreamOmni3: Scribble-based Editing and Generation

Authors: Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li, Junjia Huang, Xu Zhao, Yitong Wang, Ruihang Chu, Bei Yu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22525
Pdf URL: https://arxiv.org/pdf/2512.22525
Copy Paste: [[2512.22525]] DreamOmni3: Scribble-based Editing and Generation(https://arxiv.org/abs/2512.22525)
Keywords: generation
Abstract: Recently, unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and scribble-based generation, that enable more flexible creation on a graphical user interface (GUI) by combining user text, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles, or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.

Title: CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Authors: Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22536
Pdf URL: https://arxiv.org/pdf/2512.22536
Copy Paste: [[2512.22536]] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation(https://arxiv.org/abs/2512.22536)
Keywords: generation
Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.

Title: Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Authors: Jesen Zhang, Ningyuan Liu, Kaitong Cai, Sidi Liu, Jing Yang, Ziliang Chen, Xiaofei Sun, Keze Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22545
Pdf URL: https://arxiv.org/pdf/2512.22545
Copy Paste: [[2512.22545]] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains(https://arxiv.org/abs/2512.22545)
Keywords: generation
Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
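The two ingredients named in the abstract, a weighted combination of process cues and a critic-free GRPO objective, can be sketched in a few lines. The cue values, the weights, and the helper names `combined_reward` and `group_relative_advantages` are assumptions for illustration, and the confidence-aware cooling mechanism is not modeled. The GRPO part shows the standard idea of normalizing rewards within a group of samples for the same prompt, so no learned value function (critic) is needed.

```python
import statistics


def combined_reward(cues, weights):
    """Weighted average of per-cue scores (weights here are hypothetical)."""
    total = sum(weights.values())
    return sum(weights[name] * cues[name] for name in weights) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Population standard deviation is used here; implementations differ
    on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled responses to the same prompt, scored on two toy cues.
weights = {"visual_grounding": 1.0, "step_consistency": 1.0}
samples = [
    {"visual_grounding": 0.2, "step_consistency": 0.4},
    {"visual_grounding": 0.5, "step_consistency": 0.5},
    {"visual_grounding": 0.9, "step_consistency": 0.7},
]
rewards = [combined_reward(cues, weights) for cues in samples]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get positive advantages and those below get negative ones, which is the signal the policy update uses.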

Title: Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification

Authors: Guikun Xu, Xiaohan Yi, Peilin Zhao, Yatao Bian
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2512.22597
Pdf URL: https://arxiv.org/pdf/2512.22597
Copy Paste: [[2512.22597]] Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification(https://arxiv.org/abs/2512.22597)
Keywords: generation, generative
Abstract: Generating low-energy conformer ensembles and identifying ground-state conformations from molecular graphs remain computationally demanding with physics-based pipelines. Current learning-based approaches often suffer from a fragmented paradigm: generative models capture diversity but lack reliable energy calibration, whereas deterministic predictors target a single structure and fail to represent ensemble variability. Here we present EnFlow, a unified framework that couples flow matching (FM) with an explicitly learned energy model through an energy-guided sampling scheme defined along a non-Gaussian FM path. By incorporating energy-gradient guidance during sampling, our method steers trajectories toward lower-energy regions, substantially improving conformational fidelity, particularly in the few-step regime. The learned energy function further enables efficient energy-based ranking of generated ensembles for accurate ground-state identification. Extensive experiments on GEOM-QM9 and GEOM-Drugs demonstrate that EnFlow simultaneously improves generation metrics with 1--2 ODE-steps and reduces ground-state prediction errors compared with state-of-the-art methods.

Title: PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment

Authors: Bin Wang, Yang Xu, Huan Zhao, Hao Zhang, Zixing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22602
Pdf URL: https://arxiv.org/pdf/2512.22602
Copy Paste: [[2512.22602]] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment(https://arxiv.org/abs/2512.22602)
Keywords: generation
Abstract: Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.
摘要：语音驱动的 3D 头部说话生成旨在生成与语音精确同步的逼真面部动画。虽然在实现高口型同步精度方面取得了相当大的进展，但现有方法在很大程度上忽视了个人说话风格的复杂细微差别，这限制了个性化和现实性。在这项工作中，我们提出了一种新颖的个性化3D头部动画框架，即“PTalker”。该框架通过从音频和面部运动序列中分离风格来保留说话风格，并通过音频和网格模态之间的三级对齐机制来增强口型同步的准确性。具体来说，为了有效地解开风格和内容，我们设计了解开约束，将驱动的音频和运动序列编码到不同的风格和内容空间中，以增强说话风格的表示。为了提高口型同步精度，我们采用了包含三个方面的模态对齐机制：使用图注意力网络捕获3D网格结构中的顶点连接性的空间对齐，使用交叉注意力捕获和同步时间依赖性的时间对齐，以及通过top-k双向对比损失和KL散度约束进行特征对齐，以确保语音和网格模态之间的一致性。对公共数据集进行的广泛定性和定量实验表明，PTalker 可以有效生成逼真、风格化的 3D 头部说话，准确匹配特定身份的说话风格，优于最先进的方法。源代码和补充视频可在以下位置获取：PTalker。

Title: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Authors: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.22615
Pdf URL: https://arxiv.org/pdf/2512.22615
Copy Paste: [[2512.22615]] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone(https://arxiv.org/abs/2512.22615)
Keywords: generation
Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $\pi_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
摘要：虽然自回归大型视觉语言模型（VLM）取得了显着的成功，但它们的顺序生成通常限制了它们在复杂视觉规划和动态机器人控制中的功效。在这项工作中，我们研究了在基于扩散的大语言模型（dLLM）上构建视觉语言模型以克服这些限制的潜力。我们推出 Dream-VL，这是一种基于开放扩散的 VLM (dVLM)，在之前的 dVLM 中实现了最先进的性能。 Dream-VL 可与基于各种基准的开放数据训练的顶级基于 AR 的 VLM 相媲美，但在应用于视觉规划任务时表现出卓越的潜力。在 Dream-VL 的基础上，我们引入了 Dream-VLA，这是一种基于 dLLM 的视觉-语言-动作模型（dVLA），通过对开放机器人数据集的持续预训练而开发。我们证明，这种扩散主干的原生双向性质可以作为 VLA 任务的卓越基础，本质上适合动作分块和并行生成，从而在下游微调中显着加快收敛速度。 Dream-VLA 在 LIBERO 上实现了 97.2% 的平均成功率，在 SimplerEnv-Bridge 上实现了 71.4% 的总体平均成功率，在 SimplerEnv-Fractal 上实现了 60.5% 的总体平均成功率，超越了 $\pi_0$ 和 GR00T-N1 等领先模型。我们还验证了 dVLM 在不同训练目标的下游任务上超越了 AR 基线。我们发布了 Dream-VL 和 Dream-VLA 以促进社区的进一步研究。

Title: Rethinking Memory Design in SAM-Based Visual Object Tracking

Authors: Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi, Naoufel Werghi, Sajid Javed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22624
Pdf URL: https://arxiv.org/pdf/2512.22624
Copy Paste: [[2512.22624]] Rethinking Memory Design in SAM-Based Visual Object Tracking(https://arxiv.org/abs/2512.22624)
Keywords: generation
Abstract: Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. This is a preprint. Some results are being finalized and may be updated in a future revision.
摘要：内存已成为在现代基于分段的框架中实现稳健的视觉对象跟踪的核心机制。最近基于 Segment Anything Model 2 (SAM2) 构建的方法通过改进过去的观测值的存储和重用方式，展示了强大的性能。然而，现有方法以特定于方法的方式解决内存限制，导致人们对基于 SAM 的跟踪中更广泛的内存设计原理知之甚少。此外，目前还不清楚这些记忆机制如何转移到更强大的下一代基础模型，例如 Segment Anything Model 3 (SAM3)。在这项工作中，我们提出了基于 SAM 的视觉对象跟踪的以记忆为中心的系统研究。我们首先分析基于 SAM2 的代表性跟踪器，并表明大多数方法的主要区别在于如何选择短期记忆帧，同时共享共同的以对象为中心的表示。基于这一见解，我们忠实地在 SAM3 框架内重新实现这些内存机制，并在十个不同的基准上进行大规模评估，从而能够对内存设计进行独立于主干强度的受控分析。在我们的实证研究的指导下，我们提出了一个统一的混合记忆框架，该框架将记忆明确地分解为短期外观记忆和长期干扰解决记忆。这种分解使得能够以模块化和有原则的方式集成现有的内存策略。大量实验表明，所提出的框架在 SAM2 和 SAM3 主干上持续提高了长期遮挡、复杂运动和干扰物较多的情况下的鲁棒性。代码可在以下位置获得：此 https URL。这是预印本。一些结果正在最终确定，可能会在未来的修订中更新。

Title: Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Authors: Yuming Gu, Yizhi Wang, Yining Hong, Yipeng Gao, Hao Jiang, Angtian Wang, Bo Liu, Nathaniel S. Dennler, Zhengfei Kuang, Hao Li, Gordon Wetzstein, Chongyang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22626
Pdf URL: https://arxiv.org/pdf/2512.22626
Copy Paste: [[2512.22626]] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion(https://arxiv.org/abs/2512.22626)
Keywords: generation
Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
摘要：具身视觉规划旨在通过想象场景如何朝着期望的目标演变并使用想象的轨迹来指导行动来实现操纵任务。视频扩散模型通过其图像到视频的生成能力，为这种视觉想象提供了有希望的基础。然而，现有的方法主要是前向预测，根据初始观察生成轨迹，没有明确的目标建模，因此常常导致空间漂移和目标错位。为了应对这些挑战，我们提出了 Envision，一个基于扩散的框架，可以为具身智能体执行视觉规划。通过使用目标图像显式约束生成，我们的方法在整个生成的轨迹中强制执行物理合理性和目标一致性。具体来说，Envision 分两个阶段运行。首先，目标图像模型识别任务相关区域，在场景和指令之间执行区域感知交叉注意，并合成捕获期望结果的连贯目标图像。然后，基于第一帧和最后一帧条件视频扩散模型 (FL2V) 构建的环境目标视频模型在初始观察和目标图像之间进行插值，产生连接起始状态和目标状态的平滑且物理上合理的视频轨迹。对象操作和图像编辑基准测试表明，与基线相比，Envision 实现了卓越的目标对齐、空间一致性和对象保存。由此产生的视觉计划可以直接支持下游机器人规划和控制，为具身智能体提供可靠的指导。

Title: FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Authors: Yidi Liu, Zihao Fan, Jie Huang, Jie Xiao, Dong Li, Wenlong Zhang, Lei Bai, Xueyang Fu, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22647
Pdf URL: https://arxiv.org/pdf/2512.22647
Copy Paste: [[2512.22647]] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution(https://arxiv.org/abs/2512.22647)
Keywords: super-resolution, generation, quality assessment
Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation, where reward models guide generators toward human preferences. Motivated by this, adapting RLHF to Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality, with Image Quality Assessment (IQA) models serving as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spuriously high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments with RLHF methods validate the effectiveness of our approach across ISR models in both global quality and local realism.
摘要：事实证明，带有人类反馈的强化学习（RLHF）在图像生成领域是有效的，以奖励模型为指导来调整人类偏好。受此推动，将 RLHF 应用于图像超分辨率 (ISR) 任务在以图像质量评估 (IQA) 模型作为奖励模型来优化感知质量方面显示出前景。然而，传统的 IQA 模型通常输出单个全局分数，这对局部和细粒度的失真异常不敏感。这种不敏感性使得 ISR 模型产生感知上不受欢迎的伪影，从而产生虚假的高分，使优化目标与感知质量不一致，并导致奖励黑客攻击。为了解决这个问题，我们提出了一种基于编码器-解码器架构的细粒度感知奖励模型（FinPercep-RM）。在提供全局质量评分的同时，它还生成感知退化图，对局部缺陷进行空间定位和量化。我们特别引入了 FGR-30k 数据集来训练该模型，该数据集包含来自现实世界超分辨率模型的各种细微失真。尽管 FinPercep-RM 模型取得了成功，但其复杂性给生成器策略学习带来了重大挑战，导致训练不稳定。为了解决这个问题，我们提出了一种共同进化课程学习（CCL）机制，其中奖励模型和 ISR 模型都经历同步课程。奖励模型的复杂性逐渐增加，而 ISR 模型从更简单的全局奖励开始，以实现快速收敛，逐渐过渡到更复杂的模型输出。这种由易到难的策略可以实现稳定的训练，同时抑制奖励黑客行为。实验验证了我们的方法在 ISR 模型中在 RLHF 方法的全局质量和局部真实性方面的有效性。

Title: Scaling Unverifiable Rewards: A Case Study on Visual Insights

Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.22650
Pdf URL: https://arxiv.org/pdf/2512.22650
Copy Paste: [[2512.22650]] Scaling Unverifiable Rewards: A Case Study on Visual Insights(https://arxiv.org/abs/2512.22650)
Keywords: generation
Abstract: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), i.e., iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating error across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the different stages of a multi-agent pipeline, instead of refining repeatedly over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and a report for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall's $\tau$ = 0.55). Our proposed Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
摘要：大型语言模型 (LLM) 代理可以通过测试时间缩放 (TTS) 以及奖励信号引导的迭代细化来日益自动化复杂的推理。然而，许多现实世界的任务涉及多阶段管道，其最终结果缺乏可验证的奖励或足够的数据来训练强大的奖励模型，使得基于判断的细化容易在各个阶段累积错误。我们提出了选择性 TTS，这是一种基于流程的细化框架，可在多智能体管道中的不同阶段扩展推理，而不是通过先前的工作随着时间的推移进行重复细化。通过跨阶段分配计算并使用特定于流程的判断器尽早修剪低质量分支，选择性 TTS 可以减轻判断漂移并稳定细化。以数据科学管道为基础，我们构建了一个端到端的多智能体管道，用于生成具有视觉洞察力的图表和给定数据集的报告，并设计了一个可靠的基于 LLM 的判断模型，与人类专家保持一致（Kendall 的 {\tau}=0.55）。然后，我们提出的选择性 TTS 在固定计算预算下提高洞察质量，将平均分数从 61.64 提高到 65.86，同时减少方差。我们希望我们的发现能够成为解决复杂、开放式任务并获得无法验证的奖励的第一步，例如科学发现和故事生成。
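
The judge-to-human agreement quoted above (Kendall's tau) is a rank correlation: it counts how many pairs of items the two raters order the same way versus the opposite way. The sketch below is a minimal illustration only. The abstract does not say which tau variant the authors used, so the tau-a form here is an assumption, and the `human` and `judge` score lists are made-up toy values, not data from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: tau-a is used for simplicity; tied pairs count as neither
    concordant nor discordant. The paper may use a tie-corrected variant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both raters order this pair the same way
        elif sign < 0:
            discordant += 1  # the raters disagree on the order of this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy scores: one swapped pair out of six possible pairs.
human = [1, 2, 3, 4]
judge = [1, 3, 2, 4]
tau = kendall_tau_a(human, judge)
print(f"tau = {tau:.3f}")
```

A value of 1 means the judge ranks every pair exactly as the human does, -1 means every pair is reversed, and 0 means no rank agreement beyond chance.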

Title: Visual Autoregressive Modelling for Monocular Depth Estimation

Authors: Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22653
Pdf URL: https://arxiv.org/pdf/2512.22653
Copy Paste: [[2512.22653]] Visual Autoregressive Modelling for Monocular Depth Estimation(https://arxiv.org/abs/2512.22653)
Keywords: generative
Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code available at: this https URL.
摘要：我们提出了一种基于视觉自回归（VAR）先验的单目深度估计方法，为基于扩散的方法提供了替代方案。我们的方法采用大规模文本到图像 VAR 模型，并引入了具有无分类器指导的按比例条件上采样机制。我们的方法在十个固定自回归阶段进行推理，仅需要 74K 合成样本进行微调，并取得了有竞争力的结果。我们报告了在受限训练条件下室内基准测试中最先进的性能，以及应用于室外数据集时的强大性能。这项工作将自回归先验建立为用于深度估计的几何感知生成模型的补充系列，突出了数据可扩展性和对 3D 视觉任务的适应性方面的优势。代码可在此 https URL 获取。

Title: Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations

Authors: Achraf Hsain, Fouad Mohammed Abbou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22672
Pdf URL: https://arxiv.org/pdf/2512.22672
Copy Paste: [[2512.22672]] Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations(https://arxiv.org/abs/2512.22672)
Keywords: generative
Abstract: This paper presents the first application of quantum generative models to learned latent space representations of computational fluid dynamics (CFD) data. While recent work has explored quantum models for learning statistical properties of fluid systems, the combination of discrete latent space compression with quantum generative sampling for CFD remains unexplored. We develop a GPU-accelerated Lattice Boltzmann Method (LBM) simulator to generate fluid vorticity fields, which are compressed into a discrete 7-dimensional latent space using a Vector Quantized Variational Autoencoder (VQ-VAE). The central contribution is a comparative analysis of quantum and classical generative approaches for modeling this physics-derived latent distribution: we evaluate a Quantum Circuit Born Machine (QCBM) and Quantum Generative Adversarial Network (QGAN) against a classical Long Short-Term Memory (LSTM) baseline. Under our experimental conditions, both quantum models produced samples with lower average minimum distances to the true distribution compared to the LSTM, with the QCBM achieving the most favorable metrics. This work provides: (1) a complete open-source pipeline bridging CFD simulation and quantum machine learning, (2) the first empirical study of quantum generative modeling on compressed latent representations of physics simulations, and (3) a foundation for future rigorous investigation at this intersection.
摘要：本文首次应用量子生成模型来学习计算流体动力学 (CFD) 数据的潜在空间表示。虽然最近的工作探索了用于学习流体系统统计特性的量子模型，但离散潜在空间压缩与 CFD 量子生成采样的组合仍未得到探索。我们开发了 GPU 加速的格子玻尔兹曼方法 (LBM) 模拟器来生成流体涡度场，使用矢量量化变分自动编码器 (VQ-VAE) 将其压缩到离散的 7 维潜在空间。核心贡献是对量子和经典生成方法的比较分析，用于对这种物理衍生的潜在分布进行建模：我们根据经典的长短期记忆（LSTM）基线评估了量子电路生成机（QCBM）和量子生成对抗网络（QGAN）。在我们的实验条件下，与 LSTM 相比，两种量子模型生成的样本与真实分布的平均最小距离较低，而 QCBM 实现了最有利的指标。这项工作提供了：(1) 连接 CFD 模拟和量子机器学习的完整开源管道，(2) 首次对物理模拟的压缩潜在表示进行量子生成建模的实证研究，以及 (3) 为未来在该交叉点进行严格研究奠定了基础。

Title: CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

Authors: ZhenQi Chen, TsaiChing Ni, YuanFu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22681
Pdf URL: https://arxiv.org/pdf/2512.22681
Copy Paste: [[2512.22681]] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation(https://arxiv.org/abs/2512.22681)
Keywords: generation
Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
摘要：最近的文本到图像扩散模型已经实现了显着的视觉保真度，但常常难以与复杂提示进行语义对齐。我们引入了 CritiFusion，这是一种新颖的推理时间框架，它将多模态语义批评机制与频域细化相结合，以提高文本到图像的一致性和细节。所提出的 CritiCore 模块利用视觉语言模型和多个大型语言模型来丰富提示上下文并产生高级语义反馈，指导扩散过程更好地将生成的内容与提示的意图保持一致。此外，SpecFusion 合并了谱域中的中间生成状态，注入粗略的结构信息，同时保留高频细节。不需要额外的模型训练。 CritiFusion 充当与现有扩散主干兼容的插件细化阶段。标准基准的实验表明，我们的方法显着改善了文本到图像对应和视觉质量的人类对齐指标。 CritiFusion 不断提高人类偏好评分和审美评估的表现，取得与最先进的奖励优化方法相当的结果。定性结果进一步证明了卓越的细节、真实性和即时保真度，表明我们的语义批评和光谱对齐策略的有效性。

Title: Autoregressive Flow Matching for Motion Prediction

Authors: Johnathan Xie, Stefan Stojanov, Cristobal Eyzaguirre, Daniel L. K. Yamins, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22688
Pdf URL: https://arxiv.org/pdf/2512.22688
Copy Paste: [[2512.22688]] Autoregressive Flow Matching for Motion Prediction(https://arxiv.org/abs/2512.22688)
Keywords: generation
Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: this https URL.
摘要：人们在不同的背景下研究了运动预测，使用在窄分布上训练的模型并将其应用于人体运动预测和机器人技术的下游任务。与此同时，最近在缩放视频预测方面的努力已经展示了令人印象深刻的视觉真实感，但尽管规模巨大，但它们仍难以准确地建模复杂的运动。受视频生成规模的启发，我们开发了自回归流匹配（ARFM），这是一种对连续连续数据进行概率建模的新方法，并在不同的视频数据集上对其进行训练，以在长范围内生成未来的点轨迹位置。为了评估我们的模型，我们开发了评估运动预测模型预测人类和机器人运动能力的基准。我们的模型能够预测复杂的运动，并且我们证明，在预测的未来轨迹上调节机器人动作预测和人体运动预测可以显着提高下游任务性能。代码和模型可公开获取：此 https URL。

Title: SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

Authors: Paul Dobre, Jackson Cooper, Xin Wang, Hongzhou Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22706
Pdf URL: https://arxiv.org/pdf/2512.22706
Copy Paste: [[2512.22706]] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis(https://arxiv.org/abs/2512.22706)
Keywords: generation
Abstract: 3D asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, these methods still struggle to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrates the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.
摘要：3D 资产插入和新颖视图合成 (NVS) 是自动驾驶模拟的关键组成部分，可增强训练数据的多样性。有了更好的多样化、涵盖多种情况（包括长尾驾驶场景）的训练数据，自动驾驶模型可以变得更加稳健和安全。这催生了一个统一的模拟框架，可以共同处理插入的 3D 资产和 NVS 的实际集成。最近的 3D 资产重建方法可以从视频中重建动态演员，支持将其重新插入到模拟驾驶场景中。虽然整体结构和外观可能是准确的，但它仍然难以通过灯光或阴影捕捉 3D 资产的真实感，特别是在插入场景时。与此同时，NVS 方法的最新进展在综合超出原始记录轨迹的观点方面取得了有希望的结果。然而，现有方法在很大程度上孤立地对待资产插入和 NVS 功能。为了允许与场景的其余部分进行交互并能够创建更多样化的新训练场景，真实的 3D 资产插入应与 NVS 相结合。为了解决这个问题，我们推出了 SCPainter（Street Car Painter），这是一个统一的框架，它将 3D Gaussian Splat (GS) 汽车资产表示和 3D 场景点云与基于扩散的生成集成在一起，共同实现真实的 3D 资产插入和 NVS。 3D GS 资产和 3D 场景点云一起投影成新颖的视图，这些投影用于调节扩散模型以生成高质量图像。对 Waymo 开放数据集的评估证明了我们的框架能够实现 3D 资产插入和 NVS，从而促进创建多样化且真实的驾驶数据。

Title: GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks

Authors: Xuyan Li, Jie Wang, Zheng Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22772
Pdf URL: https://arxiv.org/pdf/2512.22772
Copy Paste: [[2512.22772]] GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks(https://arxiv.org/abs/2512.22772)
Keywords: generation, generative
Abstract: Dynamic graphs are widely used to represent evolving real-world networks. Temporal Graph Neural Networks (TGNNs) have emerged as a powerful tool for processing such graphs, but the lack of transparency and explainability limits their practical adoption. Research on TGNN explainability is still in its early stages and faces several key issues: (i) Current methods are tailored to specific TGNN types, restricting generality. (ii) They suffer from high computational costs, making them unsuitable for large-scale networks. (iii) They often overlook the structural connectivity of explanations and require prior knowledge, reducing user-friendliness. To address these issues, we propose GRExplainer, the first universal, efficient, and user-friendly explanation method for TGNNs. GRExplainer extracts node sequences as a unified feature representation, making it independent of specific input formats and thus applicable to both snapshot-based and event-based TGNNs (the major types of TGNNs). By utilizing breadth-first search and temporal information to construct input node sequences, GRExplainer reduces redundant computation and improves efficiency. To enhance user-friendliness, we design a generative model based on Recurrent Neural Networks (RNNs), enabling automated and continuous explanation generation. Experiments on six real-world datasets with three target TGNNs show that GRExplainer outperforms existing baseline methods in generality, efficiency, and user-friendliness.
摘要：动态图被广泛用于表示不断发展的现实世界网络。时态图神经网络（TGNN）已成为处理此类图的强大工具，但缺乏透明度和可解释性限制了其实际采用。 TGNN 可解释性研究仍处于早期阶段，面临几个关键问题：（i）当前方法是针对特定 TGNN 类型定制的，限制了通用性。 (ii) 它们的计算成本很高，不适合大规模网络。 (iii) 它们常常忽视解释的结构连通性，并且需要先验知识，从而降低了用户友好性。为了解决这些问题，我们提出了 GRExplainer，这是第一个通用、高效且用户友好的 TGNN 解释方法。 GRExplainer 将节点序列提取为统一的特征表示，使其独立于特定的输入格式，因此适用于基于快照和基于事件的 TGNN（TGNN 的主要类型）。通过利用广度优先搜索和时间信息来构造输入节点序列，GRExplainer 减少了冗余计算并提高了效率。为了增强用户友好性，我们设计了一个基于循环神经网络（RNN）的生成模型，实现自动、连续的解释生成。对六个真实数据集和三个目标 TGNN 的实验表明，GRExplainer 在通用性、效率和用户友好性方面优于现有的基线方法。

Title: Plug In, Grade Right: Psychology-Inspired AGIQA

Authors: Zhicheng Liao, Baoliang Chen, Hanwei Zhu, Lingyu Zhu, Shiqi Wang, Weisi Lin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.22780
Pdf URL: https://arxiv.org/pdf/2512.22780
Copy Paste: [[2512.22780]] Plug In, Grade Right: Psychology-Inspired AGIQA(https://arxiv.org/abs/2512.22780)
Keywords: generation, quality assessment
Abstract: Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
摘要：现有的 AGIQA 模型通常通过测量和聚合从多级质量描述导出的图像嵌入和文本嵌入之间的相似性来估计图像质量。尽管有效，但我们观察到这种跨年级的相似性分布通常表现出多模态模式。例如，图像嵌入可能与“优秀”和“差”等级描述高度相似，但与“好”等级描述存在偏差。我们将这种现象称为“语义漂移”，其中文本嵌入与其预期描述之间的语义不一致破坏了文本图像共享空间学习的可靠性。为了缓解这个问题，我们从心理测量学中汲取灵感，提出了一种改进的 AGIQA 分级响应模型 (GRM)。 GRM 是一种经典的评估模型，它使用不同难度级别的测试项目对不同年级的受试者能力进行分类。这种范例与人类质量评级非常吻合，其中图像质量可以解释为图像满足各种质量等级的能力。基于这一理念，我们设计了一个两分支质量分级模块：一个分支评估图像能力，而另一个分支构建多个难度级别。为了确保难度级别的单调性，我们进一步以算术方式对难度生成进行建模，这本质上强制执行单峰且可解释的质量分布。我们基于算术 GRM 的质量分级 (AGQG) 模块具有即插即用的优势，在集成到各种最先进的 AGIQA 框架中时，可以持续提高性能。此外，它还有效地推广到自然和屏幕内容图像质量评估，揭示了其作为未来 IQA 模型关键组成部分的潜力。
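
The Graded Response Model idea described above can be made concrete with a small sketch: one scalar "ability" for the image, a ladder of difficulty thresholds built as an arithmetic progression, and per-grade probabilities obtained as differences of cumulative logistic curves. This is the textbook GRM form under assumed names and values (`arithmetic_thresholds`, `grade_probabilities`, the discrimination parameter, and all numbers are hypothetical), not the authors' AGQG module.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def arithmetic_thresholds(first, step, num_grades):
    """Difficulty levels b_k = first + k * step for the num_grades - 1 boundaries.

    A strictly positive step makes the thresholds monotone by construction.
    """
    if step <= 0:
        raise ValueError("step must be positive to keep thresholds increasing")
    return [first + k * step for k in range(num_grades - 1)]


def grade_probabilities(ability, thresholds, discrimination=1.0):
    """Textbook GRM: P(grade >= k) is logistic in (ability - b_k);
    the probability of each grade is the difference of adjacent cumulatives."""
    cumulative = (
        [1.0]
        + [sigmoid(discrimination * (ability - b)) for b in thresholds]
        + [0.0]
    )
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical example: five quality grades, e.g. bad / poor / fair / good / excellent.
thresholds = arithmetic_thresholds(first=-1.5, step=1.0, num_grades=5)
probs = grade_probabilities(ability=0.8, thresholds=thresholds)
print([round(p, 3) for p in probs])
```

Because the thresholds increase, the cumulative curves are ordered and every grade probability is non-negative. The sketch only demonstrates that property and the monotone thresholds; it does not verify the unimodality claim made in the abstract.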

Title: Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

Authors: Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22796
Pdf URL: https://arxiv.org/pdf/2512.22796
Copy Paste: [[2512.22796]] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization(https://arxiv.org/abs/2512.22796)
Keywords: generation, generative
Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving the low-latency nature of sampling. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
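
Sketch (not from the paper): the idea of combining several independent gradient evaluations inside one solver step can be illustrated on a toy ODE. Everything below is an assumption made for illustration: the function names `epd_step` and `euler_step`, the Euler predictor used to reach the intermediate points, the choice of intermediate times `taus`, and the Dirichlet concentration values. The paper's learned parameters and exact update rule are not given in the abstract.

```python
import numpy as np

def epd_step(f, x, t, t_next, taus, weights):
    """One hypothetical parallel-direction step for dx/dt = f(x, t).

    taus: intermediate times in [t, t_next]; weights: non-negative, summing to 1.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)
    d0 = f(x, t)  # gradient at the current point
    # Each intermediate gradient depends only on (x, d0), so the K evaluations
    # are independent and could run in parallel (a plain loop here for clarity).
    grads = [f(x + (tau - t) * d0, tau) for tau in taus]
    direction = sum(w * g for w, g in zip(weights, grads))
    return x + (t_next - t) * direction

def euler_step(f, x, t, t_next):
    """Baseline: a single explicit Euler step."""
    return x + (t_next - t) * f(x, t)

# Toy ODE dx/dt = -x with exact solution x(t) = x0 * exp(-t).
f = lambda x, t: -x
x0 = np.array([1.0])
exact = x0 * np.exp(-0.5)
err_euler = abs(euler_step(f, x0, 0.0, 0.5) - exact)[0]
err_epd = abs(epd_step(f, x0, 0.0, 0.5, taus=[0.125, 0.375], weights=[0.5, 0.5]) - exact)[0]
print(err_euler, err_epd)

# A stochastic variant could draw the simplex weights from a Dirichlet
# distribution (the concentration values here are made up).
sampled_weights = np.random.default_rng(0).dirichlet([2.0, 2.0])
```

On this toy problem the averaged direction gives a smaller one-step error than plain Euler, which is the effect the abstract attributes to capturing curvature within a step.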

Title: ReDiF: Reinforced Distillation for Few Step Diffusion

Authors: Amirhossein Tighkhorshid, Zahra Dehghanian, Gholamali Aminian, Chengchun Shi, Hamid R. Rabiee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.22802
Pdf URL: https://arxiv.org/pdf/2512.22802
Copy Paste: [[2512.22802]] ReDiF: Reinforced Distillation for Few Step Diffusion(https://arxiv.org/abs/2512.22802)
Keywords: generative
Abstract: Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement-learning-based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher's outputs. This RL-driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model-agnostic and applicable to any type of diffusion model with a suitable reward function, providing a general optimization paradigm for efficient diffusion learning.

Title: EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Authors: Libo Zhang, Zekun Li, Tianyu Li, Zeyu Cao, Rui Xu, Xiaoxiao Long, Wenjia Wang, Jingbo Wang, Yuan Liu, Wenping Wang, Daquan Zhou, Taku Komura, Zhiyang Dou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.22808
Pdf URL: https://arxiv.org/pdf/2512.22808
Copy Paste: [[2512.22808]] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation(https://arxiv.org/abs/2512.22808)
Keywords: generation, generative
Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
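
Sketch (not from the paper): the vector-quantisation step of a VQ-VAE, which turns continuous motion latents into discrete tokens that an autoregressive transformer can model, amounts to a nearest-neighbour lookup in a codebook. The codebook, the latents, and the function name `quantize` below are hypothetical; the encoder, decoder, and training losses are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (Euclidean distance)."""
    # Pairwise squared distances, shape (num_latents, num_codes).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])  # hypothetical 3-entry codebook
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])  # hypothetical encoder outputs
indices, quantized = quantize(latents, codebook)
print(indices)  # discrete token ids for the sequence model
```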

Title: KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution

Authors: Chenyu Li, Danfeng Hong, Bing Zhang, Zhaojie Pan, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22822
Pdf URL: https://arxiv.org/pdf/2512.22822
Copy Paste: [[2512.22822]] KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution(https://arxiv.org/abs/2512.22822)
Keywords: super-resolution
Abstract: The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image Super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we for the first time propose a novel interpretable operator, termed Kolmogorov-Arnold Neural Operator (KANO), with the application to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.
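
Sketch (not from the paper): approximating a continuous curve piecewise with an additive combination of a finite number of B-spline functions can be illustrated with degree-1 B-splines (hat functions) fitted by least squares. The knot grid, the toy curve, and the helper names `hat_basis` and `fit_additive_spline` are assumptions; KANO's actual spline order and training procedure are not stated in the abstract.

```python
import numpy as np

def hat_basis(x, knots):
    """Degree-1 B-spline (hat) basis on uniform knots, shape (len(x), len(knots))."""
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

def fit_additive_spline(x, y, knots):
    """Least-squares coefficients c such that y ~ sum_i c[i] * B_i(x)."""
    coeffs, *_ = np.linalg.lstsq(hat_basis(x, knots), y, rcond=None)
    return coeffs

x = np.linspace(0.0, 1.0, 101)
y = np.abs(x - 0.5)                # toy curve with a valley at x = 0.5
knots = np.linspace(0.0, 1.0, 5)   # hypothetical knot grid
coeffs = fit_additive_spline(x, y, knots)
max_err = np.max(np.abs(hat_basis(x, knots) @ coeffs - y))
print(coeffs, max_err)
```

Each coefficient controls the curve only near its own knot, which is what makes this kind of additive spline representation easy to inspect.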

Title: ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Authors: Bangya Liu, Xinyu Gong, Zelin Zhao, Ziyang Song, Yulei Lu, Suhui Wu, Jun Zhang, Suman Banerjee, Hao Zhang
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.22854
Pdf URL: https://arxiv.org/pdf/2512.22854
Copy Paste: [[2512.22854]] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning(https://arxiv.org/abs/2512.22854)
Keywords: generation
Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for the scarcity of HOI datasets and leverage existing datasets, we further design a training curriculum that enhances model capabilities in a progressive style and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.

Title: M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Authors: Ju-Hsuan Weng, Jia-Wei Liao, Cheng-Fu Chou, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22877
Pdf URL: https://arxiv.org/pdf/2512.22877
Copy Paste: [[2512.22877]] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models(https://arxiv.org/abs/2512.22877)
Keywords: generation, generative
Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

Title: Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

Authors: Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Haozhe Jia, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22881
Pdf URL: https://arxiv.org/pdf/2512.22881
Copy Paste: [[2512.22881]] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance(https://arxiv.org/abs/2512.22881)
Keywords: generation
Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
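
Sketch (not from the paper): the contrast between extrapolation and interpolation can be seen directly in the standard Classifier-Free Guidance combination, eps_uncond + w * (eps_cond - eps_uncond). With w > 1 the result leaves the segment between the two predictions; with 0 <= w <= 1 it stays on it. The vectors and the helper `on_segment` are hypothetical, and the abstract does not specify GPS's manifold-constrained interpolation, so this only illustrates the geometric distinction.

```python
import numpy as np

def guided_prediction(eps_uncond, eps_cond, w):
    """Standard CFG combination: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def on_segment(p, a, b, tol=1e-9):
    """True if p lies on the line segment between a and b."""
    gap = np.linalg.norm(p - a) + np.linalg.norm(p - b) - np.linalg.norm(a - b)
    return bool(abs(gap) < tol)

eps_u = np.array([0.0, 0.0])  # hypothetical unconditional noise prediction
eps_c = np.array([1.0, 2.0])  # hypothetical conditional noise prediction
extrapolated = guided_prediction(eps_u, eps_c, w=7.5)  # w > 1: extrapolation
interpolated = guided_prediction(eps_u, eps_c, w=0.6)  # 0 <= w <= 1: interpolation
print(on_segment(extrapolated, eps_u, eps_c), on_segment(interpolated, eps_u, eps_c))
```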

Title: JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22905
Pdf URL: https://arxiv.org/pdf/2512.22905
Copy Paste: [[2512.22905]] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation(https://arxiv.org/abs/2512.22905)
Keywords: generation
Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

Title: ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Authors: Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22939
Pdf URL: https://arxiv.org/pdf/2512.22939
Copy Paste: [[2512.22939]] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving(https://arxiv.org/abs/2512.22939)
Keywords: generation
Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

Title: Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects

Authors: Zhicheng Zhao, Xuanang Fan, Lingma Sun, Chenglong Li, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22949
Pdf URL: https://arxiv.org/pdf/2512.22949
Copy Paste: [[2512.22949]] Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects(https://arxiv.org/abs/2512.22949)
Keywords: generation
Abstract: High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.
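
Sketch (not from the paper): splitting a feature map into low- and high-frequency components with a discrete cosine transform can be written in a few lines. The orthonormal DCT-II matrix is standard; the square single-channel input, the low-pass rule (keep coefficients whose index sum is below `cutoff`), and the function names are assumptions. The density-guided cross-attention that follows in DFFM is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def split_frequencies(feature, cutoff):
    """Split a square 2D map into low- and high-frequency parts via the DCT."""
    n = feature.shape[0]
    c = dct_matrix(n)
    coeffs = c @ feature @ c.T                      # forward 2D DCT
    k = np.arange(n)
    low_mask = (k[:, None] + k[None, :]) < cutoff   # hypothetical low-pass rule
    low = c.T @ (coeffs * low_mask) @ c             # inverse DCT of the kept coefficients
    return low, feature - low

rng = np.random.default_rng(0)
feature = rng.normal(size=(8, 8))  # stand-in for one feature channel
low, high = split_frequencies(feature, cutoff=4)
print(np.abs(low).mean(), np.abs(high).mean())
```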

Title: FLOW: A Feedback-Driven Synthetic Longitudinal Dataset of Work and Wellbeing

Authors: Wafaa El Husseini
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.22956
Pdf URL: https://arxiv.org/pdf/2512.22956
Copy Paste: [[2512.22956]] FLOW: A Feedback-Driven Synthetic Longitudinal Dataset of Work and Wellbeing(https://arxiv.org/abs/2512.22956)
Keywords: generation
Abstract: Access to longitudinal, individual-level data on work-life balance and wellbeing is limited by privacy, ethical, and logistical constraints. This poses challenges for reproducible research, methodological benchmarking, and education in domains such as stress modeling, behavioral analysis, and machine learning. We introduce FLOW, a synthetic longitudinal dataset designed to model daily interactions between workload, lifestyle behaviors, and wellbeing. FLOW is generated using a rule-based, feedback-driven simulation that produces coherent temporal dynamics across variables such as stress, sleep, mood, physical activity, and body weight. The dataset simulates 1,000 individuals over a two-year period with daily resolution and is released as a publicly available resource. In addition to the static dataset, we describe a configurable data generation tool that enables reproducible experimentation under adjustable behavioral and contextual assumptions. FLOW is intended as a controlled experimental environment rather than a proxy for observed human populations, supporting exploratory analysis, methodological development, and benchmarking where real-world data are inaccessible.
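
Sketch (not from the paper): a rule-based, feedback-driven simulation of this kind can be pictured as a daily update loop in which each variable depends on the previous day's state of the others. Every rule, coefficient, bound, and variable name below is invented for illustration and is not FLOW's actual generator; only the daily resolution and the two-year horizon (730 days) follow the abstract.

```python
import random

def simulate_person(days=730, seed=0):
    """Toy feedback loop between workload, stress, sleep, and mood (all rules hypothetical)."""
    rng = random.Random(seed)
    clamp = lambda v, lo, hi: max(lo, min(hi, v))
    stress, sleep, mood = 0.3, 7.5, 0.7
    records = []
    for day in range(days):
        workload = clamp(rng.gauss(0.5, 0.15), 0.0, 1.0)
        # Feedback: short sleep last night raises stress today.
        stress = clamp(0.6 * stress + 0.4 * workload + 0.05 * (7.5 - sleep), 0.0, 1.0)
        # Feedback: high stress shortens tonight's sleep (hours).
        sleep = clamp(8.0 - 2.0 * stress + rng.gauss(0.0, 0.3), 4.0, 10.0)
        # Mood blends its own history with today's stress and sleep.
        mood = clamp(0.5 * mood + 0.5 * (1.0 - stress) + 0.02 * (sleep - 7.0), 0.0, 1.0)
        records.append({"day": day, "workload": workload, "stress": stress,
                        "sleep": sleep, "mood": mood})
    return records

records = simulate_person()
print(len(records), records[-1])
```

Fixing the seed makes each simulated individual reproducible, which is the property the abstract emphasizes for benchmarking.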

Title: RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

Authors: Chunyuan Chen, Yunuo Cai, Shujuan Li, Weiyun Liang, Bin Wang, Jing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22974
Pdf URL: https://arxiv.org/pdf/2512.22974
Copy Paste: [[2512.22974]] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance(https://arxiv.org/abs/2512.22974)
Keywords: generation
Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a unified out-painting-based framework for realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

Title: Reverse Personalization

Authors: Han-Wei Kung, Tuomas Varanka, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22984
Pdf URL: https://arxiv.org/pdf/2512.22984
Copy Paste: [[2512.22984]] Reverse Personalization(https://arxiv.org/abs/2512.22984)
Keywords: generation
Abstract: Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling the creation of personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features either rely on the subject being well represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at this https URL.

Title: A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment, and Fruit Detection

Authors: Soham Dutta, Soham Banerjee, Sneha Mahata, Anindya Sen, Sayantani Datta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.22990
Pdf URL: https://arxiv.org/pdf/2512.22990
Copy Paste: [[2512.22990]] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment, and Fruit Detection(https://arxiv.org/abs/2512.22990)
Keywords: quality assessment
Abstract: Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost, RGB-only, UAV-based orchard intelligence pipeline integrating ResNet50 for leaf disease detection, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.

Title: How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure

Authors: Paul M. Thompson
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.23109
Pdf URL: https://arxiv.org/pdf/2512.23109
Copy Paste: [[2512.23109]] How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure(https://arxiv.org/abs/2512.23109)
Keywords: generative
Abstract: Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average, a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation (an assumption supported by spectral structure in text and joint image-text embeddings), classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.
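
Sketch (not from the paper): the claim that sample complexity scales with intrinsic rather than ambient dimension can be illustrated with the textbook covering-number argument. Take a functional with values in [0, 1] that is Lipschitz in an embedding confined to a d-dimensional ball of a given radius, cover the ball with an eps-net of at most (1 + 2 * radius / eps)^d points, apply Hoeffding's inequality with a union bound over the net, and add 2 * lipschitz * eps for moving from any embedding to its nearest net point. The function name and all numeric values are assumptions; the paper's own bounds, including the spectrum-dependent ones, are not reproduced here.

```python
import math

def uniform_deviation_bound(n, d, radius, lipschitz, delta, eps):
    """Bound on sup |empirical - population| that holds with probability >= 1 - delta."""
    log_cover = d * math.log(1.0 + 2.0 * radius / eps)  # log size of an eps-net of the ball
    concentration = math.sqrt((log_cover + math.log(2.0 / delta)) / (2.0 * n))
    # 2 * L * eps: discretization error for the empirical and the population term.
    return concentration + 2.0 * lipschitz * eps

for d in (8, 64, 512):
    bound = uniform_deviation_bound(n=100_000, d=d, radius=1.0,
                                    lipschitz=1.0, delta=0.05, eps=0.01)
    print(d, round(bound, 4))
```

The dimension enters only through the covering term, so a small effective d keeps the bound meaningful at sample sizes where a bound stated in the ambient embedding dimension would be vacuous.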

Title: PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

Authors: Jian Wang, Sixing Rong, Jiarui Xing, Yuling Xu, Weide Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23130
Pdf URL: https://arxiv.org/pdf/2512.23130
Copy Paste: [[2512.23130]] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion(https://arxiv.org/abs/2512.23130)
Keywords: generative
Abstract: We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks; these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.
摘要：我们提出了 PathoSyn，一个用于磁共振成像 (MRI) 图像合成的统一生成框架，它将成像病理学重新表述为稳定解剖流形上的解开的附加偏差。当前的生成模型通常在全局像素域中运行或依赖于二进制掩模，这些范例经常遭受特征纠缠，导致解剖基质损坏或结构不连续。 PathoSyn 通过将合成任务分解为确定性解剖重建和随机偏差建模来解决这些限制。我们框架的核心是偏差空间扩散模型，旨在学习病理残差的条件分布，从而捕获局部强度变化，同时通过构建保持全局结构完整性。为了确保空间一致性，扩散过程与缝感知融合策略和推理时间稳定模块相结合，共同抑制边界伪影并产生高保真内部病变异质性。 PathoSyn 提供了一个数学原理管道，用于生成高保真患者特定的合成数据集，促进在低数据情况下开发稳健的诊断算法。通过允许可解释的反事实疾病进展建模，该框架支持精确的干预计划，并为临床决策支持系统的基准测试提供受控环境。对肿瘤成像基准的定量和定性评估表明，PathoSyn 在感知真实性和解剖保真度方面均显着优于整体扩散和掩模条件基线。这项工作的源代码将公开。

Title: GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23180
Pdf URL: https://arxiv.org/pdf/2512.23180
Copy Paste: [[2512.23180]] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation(https://arxiv.org/abs/2512.23180)
Keywords: generation, generative
Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub at this https URL.
摘要：随着生成模型的进步，驾驶世界模型（DWM）一直在快速发展。然而，现有的 DWM 缺乏 3D 场景理解能力，只能根据输入数据生成内容，而无法解释或推理驾驶环境。此外，当前使用点云或 BEV 特征表示 3D 空间信息的方法无法准确地将文本信息与底层 3D 场景对齐。为了解决这些限制，我们提出了一种基于 3D 高斯场景表示的新型统一 DWM 框架，该框架既可以实现 3D 场景理解和多模态场景生成，同时还可以丰富理解和生成任务的上下文。我们的方法通过将丰富的语言特征嵌入到每个高斯基元中，直接将文本信息与 3D 场景对齐，从而实现早期模态对齐。此外，我们设计了一种新颖的任务感知语言引导采样策略，可以消除冗余的 3D 高斯分布，并将准确且紧凑的 3D 标记注入 LLM。此外，我们设计了一种双条件多模态生成模型，其中视觉语言模型捕获的信息被用作高级语言条件与低级图像条件相结合，共同指导多模态生成过程。我们对 nuScenes 和 NuInteract 数据集进行全面研究，以验证我们框架的有效性。我们的方法实现了最先进的性能。我们将在 GitHub 这个 https URL 上公开发布代码。

Title: Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

Authors: Changgyoon Oh, Jongoh Jeong, Jegyeong Cho, Kuk-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23210
Pdf URL: https://arxiv.org/pdf/2512.23210
Copy Paste: [[2512.23210]] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks(https://arxiv.org/abs/2512.23210)
Keywords: generative
Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.
摘要：去噪扩散概率模型在生成任务方面带来了巨大进步，迄今为止实现了最先进的性能。当前基于扩散模型的应用程序通过附加特定于任务的解码器，利用从多步前向-后向马尔可夫过程学习的视觉表示的力量来执行单任务预测任务。然而，扩散时间步长特征的启发式选择仍然严重依赖于经验直觉，通常导致偏向某些任务的次优性能。为了缓解这一限制，我们通过自适应地选择最适合少样本密集预测任务的时间步长来研究通用扩散时间步长特征的重要性，并在任意未见过的任务上进行评估。为此，我们提出了两个模块：任务感知时间步长选择（TTS），用于根据时间步损失和相似性得分选择理想的扩散时间步长；时间步长特征合并（TFC），用于合并所选时间步长特征，以提高几次镜头设置中的密集预测性能。伴随着我们的参数高效微调适配器，我们的框架仅在少量支持查询的情况下就有效地实现了密集预测性能的优越性。我们在大规模具有挑战性的 Taskonomy 数据集上凭经验验证了我们的可学习时间步合并方法，以进行密集预测，特别是对于实际的通用和小样本学习场景。

Title: Bridging Your Imagination with Audio-Video Generation via a Unified Director

Authors: Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2512.23222
Pdf URL: https://arxiv.org/pdf/2512.23222
Copy Paste: [[2512.23222]] Bridging Your Imagination with Audio-Video Generation via a Unified Director(https://arxiv.org/abs/2512.23222)
Keywords: generation
Abstract: Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
摘要：现有的人工智能驱动的视频创作系统通常将剧本起草和关键镜头设计视为两个互不相交的任务：前者依赖于大型语言模型，而后者依赖于图像生成模型。我们认为这两项任务应该统一在一个框架内，因为逻辑推理和想象力都是电影导演的基本素质。在这项工作中，我们提出了 UniMAGE，这是一种统一的导演模型，它将用户提示与结构良好的脚本联系起来，从而使非专家能够利用现有的音视频生成模型来制作长上下文、多镜头的电影。为了实现这一目标，我们采用了 Mixture-of-Transformers 架构来统一文本和图像生成。为了进一步增强叙事逻辑和关键帧一致性，我们引入了“先交错，然后解开”的训练范例。具体来说，我们首先执行交错概念学习，它利用交错的文本图像数据来促进模型对脚本的更深入理解和富有想象力的解释。然后，我们进行解开专家学习，将脚本编写与关键帧生成分离，从而在讲故事时实现更大的灵活性和创造力。大量实验表明，UniMAGE 在开源模型中实现了最先进的性能，生成逻辑连贯的视频脚本和视觉一致的关键帧图像。

Title: Anomaly Detection by Effectively Leveraging Synthetic Images

Authors: Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23227
Pdf URL: https://arxiv.org/pdf/2512.23227
Copy Paste: [[2512.23227]] Anomaly Detection by Effectively Leveraging Synthetic Images(https://arxiv.org/abs/2512.23227)
Keywords: generative
Abstract: Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two-stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost of data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.
摘要：异常检测在工业制造中起着至关重要的作用。由于真实缺陷图像的稀缺，仅依赖正常图像的无监督方法已被广泛研究。最近，基于扩散的生成模型引起了人们对训练数据合成作为替代解决方案的关注。在这项工作中，我们专注于有效利用合成图像来最大限度地提高异常检测性能的策略。以前的合成策略大致分为两组，呈现出明显的权衡。基于规则的合成（例如注入噪声或粘贴补丁）具有成本效益，但通常无法生成真实的缺陷图像。另一方面，基于生成模型的合成可以创建高质量的缺陷图像，但需要大量成本。为了解决这个问题，我们提出了一种新颖的框架，利用预先训练的文本引导图像到图像翻译模型和图像检索模型来有效生成合成缺陷图像。具体来说，图像检索模型评估生成的图像与真实正常图像的相似度并过滤掉不相关的输出，从而提高生成的缺陷图像的质量和相关性。为了有效地利用合成图像，我们还引入了两阶段训练策略。在此策略中，模型首先对来自基于规则的合成的大量图像进行预训练，然后对较小的高质量图像集进行微调。该方法显着降低了数据收集成本，同时提高了异常检测性能。 MVTec AD 数据集上的实验证明了我们方法的有效性。

Title: KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Authors: Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu
Subjects: cs.LG, cs.AI, cs.AR, cs.MA, cs.PF
Abstract URL: https://arxiv.org/abs/2512.23236
Pdf URL: https://arxiv.org/pdf/2512.23236
Copy Paste: [[2512.23236]] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta(https://arxiv.org/abs/2512.23236)
Keywords: generation
Abstract: Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
摘要：快速高效地进行深度学习推荐模型 (DLRM) 训练和推理非常重要。然而，这提出了三个关键的系统挑战：模型架构多样性、内核原语多样性以及硬件生成和架构异构性。本文提出了 KernelEvolve（一种代理内核编码框架），用于解决 DLRM 的大规模异构性问题。 KernelEvolve 旨在以内核规范作为输入，并自动化跨异构硬件架构的推荐模型的内核生成和优化过程。 KernelEvolve 通过在多个编程抽象上运行来实现这一点，从 Triton 和 CuTe DSL 到低级硬件无关语言，跨越完整的硬件-软件优化堆栈。内核优化过程被描述为基于图的搜索，具有选择策略、通用运算符、适应度函数和终止规则，通过检索增强提示合成动态适应运行时执行上下文。我们设计、实施和部署 KernelEvolve 来优化跨代 NVIDIA 和 AMD GPU 以及 Meta 的 AI 加速器的各种生产推荐模型。我们在公开的 KernelBench 套件上验证了 KernelEvolve，在三个难度级别的所有 250 个问题上实现了 100% 的通过率，并在三个异构硬件平台上验证了 160 个 PyTorch ATen 运算符，证明了 100% 的正确性。 KernelEvolve 将开发时间从几周缩短到几小时，并在不同的生产用例和大规模异构 AI 系统中实现了比 PyTorch 基线显着的性能改进。除了性能效率的提高之外，KernelEvolve 还可以为内部开发的 AI 硬件实现自动内核生成，从而显着缓解新 AI 硬件的可编程性障碍。

Title: RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23239
Pdf URL: https://arxiv.org/pdf/2512.23239
Copy Paste: [[2512.23239]] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models(https://arxiv.org/abs/2512.23239)
Keywords: super-resolution, generation, generative
Abstract: Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
摘要：基于扩散的遥感 (RS) 生成基础模型对于下游任务至关重要。然而，这些模型依赖于大量具有全局代表性的数据，这些数据通常包含冗余、噪声和类别不平衡，从而降低了训练效率并阻碍了收敛。现有的遥感扩散基础模型通常聚合多个分类数据集或应用简单的重复数据删除，忽略了生成建模的分布要求和遥感图像的异质性。为了解决这些限制，我们提出了一种免训练的两阶段数据修剪方法，该方法可以在高修剪率下快速选择高质量子集，使初步基础模型能够快速收敛，并作为生成、下游微调和其他应用程序的多功能骨干。我们的方法联合考虑具有全局场景级多样性和代表性的局部信息内容。首先，基于熵的标准有效地去除了低信息样本。接下来，利用RS场景分类数据集作为参考基准，通过分层采样进行场景感知聚类，以提高聚类效果，同时降低大规模未标记数据的计算成本。最后，通过平衡簇级均匀性和样本代表性，该方法能够在高剪枝率下进行细粒度选择，同时保持整体多样性和代表性。实验表明，即使在修剪 85% 的训练数据后，我们的方法仍显着提高了收敛性和生成质量。此外，使用我们的方法训练的扩散基础模型在下游任务中始终实现最先进的性能，包括超分辨率和语义图像合成。这种数据修剪范式为开发 RS 生成基础模型提供了实用指导。
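The first pruning stage described in the abstract above (an entropy-based criterion that removes low-information samples) might look roughly like the following sketch. The abstract does not specify how entropy is computed, so the intensity-histogram Shannon entropy, the threshold value, and the names `shannon_entropy` and `drop_low_information` are assumptions made here, not the authors' implementation.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the pixel-intensity histogram."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def drop_low_information(images, threshold):
    """Keep the names of images whose entropy is at least `threshold` (assumed rule)."""
    return [name for name, px in images.items() if shannon_entropy(px) >= threshold]

# Toy "images" as flat lists of grayscale values (assumed data).
images = {
    "flat": [0] * 16,                   # a uniform tile, e.g. open water or cloud
    "textured": [0, 64, 128, 255] * 4,  # four intensities, equally frequent
}
kept = drop_low_information(images, threshold=1.0)
print(kept)
```

A real pipeline would compute this per image over decoded arrays; the second stage (scene-aware clustering with stratified sampling) is not covered by this sketch.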

Title: ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Authors: Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23245
Pdf URL: https://arxiv.org/pdf/2512.23245
Copy Paste: [[2512.23245]] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation(https://arxiv.org/abs/2512.23245)
Keywords: generation
Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemConsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: this https URL
摘要：最近的文本到图像扩散模型显着提高了视觉质量和文本对齐。然而，生成图像序列，同时在不同的场景描述中保持一致的角色身份仍然是一项具有挑战性的任务。现有的方法经常在保持身份一致性和确保每个图像的及时对齐之间进行权衡。在本文中，我们介绍了一种新颖的框架 ASemconsist，它通过选择性文本嵌入修改来解决这一挑战，从而在不牺牲提示对齐的情况下实现对字符身份的显式语义控制。此外，基于我们对 FLUX 中填充嵌入的分析，我们提出了一种语义控制策略，将填充嵌入重新用作语义容器。此外，我们引入了一种自适应特征共享策略，该策略自动评估文本歧义性并仅对模糊的身份提示应用约束。最后，我们提出了一个统一的评估协议，即一致性质量得分（CQS），它将身份保留和每图像文本对齐集成到一个综合指标中，明确捕获两个指标之间的性能不平衡。我们的框架实现了最先进的性能，有效地克服了之前的权衡。项目页面：此 https URL

Title: Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23258
Pdf URL: https://arxiv.org/pdf/2512.23258
Copy Paste: [[2512.23258]] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization(https://arxiv.org/abs/2512.23258)
Keywords: generation
Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of the model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which minimizes the caching error, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, and it is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$\alpha$, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.
摘要：尽管扩散变压器（DiT）已成为图像和视频生成的主要架构，但其迭代去噪过程导致推理缓慢，从而阻碍了更广泛的适用性和发展。基于缓存的方法实现了免训练加速，但会产生相当大的计算误差。现有的方法通常结合纠错策略，例如修剪或预测来减轻它。然而，它们的固定缓存策略无法适应去噪过程中复杂的误差变化，这限制了纠错的全部潜力。为了应对这一挑战，我们通过累积误差最小化为现有的纠错方法提出了一种新颖的保真度优化插件，名为 CEM。 CEM预定义了误差来表征模型对受时间步长和缓存间隔共同影响的加速的敏感性。在此先验的指导下，我们制定了一种具有累积误差近似的动态规划算法来进行策略优化，实现了缓存误差最小化，从而显着提高了生成保真度。 CEM 与模型无关，并且具有很强的泛化性，可适应任意加速预算。它可以无缝集成到现有的纠错框架和量化模型中，而不会引入任何额外的计算开销。对九个生成模型和跨三个任务的量化方法进行的大量实验表明，CEM 显着提高了现有加速模型的生成保真度，并且在 FLUX.1-dev、PixArt-$\alpha$、StableDiffusion1.5 和 Hunyuan 上优于原始生成性能。该代码将公开。
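The abstract above describes a dynamic programming algorithm that picks a caching strategy by minimizing cumulative error under an acceleration budget. The exact formulation is not given there, so the following is only one plausible sketch under stated assumptions: denoising steps are split into contiguous segments, each segment runs one full forward pass at its first step and reuses the cache afterwards, and a hypothetical `segment_error(t, length)` table (which jointly depends on the timestep and the cache interval) gives the error of each segment.

```python
import math
from functools import lru_cache

def plan_cache_schedule(num_steps, budget, segment_error):
    """Choose exactly `budget` full-compute steps minimizing summed segment error.

    Returns (total_error, list_of_full_compute_steps). The formulation is an
    assumption for illustration, not the CEM algorithm itself.
    """
    @lru_cache(maxsize=None)
    def best(t, b):
        # Best (error, compute_steps) for steps t..num_steps-1 with b computes left.
        if t == num_steps:
            return (0.0, ()) if b == 0 else (math.inf, ())
        if b == 0:
            return (math.inf, ())
        result = (math.inf, ())
        for length in range(1, num_steps - t + 1):
            tail_err, tail_steps = best(t + length, b - 1)
            total = segment_error(t, length) + tail_err
            if total < result[0]:
                result = (total, (t,) + tail_steps)
        return result

    err, steps = best(0, budget)
    return err, list(steps)

# Hypothetical error prior: early timesteps are more sensitive, and the error
# grows quadratically with the number of steps served from the cache.
sensitivity = [5, 3, 2, 1, 1, 1]

def toy_error(t, length):
    return sensitivity[t] * (length - 1) ** 2

total_error, compute_steps = plan_cache_schedule(6, 3, toy_error)
print(total_error, compute_steps)
```

With this toy prior the schedule spends its full computes on the early, sensitive steps and uses the longest cache interval late in the trajectory, which is the kind of non-uniform schedule a fixed caching interval cannot express.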

Title: On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases

Authors: Alexander Korotin, Gudmund Pammer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.23265
Pdf URL: https://arxiv.org/pdf/2512.23265
Copy Paste: [[2512.23265]] On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases(https://arxiv.org/abs/2512.23265)
Keywords: generative
Abstract: This paper studies the inverse problem of flow matching (FM) between distributions with finite exponential moment, a problem motivated by modern generative AI applications such as the distillation of flow matching models. Uniqueness of the solution is established in two cases - the one-dimensional setting and the Gaussian case. The general multidimensional problem remains open for future studies.
摘要：本文研究了具有有限指数矩的分布之间的流匹配（FM）逆问题，这是由现代生成式人工智能应用（例如流匹配模型的蒸馏）引发的问题。解决方案的唯一性在两种情况下成立 - 一维设置和高斯情况。一般的多维问题仍然有待未来的研究。

Title: CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Authors: Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia, Xiaodong Ge, Jingqun Tang, Benlei Cui, Bin Li, Xiangyang Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23333
Pdf URL: https://arxiv.org/pdf/2512.23333
Copy Paste: [[2512.23333]] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation(https://arxiv.org/abs/2512.23333)
Keywords: generation
Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
摘要：计算机辅助设计 (CAD) 在工业设计中至关重要，但传统 CAD 建模和工作流程的复杂性对自动生成高精度、可编辑 CAD 模型提出了重大挑战。现有的从草图重建 3D 模型的方法通常会产生不可编辑的近似模型，无法满足工业设计中对精度和可编辑性的严格要求。此外，对文本或基于图像的输入的依赖通常需要大量的手动注释，限制了它们在工业环境中的可扩展性和适用性。为了克服这些挑战，我们提出了异构协作多专家强化学习 (CME-CAD) 范式，这是一种用于 CAD 代码生成的新型训练范式。我们的方法集成了这些模型的互补优势，促进协作学习并提高模型生成准确、约束兼容且完全可编辑的 CAD 模型的能力。我们引入了一个两阶段的训练过程：多专家微调（MEFT）和多专家强化学习（MERL）。此外，我们还推出了 CADExpert，这是一个由 17,299 个实例组成的开源基准测试，包括具有精确尺寸注释的正交投影、专家生成的思想链 (CoT) 流程、可执行 CADQuery 代码和渲染的 3D 模型。

Title: SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Authors: Kanghee Lee, Injae Lee, Minseok Kwak, Kwonyoung Ryu, Jungi Hong, Jaesik Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23365
Pdf URL: https://arxiv.org/pdf/2512.23365
Copy Paste: [[2512.23365]] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility(https://arxiv.org/abs/2512.23365)
Keywords: generation
Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
摘要：多模态大语言模型 (MLLM) 的快速发展释放了增强 3D 场景理解和空间推理的潜力。然而，现有方法通常依赖于预先构建的 3D 表示或现成的重建管道，这限制了可扩展性和现实世界的适用性。最近的一项工作探索直接从多视图图像学习空间推理，使视觉语言模型 (VLM) 能够理解 3D 场景，而无需显式 3D 重建。然而，现实世界环境中经常出现的关键挑战，例如部分可见性、遮挡和需要根据碎片视觉线索进行空间推理的低重叠条件，仍然有待探索。为了解决这些限制，我们提出了一种可扩展的多视图数据生成和注释管道，用于构建现实的空间推理 QA，从而产生 SpatialMosaic，这是一个具有 2M QA 对的综合指令调整数据集。我们进一步介绍了 SpatialMosaic-Bench，这是一个具有挑战性的基准，用于在现实和具有挑战性的场景下评估多视图空间推理，由跨 6 个任务的 1M QA 对组成。此外，我们还提出了 SpatialMosaicVLM，这是一种混合框架，它将 3D 重建模型作为几何编码器集成到 VLM 中，以实现稳健的空间推理。大量实验表明，我们提出的数据集和 VQA 任务在具有挑战性的多视图条件下有效增强了空间推理，验证了我们的数据生成管道在构建真实且多样化的 QA 对方面的有效性。代码和数据集即将推出。

Title: Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

Authors: Yilun Luo, HuaQing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23367
Pdf URL: https://arxiv.org/pdf/2512.23367
Copy Paste: [[2512.23367]] Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2(https://arxiv.org/abs/2512.23367)
Keywords: generation
Abstract: Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B, variants of the openPangu large language model, integrate three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think. While these CoT modes enhance reasoning capabilities, their generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation, conducted across all three CoT modes on code generation benchmarks (HumanEval and MBPP), demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.
摘要：华为的openPangu-Embedded-1B和openPangu-Embedded-7B是openPangu大语言模型的变体，集成了三种不同的思想链（CoT）推理范式，即slow_think、auto_think和no_think。虽然这些 CoT 模式增强了推理能力，但它们生成的扩展推理轨迹会带来大量内存和延迟开销，给升腾 NPU 上的实际部署带来挑战。本文通过利用低位量化来解决这些计算限制，将 FP16 计算转换为更高效的整数算术。我们引入了统一的低位推理框架，支持 INT8 (W8A8) 和 W4A8 量化，专门针对 Atlas A2 上的 openPangu-Embedded 模型进行了优化。我们在代码生成基准（HumanEval 和 MBPP）上对所有三种 CoT 模式进行了全面评估，证明了这种方法的有效性。 INT8 量化始终保持超过 90% 的 FP16 基线精度，并在 Atlas A2 上实现 1.5 倍的预填充加速。此外，W4A8 量化显着减少了内存消耗，尽管在准确性方面有一定的折衷。这些发现共同表明，低位量化可有效促进 Ascend NPU 上的高效 CoT 推理，保持高模型保真度。
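To make the idea of turning floating-point values into integer arithmetic concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. It is a generic textbook scheme, not the paper's framework: the abstract does not describe its calibration or scaling method, and the names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate floating-point values."""
    return [x * scale for x in q]

# Toy weights (assumed data).
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_err)
```

The rounding error per value is bounded by half the scale, which is why a single outlier that inflates `max_abs` degrades the precision of every other value; finer-grained (per-channel or per-group) scales are a common remedy.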

Title: NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization

Authors: Yifei Li, Haoyuan He, Yu Zheng, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23374
Pdf URL: https://arxiv.org/pdf/2512.23374
Copy Paste: [[2512.23374]] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization(https://arxiv.org/abs/2512.23374)
Keywords: generation
Abstract: The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate real-world, various generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.

Title: Diffusion priors enhanced velocity model building from time-lag images using a neural operator

Authors: Xiao Ma, Mohammad Hasyim Taufik, Tariq Alkhalifah
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.23375
Pdf URL: https://arxiv.org/pdf/2512.23375
Copy Paste: [[2512.23375]] Diffusion priors enhanced velocity model building from time-lag images using a neural operator(https://arxiv.org/abs/2512.23375)
Keywords: generative
Abstract: Velocity model building serves as a crucial component for achieving high precision subsurface imaging. However, conventional velocity model building methods are often computationally expensive and time consuming. In recent years, with the rapid advancement of deep learning, particularly the success of generative models and neural operators, deep learning based approaches that integrate data and their statistics have attracted increasing attention in addressing the limitations of traditional methods. In this study, we propose a novel framework that combines generative models with neural operators to obtain high resolution velocity models efficiently. Within this workflow, the neural operator functions as a forward mapping operator to rapidly generate time lag reverse time migration (RTM) extended images from the true and migration velocity models. In this framework, the neural operator is acting as a surrogate for modeling followed by migration, which uses the true and migration velocities, respectively. The trained neural operator is then employed, through automatic differentiation, to gradually update the migration velocity placed in the true velocity input channel with high resolution components so that the output of the network matches the time lag images of observed data obtained using the migration velocity. By embedding a generative model, trained on a high-resolution velocity model distribution, which corresponds to the true velocity model distribution used to train the neural operator, as a regularizer, the resulting predictions are cleaner with higher resolution information. Both synthetic and field data experiments demonstrate the effectiveness of the proposed generative neural operator based velocity model building approach.

Title: SoulX-LiveTalk Technical Report

Authors: Le Shen, Qiao Qian, Tan Yu, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Ming Tao, Shunshun Yin, Siyuan Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23379
Pdf URL: https://arxiv.org/pdf/2512.23379
Copy Paste: [[2512.23379]] SoulX-LiveTalk Technical Report(https://arxiv.org/abs/2512.23379)
Keywords: generation
Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.

Title: Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Authors: Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23413
Pdf URL: https://arxiv.org/pdf/2512.23413
Copy Paste: [[2512.23413]] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment(https://arxiv.org/abs/2512.23413)
Keywords: generation, quality assessment
Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

Title: DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Authors: Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Hangjun Ye, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23421
Pdf URL: https://arxiv.org/pdf/2512.23421
Copy Paste: [[2512.23421]] DriveLaW:Unifying Planning and Video Generation in a Latent Driving World(https://arxiv.org/abs/2512.23421)
Keywords: generation
Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

Title: Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Authors: Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23426
Pdf URL: https://arxiv.org/pdf/2512.23426
Copy Paste: [[2512.23426]] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision(https://arxiv.org/abs/2512.23426)
Keywords: generative
Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: this https URL

Title: RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

Authors: Shuhong Liu, Chenyu Bao, Ziteng Cui, Yun Liu, Xuangeng Chu, Lin Gu, Marcos V. Conde, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Tianhan Xu, Yuan Gan, Yusuke Kurose, Tatsuya Harada
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2512.23437
Pdf URL: https://arxiv.org/pdf/2512.23437
Copy Paste: [[2512.23437]] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction(https://arxiv.org/abs/2512.23437)
Keywords: restoration
Abstract: We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.

Title: Stochastic Siamese MAE Pretraining for Longitudinal Medical Images

Authors: Taha Emre, Arunava Chakravarty, Thomas Pinetz, Dmitrii Lachinov, Martin J. Menten, Hendrik Scholl, Sobha Sivaprasad, Daniel Rueckert, Andrew Lotery, Stefan Sacu, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.23441
Pdf URL: https://arxiv.org/pdf/2512.23441
Copy Paste: [[2512.23441]] Stochastic Siamese MAE Pretraining for Longitudinal Medical Images(https://arxiv.org/abs/2512.23441)
Keywords: generation
Abstract: Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the 2 input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT and one MRI datasets with multiple visits per patient. STAMP pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late stage Age-Related Macular Degeneration and Alzheimer's Disease progression prediction which require models to learn the underlying non-deterministic temporal dynamics of the diseases.

Title: CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23453
Pdf URL: https://arxiv.org/pdf/2512.23453
Copy Paste: [[2512.23453]] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models(https://arxiv.org/abs/2512.23453)
Keywords: generation, generative
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at this https URL.

Title: Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

Authors: Kayathri Vigneswaran, Hugo Retief, Jai Clifford Holmes, Mariangel Garcia Andarcia, Hansaka Tennakoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23454
Pdf URL: https://arxiv.org/pdf/2512.23454
Copy Paste: [[2512.23454]] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin(https://arxiv.org/abs/2512.23454)
Keywords: generative
Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT 4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real time river gauge digitization and improved water resource management.
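The abstract above reports mean absolute error, root mean square error, and R squared for the extracted gauge readings. The sketch below shows how these three metrics are conventionally computed. The function name `regression_metrics` and the sample readings are hypothetical and are not data from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired observed and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical water-level readings in centimetres.
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [102.0, 118.0, 143.0, 159.0]
mae, rmse, r2 = regression_metrics(observed, predicted)
```

RMSE penalizes large individual misreadings more heavily than MAE, so a gap between the two (as in the reported 5.43 cm versus 8.58 cm) suggests that a few readings carry larger errors than the rest.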

Title: Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

Authors: Bohan Xiao, Peiyong Wang, Qisheng He, Ming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23463
Pdf URL: https://arxiv.org/pdf/2512.23463
Copy Paste: [[2512.23463]] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators(https://arxiv.org/abs/2512.23463)
Keywords: super-resolution, generation, generative
Abstract: Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: this https URL
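For readers unfamiliar with the Brownian bridge mentioned in the abstract above, the following sketch computes the marginal mean and variance of a standard Brownian bridge pinned at two endpoints and draws a sample from it. This is the textbook process only. The paper's exact formulation and its two approximators are not reproduced here, and the names `brownian_bridge_marginal` and `sample_bridge` are assumptions made for this example.

```python
import math
import random


def brownian_bridge_marginal(x0, x1, t, T=1.0):
    """Mean and variance at time t of a standard Brownian bridge
    pinned at x0 (time 0) and x1 (time T)."""
    mean = (1.0 - t / T) * x0 + (t / T) * x1
    var = t * (T - t) / T
    return mean, var


def sample_bridge(x0, x1, t, T=1.0, rng=random):
    """Draw one sample of the bridge state at time t."""
    mean, var = brownian_bridge_marginal(x0, x1, t, T)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


# Scalar endpoints stand in for pixel values from the two image domains.
mid_mean, mid_var = brownian_bridge_marginal(0.0, 2.0, 0.5)
start = sample_bridge(0.0, 2.0, 0.0)
end = sample_bridge(0.0, 2.0, 1.0)
```

The variance vanishes at both endpoints, so the process starts exactly at one image and ends exactly at the other, which is the property that makes bridge dynamics attractive for paired translation tasks.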

Title: HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Authors: Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen, Yue Qian, Ningxin Jiao, Changhai Chen, Weijie Chen, Yiran Wang, Jinkun Guo, Dongyue An, Han Liu, Yanyu Tong, Chao Zhang, Qing Guo, Juan Chen, Qiao Zhang, Youyi Zhang, Zihao Yao, Cheng Zhang, Hong Duan, Xiaoping Wu, Qi Chen, Fei Cheng, Liang Dong, Peng He, Hao Zhang, Jiaxin Lin, Chao Zhang, Zhongyi Fan, Yifan Li, Zhichao Hu, Yuhong Liu, Linus, Jie Jiang, Xiaolong Li, Linchao Bao
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.23464
Pdf URL: https://arxiv.org/pdf/2512.23464
Copy Paste: [[2512.23464]] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation(https://arxiv.org/abs/2512.23464)
Keywords: generation
Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

Title: SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context

Authors: Shuyuan Lin, Hailiang Liao, Qiang Qi, Junjie Huang, Taotao Lai, Jian Weng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23473
Pdf URL: https://arxiv.org/pdf/2512.23473
Copy Paste: [[2512.23473]] SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context(https://arxiv.org/abs/2512.23473)
Keywords: generation
Abstract: Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model's position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at this http URL.

Title: IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

Authors: Donghao Zhou, Jingyu Lin, Guibao Shen, Quande Liu, Jialin Gao, Lihao Liu, Lan Du, Cunjian Chen, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23519
Pdf URL: https://arxiv.org/pdf/2512.23519
Copy Paste: [[2512.23519]] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation(https://arxiv.org/abs/2512.23519)
Keywords: generation, generative
Abstract: Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

Title: Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

Authors: Hexin Zhang, Dong Li, Jie Huang, Bingzhou Wang, Xueyang Fu, Zhengjun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23532
Pdf URL: https://arxiv.org/pdf/2512.23532
Copy Paste: [[2512.23532]] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution(https://arxiv.org/abs/2512.23532)
Keywords: super-resolution
Abstract: Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
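The frequency fusion idea in the IAFS abstract (high-frequency perceptual cues combined with low-frequency structural information) can be illustrated with a minimal sketch. This is not the authors' method: the function name `frequency_fuse`, the hard radial cutoff, and the use of a plain 2D FFT on single-channel images are all assumptions made for illustration.

```python
import numpy as np

def frequency_fuse(structure_img, detail_img, cutoff=0.25):
    """Hypothetical helper: low frequencies from structure_img,
    high frequencies from detail_img, split at a radial cutoff
    (in cycles per pixel)."""
    if structure_img.ndim != 2 or structure_img.shape != detail_img.shape:
        raise ValueError("expected two 2D arrays of identical shape")
    h, w = structure_img.shape
    # Normalised frequency coordinates for every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low_mask = (radius <= cutoff).astype(float)
    # Blend the two spectra: structure below the cutoff, detail above it.
    fused = (low_mask * np.fft.fft2(structure_img)
             + (1.0 - low_mask) * np.fft.fft2(detail_img))
    # The mask is symmetric, so the inverse transform is real up to rounding.
    return np.real(np.fft.ifft2(fused))
```

Fusing an image with itself returns the image unchanged, and the fused result always inherits its mean intensity (the zero-frequency term) from `structure_img`. An adaptive scheme, as the abstract describes, would presumably choose the split per image or per region rather than use a fixed `cutoff`.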

Title: AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23537
Pdf URL: https://arxiv.org/pdf/2512.23537
Copy Paste: [[2512.23537]] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization(https://arxiv.org/abs/2512.23537)
Keywords: generation
Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing or conflicting subjects, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.

Title: PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Authors: Zongsheng Cao, Yangfan He, Anran Liu, Jun Xie, Feng Chen, Zepeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23546
Pdf URL: https://arxiv.org/pdf/2512.23546
Copy Paste: [[2512.23546]] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation(https://arxiv.org/abs/2512.23546)
Keywords: generation
Abstract: Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at this https URL.
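The dual-space transformation in the PurifyGen abstract rests on two standard linear-algebra projections, sketched below. The function names are hypothetical, and the sketch assumes that each row of a concept matrix is one concept embedding, so that the "range space of clean concepts" is the span of those rows. How PurifyGen combines the two projections is not stated in the abstract and is not reproduced here.

```python
import numpy as np

def project_to_null_space(embedding, toxic_matrix):
    """Remove every component of `embedding` that lies in the span of
    the toxic concept rows (assumed layout: one concept per row)."""
    # pinv(T) @ T is the orthogonal projector onto the row space of T,
    # so subtracting it leaves the part of the embedding with T @ e = 0.
    pinv = np.linalg.pinv(toxic_matrix)
    return embedding - pinv @ (toxic_matrix @ embedding)

def project_to_range_space(embedding, clean_matrix):
    """Keep only the component of `embedding` inside the span of the
    clean concept rows (same row-per-concept assumption)."""
    return np.linalg.pinv(clean_matrix) @ (clean_matrix @ embedding)
```

After the null-space projection, the embedding has zero inner product with every toxic concept row, which is one concrete reading of "removing harmful semantic components". Both projections are idempotent, so applying either one twice changes nothing.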

Title: ThinkGen: Generalized Thinking for Visual Generation

Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23568
Pdf URL: https://arxiv.org/pdf/2512.23568
Copy Paste: [[2512.23568]] ThinkGen: Generalized Thinking for Visual Generation(https://arxiv.org/abs/2512.23568)
Keywords: generation, generative
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: this https URL

Title: ProGuard: Towards Proactive Multimodal Safeguard

Authors: Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23573
Pdf URL: https://arxiv.org/pdf/2512.23573
Copy Paste: [[2512.23573]] ProGuard: Towards Proactive Multimodal Safeguard(https://arxiv.org/abs/2512.23573)
Keywords: generative
Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

Title: LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23576
Pdf URL: https://arxiv.org/pdf/2512.23576
Copy Paste: [[2512.23576]] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation(https://arxiv.org/abs/2512.23576)
Keywords: generation
Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

Title: Memorization in 3D Shape Generation: An Empirical Study

Authors: Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, Zhuang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23628
Pdf URL: https://arxiv.org/pdf/2512.23628
Copy Paste: [[2512.23628]] Memorization in 3D Shape Generation: An Empirical Study(https://arxiv.org/abs/2512.23628)
Keywords: generation, generative
Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at this https URL.

Title: OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Authors: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23646
Pdf URL: https://arxiv.org/pdf/2512.23646
Copy Paste: [[2512.23646]] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding(https://arxiv.org/abs/2512.23646)
Keywords: generation
Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.

Title: IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Authors: Kang Du, Yirui Guan, Zeyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23667
Pdf URL: https://arxiv.org/pdf/2512.23667
Copy Paste: [[2512.23667]] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition(https://arxiv.org/abs/2512.23667)
Keywords: generative
Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
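The three factors named in the IDT abstract (diffuse reflectance, diffuse shading, specular shading) can be recombined into an image under a commonly used formation model, sketched below. The abstract does not give the exact equation, so the form "reflectance times diffuse shading, plus specular shading" is an assumption, as are the function name `compose_image` and the clipping to the range [0, 1].

```python
import numpy as np

def compose_image(diffuse_reflectance, diffuse_shading, specular_shading):
    """Hypothetical recomposition: image = reflectance * shading + specular.

    Arrays are expected in [0, 1] with shapes that broadcast together,
    for example (H, W, 3) reflectance with (H, W, 1) shading."""
    # Lambertian term: material colour modulated by incoming diffuse light.
    lambertian = diffuse_reflectance * diffuse_shading
    # Non-Lambertian term is additive and independent of the reflectance.
    return np.clip(lambertian + specular_shading, 0.0, 1.0)
```

Keeping the specular term additive is what lets an edit to the material colour leave highlights untouched, which matches the abstract's point about separating Lambertian from non-Lambertian light transport.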

Title: Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23705
Pdf URL: https://arxiv.org/pdf/2512.23705
Copy Paste: [[2512.23705]] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation(https://arxiv.org/abs/2512.23705)
Keywords: generative
Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

Title: Training AI Co-Scientists Using Rubric Rewards

Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse
Subjects: cs.LG, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2512.23707
Pdf URL: https://arxiv.org/pdf/2512.23707
Copy Paste: [[2512.23707]] Training AI Co-Scientists Using Rubric Rewards(https://arxiv.org/abs/2512.23707)
Keywords: generation
Abstract: AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.

Title: Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Authors: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.23709
Pdf URL: https://arxiv.org/pdf/2512.23709
Copy Paste: [[2512.23709]] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion(https://arxiv.org/abs/2512.23709)
Keywords: super-resolution
Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: this https URL