2026-03-16

Title: From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness

Authors: Terrence J. Lee-St. John, Jordan L. Lawson, Bartlomiej Piechowski-Jozwiak
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2603.12288
Pdf URL: https://arxiv.org/pdf/2603.12288
Copy Paste: [[2603.12288]] From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness(https://arxiv.org/abs/2603.12288)
Keywords: generative
Abstract: Tabular machine learning presents a paradox: modern models achieve state-of-the-art performance using high-dimensional (high-D), collinear, error-prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor-space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high-D sets of error-prone predictors asymptotically overcomes both types of noise, whereas cleaning a low-D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances reliability and convergence efficiency, and explain why increased dimensionality reduces the latent inference burden, enabling feasibility with finite samples. To address practical constraints, we propose "Proactive Data-Centric AI" to identify predictors that enable robustness efficiently. We also derive boundaries for Systematic Error Regimes and show why models that absorb "rogue" dependencies can mitigate assumption violations. Linking latent architecture to Benign Overfitting, we offer a first step towards a unified view of robustness to Outcome Error and predictor-space noise, while also delineating when traditional DCAI's focus on label cleaning remains powerful. By redefining data quality from item-level perfection to portfolio-level architecture, we provide a theoretical rationale for "Local Factories" -- learning from live, uncurated enterprise "data swamps" -- supporting a deployment paradigm shift from "Model Transfer" to "Methodology Transfer" to overcome static generalizability limitations.
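The claim that many error-prone predictors can overcome Predictor Error has a familiar analogue in psychometrics: the Spearman-Brown formula for the reliability of the mean of k parallel noisy indicators of one latent factor. The sketch below uses that textbook formula purely as an illustration; it is not taken from the paper, it says nothing about Structural Uncertainty, and the function name and the single-indicator reliability of 0.3 are assumptions.

```python
def composite_reliability(r, k):
    """Spearman-Brown: reliability of the mean of k parallel indicators,
    each with single-indicator reliability r (0 < r <= 1)."""
    if not 0.0 < r <= 1.0 or k < 1:
        raise ValueError("need 0 < r <= 1 and k >= 1")
    return k * r / (1.0 + (k - 1) * r)


if __name__ == "__main__":
    # Assumed, illustrative value: each predictor alone is quite noisy.
    single = 0.3
    for k in (1, 5, 50, 500):
        print(k, round(composite_reliability(single, k), 3))
```

Under these assumptions the composite reliability rises monotonically toward 1 as k grows, which is the intuition behind preferring a wide set of noisy indicators over a narrow, cleaned one.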

Title: Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions

Authors: Ziwei Wang, Zhentao He, Xingyi He, Hongbin Wang, Tianwang Jia, Jingwei Luo, Siyang Li, Xiaoqing Chen, Dongrui Wu
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2603.12296
Pdf URL: https://arxiv.org/pdf/2603.12296
Copy Paste: [[2603.12296]] Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions(https://arxiv.org/abs/2603.12296)
Keywords: generation, generative
Abstract: Deep learning has achieved transformative performance across diverse domains, largely driven by the large-scale, high-quality training data. In contrast, the development of brain-computer interfaces (BCIs) is fundamentally constrained by the limited, heterogeneous, and privacy-sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a compelling way to mitigate data scarcity and enhance model capacity. This survey provides a comprehensive review of brain signal generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, and key applications. We systematically categorize existing generative algorithms into four types: knowledge-based, feature-based, model-based, and translation-based approaches. Furthermore, we benchmark existing brain signal generation approaches across four representative BCI paradigms to provide an objective performance comparison. Finally, we discuss the potentials and challenges of current generation approaches and prospect future research on accurate, data-efficient, and privacy-aware BCI systems. The benchmark codebase is publicized at this https URL.

Title: VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Authors: Yiwen Song, Tomas Pfister, Yale Song
Subjects: cs.CV, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2603.12310
Pdf URL: https://arxiv.org/pdf/2603.12310
Copy Paste: [[2603.12310]] VQQA: An Agentic Approach for Video Evaluation and Quality Improvement(https://arxiv.org/abs/2603.12310)
Keywords: generation
Abstract: Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

Title: Sinkhorn-Drifting Generative Models

Authors: Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.12366
Pdf URL: https://arxiv.org/pdf/2603.12366
Copy Paste: [[2603.12366]] Sinkhorn-Drifting Generative Models(https://arxiv.org/abs/2603.12366)
Keywords: generative
Abstract: We establish a theoretical link between the recently proposed "drifting" generative dynamics and gradient flows induced by the Sinkhorn divergence. In a particle discretization, the drift field admits a cross-minus-self decomposition: an attractive term toward the target distribution and a repulsive/self-correction term toward the current model, both expressed via one-sided normalized Gibbs kernels. We show that Sinkhorn divergence yields an analogous cross-minus-self structure, but with each term defined by entropic optimal-transport couplings obtained through two-sided Sinkhorn scaling (i.e., enforcing both marginals). This provides a precise sense in which drifting acts as a surrogate for a Sinkhorn-divergence gradient flow, interpolating between one-sided normalization and full two-sided Sinkhorn scaling. Crucially, this connection resolves an identifiability gap in prior drifting formulations: leveraging the definiteness of the Sinkhorn divergence, we show that zero drift (equilibrium of the dynamics) implies that the model and target measures match. Experiments show that Sinkhorn drifting reduces sensitivity to kernel temperature and improves one-step generative quality, trading off additional training time for a more stable optimization, without altering the inference procedure used by drift methods. These theoretical gains translate to strong low-temperature improvements in practice: on FFHQ-ALAE at the lowest temperature setting we evaluate, Sinkhorn drifting reduces mean FID from 187.7 to 37.1 and mean latent EMD from 453.3 to 144.4, while on MNIST it preserves full class coverage across the temperature sweep. Project page: this https URL
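The contrast between one-sided normalization and two-sided Sinkhorn scaling can be made concrete on a tiny example. The sketch below is a minimal illustration, not the authors' code: the squared-distance Gibbs kernel, the 1-D points, the temperature `eps`, and the uniform marginals are all assumptions, and the drift field itself is not implemented.

```python
import math


def gibbs_kernel(xs, ys, eps):
    """K[i][j] = exp(-(x_i - y_j)^2 / eps) for 1-D points (assumed cost)."""
    return [[math.exp(-((x - y) ** 2) / eps) for y in ys] for x in xs]


def one_sided(K):
    """Normalize each row to sum to 1; only one marginal is enforced."""
    out = []
    for row in K:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def sinkhorn(K, n_iters=200):
    """Two-sided scaling: alternate row and column rescaling so the plan
    has uniform marginals 1/n over rows and 1/m over columns."""
    n, m = len(K), len(K[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    K = gibbs_kernel([0.0, 1.0, 2.0], [0.5, 1.5, 3.0], eps=4.0)
    row_norm = one_sided(K)
    plan = sinkhorn(K)
    # Column sums: uneven for the one-sided weights, uniform for the plan.
    print([round(sum(r[j] for r in row_norm), 3) for j in range(3)])
    print([round(sum(r[j] for r in plan), 3) for j in range(3)])
```

The one-sided weights satisfy the row constraint only, so target points can be over- or under-weighted; the Sinkhorn plan satisfies both marginals at the cost of an inner iteration.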

Title: Unleashing Video Language Models for Fine-grained HRCT Report Generation

Authors: Yingying Fang, Huichi Zhou, KinHei Lee, Yijia Wang, Zhenxuan Zhang, Jiahao Huang, Guang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12469
Pdf URL: https://arxiv.org/pdf/2603.12469
Copy Paste: [[2603.12469]] Unleashing Video Language Models for Fine-grained HRCT Report Generation(https://arxiv.org/abs/2603.12469)
Keywords: generation
Abstract: Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at this https URL

Title: CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

Authors: Tianshuo Xu, Tiantian Hong, Zhifei Chen, Fei Chao, Ying-cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12482
Pdf URL: https://arxiv.org/pdf/2603.12482
Copy Paste: [[2603.12482]] CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning(https://arxiv.org/abs/2603.12482)
Keywords: restoration, generation
Abstract: Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline (Text → Layout → Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.

Title: RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

Authors: Ali Mosleh, Faraz Ali, Fengjia Zhang, Stavros Tsogkas, Junyong Lee, Alex Levinshtein, Michael S. Brown
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12493
Pdf URL: https://arxiv.org/pdf/2603.12493
Copy Paste: [[2603.12493]] RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution(https://arxiv.org/abs/2603.12493)
Keywords: super-resolution, generation
Abstract: Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.

Title: Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

Authors: Joong Ho Kim, Nicholas Thai, Souhardya Saha Dip, Dong Lao, Keith G. Mills
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12506
Pdf URL: https://arxiv.org/pdf/2603.12506
Copy Paste: [[2603.12506]] Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation(https://arxiv.org/abs/2603.12506)
Keywords: generation, generative
Abstract: Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.
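The selection step described in this abstract (score candidate initial noises for a prompt, keep the best few, forward them to the diffusion model) can be sketched as a simple top-k filter. Everything below is hypothetical scaffolding: `toy_quality_predictor` is a stand-in for the learned scorer, not the paper's model, and the noise dimension and candidate count are arbitrary assumptions.

```python
import random


def toy_quality_predictor(prompt, noise):
    """Hypothetical stand-in for a learned (prompt, noise) -> quality scorer.
    It only exists so the selection logic below can run."""
    return -abs(sum(noise) / len(noise)) + 0.01 * len(prompt)


def select_noises(prompt, candidates, predictor, k):
    """Return the k candidate noises with the highest predicted quality."""
    ranked = sorted(candidates, key=lambda z: predictor(prompt, z), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    # 16 candidate Gaussian noise vectors of (assumed) dimension 8.
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
    chosen = select_noises("a red bicycle", candidates, toy_quality_predictor, k=4)
    print(len(chosen), "noises would be forwarded to the diffusion model")
```

Because the scorer runs before any denoising, the cost of rejecting a poor seed is one predictor call rather than a full generation cycle.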

Title: MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

Authors: Youngrae Kim, Qixin Hu, C.-C. Jay Kuo, Peter A. Beerel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12513
Pdf URL: https://arxiv.org/pdf/2603.12513
Copy Paste: [[2603.12513]] MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens(https://arxiv.org/abs/2603.12513)
Keywords: generation
Abstract: Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
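The Memory Tokens idea (compressing all past keys into long-term and short-term streams with exponential moving averages, inside a fixed-size cache) can be sketched as follows. This is a minimal illustration under assumptions: the class name, the two decay values, and the use of plain lists for key vectors are hypothetical, and the Online RoPE Indexing component is not modeled.

```python
def ema_update(memory, key, decay):
    """Elementwise: memory <- decay * memory + (1 - decay) * key."""
    return [decay * m + (1.0 - decay) * k for m, k in zip(memory, key)]


class DualEmaMemory:
    """Two EMA streams over incoming (unrotated) keys. The cache holds
    exactly two vectors regardless of how many keys have been seen."""

    def __init__(self, long_decay=0.99, short_decay=0.5):
        # Decay values are illustrative assumptions, not from the paper.
        self.long_decay = long_decay
        self.short_decay = short_decay
        self.long = None
        self.short = None

    def update(self, key):
        if self.long is None:
            # Initialize both streams with the first key.
            self.long = list(key)
            self.short = list(key)
        else:
            self.long = ema_update(self.long, key, self.long_decay)
            self.short = ema_update(self.short, key, self.short_decay)


if __name__ == "__main__":
    mem = DualEmaMemory()
    for _ in range(50):
        mem.update([1.0, 0.0])   # early content
    for _ in range(5):
        mem.update([0.0, 1.0])   # recent content
    print("long :", [round(x, 3) for x in mem.long])
    print("short:", [round(x, 3) for x in mem.short])
```

With these assumed decays the long-term stream still reflects the early content while the short-term stream tracks the recent keys, which mirrors the stated goal of keeping both global identity and recent dynamics.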

Title: Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Authors: Yura Choi, Roy Miles, Rolandos Alexandros Potamias, Ismail Elezi, Jiankang Deng, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12533
Pdf URL: https://arxiv.org/pdf/2603.12533
Copy Paste: [[2603.12533]] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering(https://arxiv.org/abs/2603.12533)
Keywords: generation
Abstract: Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate the open research, we will release the code, model, and dataset. Project page: this https URL

Title: Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A., Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12545
Pdf URL: https://arxiv.org/pdf/2603.12545
Copy Paste: [[2603.12545]] Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA(https://arxiv.org/abs/2603.12545)
Keywords: generative
Abstract: Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.

Title: Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Authors: Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.12554
Pdf URL: https://arxiv.org/pdf/2603.12554
Copy Paste: [[2603.12554]] Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages(https://arxiv.org/abs/2603.12554)
Keywords: generation
Abstract: Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at this https URL.

Title: AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

Authors: Xuanhua Yin, Chuanzhi Xu, Haoxian Zhou, Boyu Wei, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12575
Pdf URL: https://arxiv.org/pdf/2603.12575
Copy Paste: [[2603.12575]] AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation(https://arxiv.org/abs/2603.12575)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11× speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at this https URL.

Title: DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration

Authors: Youngjin Oh, Junhyeong Kwon, Nam Ik Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12579
Pdf URL: https://arxiv.org/pdf/2603.12579
Copy Paste: [[2603.12579]] DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration(https://arxiv.org/abs/2603.12579)
Keywords: restoration
Abstract: This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2's image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.
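The adaptive feature fusion described here (combining features from several DINOv2 layers with a point-wise softmax mask) amounts to a per-position convex combination across layers. The sketch below illustrates only that operation; how the mask logits are produced is not stated in the abstract, so they are passed in as an assumed input, and the nested-list layout and function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(features, logits):
    """features[l][p] is the feature vector of layer l at position p;
    logits[l][p] is the (assumed given) mask logit for that layer and position.
    Returns fused[p], a softmax-weighted sum over layers at each position."""
    n_layers, n_pos = len(features), len(features[0])
    fused = []
    for p in range(n_pos):
        weights = softmax([logits[l][p] for l in range(n_layers)])
        dim = len(features[0][p])
        fused.append([
            sum(weights[l] * features[l][p][d] for l in range(n_layers))
            for d in range(dim)
        ])
    return fused


if __name__ == "__main__":
    # Two layers, two positions, 2-D features (toy values).
    feats = [[[1.0, 0.0], [1.0, 0.0]],
             [[0.0, 1.0], [0.0, 1.0]]]
    # Position 0 weights the layers equally; position 1 strongly prefers layer 1.
    logits = [[0.0, -20.0],
              [0.0, 20.0]]
    print(fuse_layers(feats, logits))
```

Because the weights are computed independently at every position, different image regions can draw on different layers, for example shallower features in textured areas and deeper ones where semantics matter.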

Title: Maximizing Incremental Information Entropy for Contrastive Learning

Authors: Jiansong Zhang, Zhuoqin Yang, Xu Wu, Xiaoling Luo, Peizhong Liu, Linlin Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.12594
Pdf URL: https://arxiv.org/pdf/2603.12594
Copy Paste: [[2603.12594]] Maximizing Incremental Information Entropy for Contrastive Learning(https://arxiv.org/abs/2603.12594)
Keywords: generation
Abstract: Contrastive learning has achieved remarkable success in self-supervised representation learning, often guided by information-theoretic objectives such as mutual information maximization. Motivated by the limitations of static augmentations and rigid invariance constraints, we propose IE-CL (Incremental-Entropy Contrastive Learning), a framework that explicitly optimizes the entropy gain between augmented views while preserving semantic consistency. Our theoretical framework reframes the challenge by identifying the encoder as an information bottleneck and proposes a joint optimization of two components: a learnable transformation for entropy generation and an encoder regularizer for its preservation. Experiments on CIFAR-10/100, STL-10, and ImageNet demonstrate that IE-CL consistently improves performance under small-batch settings. Moreover, our core modules can be seamlessly integrated into existing frameworks. This work bridges theoretical principles and practice, offering a new perspective in contrastive learning.
摘要：对比学习在自监督表征学习中取得了显着的成功，通常以互信息最大化等信息论目标为指导。受静态增强和严格不变性约束的限制，我们提出了 IE-CL（增量熵对比学习），这是一个显式优化增强视图之间的熵增益，同时保持语义一致性的框架。我们的理论框架通过将编码器识别为信息瓶颈来重新构建挑战，并提出两个组件的联合优化：用于熵生成的可学习变换和用于保存熵的编码器正则化器。在 CIFAR-10/100、STL-10 和 ImageNet 上的实验表明，IE-CL 在小批量设置下持续提高性能。此外，我们的核心模块可以无缝集成到现有框架中。这项工作将理论原理与实践联系起来，为对比学习提供了新的视角。

Title: Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs

Authors: Zixin Wen, Yifu Cai, Kyle Lee, Sam Estep, Josh Sunshine, Aarti Singh, Yuejie Chi, Wode Ni
Subjects: cs.LG, cs.AI, cs.HC, cs.MA, cs.SE
Abstract URL: https://arxiv.org/abs/2603.12597
Pdf URL: https://arxiv.org/pdf/2603.12597
Copy Paste: [[2603.12597]] Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs(https://arxiv.org/abs/2603.12597)
Keywords: generation
Abstract: Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receive feedback and visually refine the diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions at very low cost and in little time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.
摘要：视觉设计是最先进的多模式人工智能系统的重要应用。改进这些系统需要大规模的高质量视觉语言数据。尽管互联网图像和文本数据丰富，但知识丰富且对齐良好的图像文本对却很少见。在本文中，我们提出了一个使用我们的代理 Feynman 构建的可扩展图表生成管道。为了创建图表，费曼首先枚举特定领域的知识组件（“想法”），并根据这些想法进行代码规划。根据计划，费曼将想法转化为简单的声明性程序，并迭代接收反馈并在视觉上完善图表。最后，声明性程序由彭罗斯图表系统呈现。 Penrose 基于优化的渲染保留了视觉语义，同时为布局注入了新鲜的随机性，从而生成具有视觉一致性和多样性的图表。因此，费曼可以用很少的成本和时间来绘制图表和接地说明。使用 Feynman，我们合成了一个包含超过 10 万个对齐良好的图表标题对的数据集。我们还根据新生成的数据策划了一个视觉语言基准，Diagramma。图解可用于评估视觉语言模型的视觉推理能力。我们计划将数据集、基准测试和完整的代理管道作为开源项目发布。

Title: Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

Authors: Guodong Sun, Qihang Liang, Xingyu Pan, Moyun Liu, Yang Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2603.12624
Pdf URL: https://arxiv.org/pdf/2603.12624
Copy Paste: [[2603.12624]] Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains(https://arxiv.org/abs/2603.12624)
Keywords: generation
Abstract: Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: this https URL
摘要：由于复杂的运行环境、结构重复的组件以及安全关键区域频繁的遮挡或污染，货运列车中准确的视觉故障检测仍然是智能交通系统维护的关键挑战。在这种情况下，基于卷积神经网络和 Transformer 的传统实例分割方法往往泛化性差且边界精度有限。为了应对这些挑战，我们提出了一种专为货运列车故障检测量身定制的轻量级自我提示实例分割框架。我们的方法通过引入自动生成特定任务提示的自提示生成模块来利用分段任何模型，从而实现从基础模型到特定领域检查任务的有效知识转移。此外，我们采用Tiny Vision Transformer主干来降低计算成本，使该框架适合在铁路监控系统的边缘设备上实时部署。我们构建了从现实世界货运检验站收集的特定领域数据集，并进行了广泛的评估。实验结果表明，我们的方法在数据集上达到了 74.6 $AP^{\text{box}}$ 和 74.2 $AP^{\text{mask}}$，在准确性和鲁棒性方面优于现有的最先进方法，同时保持较低的计算开销。这项工作为自动化货运列车检查提供了一种可部署且高效的视觉解决方案，展示了基础模型适应在工业规模故障诊断场景中的潜力。项目页面：此 https URL

Title: Adaptive Diffusion Posterior Sampling for Data and Model Fusion of Complex Nonlinear Dynamical Systems

Authors: Dibyajyoti Chakraborty, Hojin Kim, Romit Maulik
Subjects: cs.LG, nlin.CD, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2603.12635
Pdf URL: https://arxiv.org/pdf/2603.12635
Copy Paste: [[2603.12635]] Adaptive Diffusion Posterior Sampling for Data and Model Fusion of Complex Nonlinear Dynamical Systems(https://arxiv.org/abs/2603.12635)
Keywords: generative
Abstract: High-fidelity numerical simulations of chaotic, high-dimensional nonlinear dynamical systems are computationally expensive, necessitating the development of efficient surrogate models. Most surrogate models for such systems are deterministic, for example when neural operators are involved. However, deterministic models often fail to capture the intrinsic distributional uncertainty of chaotic systems. This work presents a surrogate modeling formulation that leverages generative machine learning, where a deep learning diffusion model is used to probabilistically forecast turbulent flows over long horizons. We introduce a multi-step autoregressive diffusion objective that significantly enhances long-rollout stability compared to standard single-step training. To handle complex, unstructured geometries, we utilize a multi-scale graph transformer architecture incorporating diffusion preconditioning and voxel-grid pooling. More importantly, our modeling framework provides a unified platform that also predicts spatiotemporally important locations for sensor placement, either via uncertainty estimates or through an error-estimation module. Finally, the observations of the ground truth state at these dynamically varying sensor locations are assimilated using diffusion posterior sampling, which requires no retraining of the surrogate model. We demonstrate our methodology on two-dimensional homogeneous and isotropic turbulence and on a flow over a backward-facing step, showing its utility in forecasting, adaptive sensor placement, and data assimilation for high-dimensional chaotic systems.
摘要：混沌、高维非线性动力系统的高保真数值模拟的计算成本很高，因此需要开发高效的替代模型。此类系统的大多数替代模型都是确定性的，例如当涉及神经算子时。然而，确定性模型通常无法捕捉混沌系统的内在分布不确定性。这项工作提出了一种利用生成机器学习的替代建模公式，其中深度学习扩散模型用于概率预测长期的湍流。我们引入了多步自回归扩散目标，与标准单步训练相比，它显着增强了长期部署稳定性。为了处理复杂的、非结构化的几何形状，我们利用结合了扩散预处理和体素网格池的多尺度图形转换器架构。更重要的是，我们的建模框架提供了一个统一的平台，还可以通过不确定性估计或通过误差估计模块来预测传感器放置的时空重要位置。最后，使用扩散后验采样来同化这些动态变化的传感器位置处的地面真实状态的观察结果，无需重新训练替代模型。我们提出了关于二维均匀和各向同性湍流以及向后步骤流动的方法，展示了其在高维混沌系统的预测、自适应传感器放置和数据同化中的实用性。

Title: RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Authors: Ruicheng Zhang, Guangyu Chen, Zunnan Xu, Zihao Liu, Zhizhou Zhong, Mingyang Zhang, Jun Zhou, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12639
Pdf URL: https://arxiv.org/pdf/2603.12639
Copy Paste: [[2603.12639]] RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization(https://arxiv.org/abs/2603.12639)
Keywords: generation
Abstract: Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.
摘要：由于现实世界交互的高昂成本和安全风险，可扩展的嵌入式人工智能面临着根本性的限制。虽然具体世界模型（EWM）通过想象中的推出提供了希望，但现有方法受到几何幻觉的影响，并且缺乏用于实际政策改进的统一优化框架。我们推出了 RoboStereo，这是一种对称双塔 4D 世界模型，采用双向跨模态增强来确保时空几何一致性并减轻物理幻觉。在此高保真 4D 模拟器的基础上，我们提出了第一个基于世界模型的策略优化的统一框架：(1) 用于执行前验证的测试时策略增强 (TTPA)，(2) 模仿进化策略学习 (IEPL)，利用视觉感知奖励从专家演示中学习，(3) 开放探索策略学习 (OEPL)，支持自主技能发现和自我纠正。综合实验表明，RoboStereo 实现了最先进的生成质量，我们的统一框架在细粒度操作任务上实现了 >97% 的平均相对改进。

Title: From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12648
Pdf URL: https://arxiv.org/pdf/2603.12648
Copy Paste: [[2603.12648]] From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space(https://arxiv.org/abs/2603.12648)
Keywords: generation
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, in which a group of generated samples is evaluated against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
摘要：组相对策略优化 (GRPO) 已成为文本到图像 (T2I) 流模型中偏好对齐的强大框架。然而，我们观察到，根据单一条件评估一组生成样本的标准范式对样本间关系的探索不足，限制了对齐效率和性能上限。为了解决这种稀疏的单视图评估方案，我们提出了多视图GRPO（MV-GRPO），这是一种通过扩大条件空间来创建密集的多视图奖励映射来增强关系探索的新方法。具体来说，对于从一个提示生成的一组样本，MV-GRPO 利用灵活的条件增强器来生成语义上相邻但多样化的标题。这些字幕可以实现多视图优势重新估计，捕获不同的语义属性并提供更丰富的优化信号。通过导出以这些新标题为条件的原始样本的概率分布，我们可以将它们合并到训练过程中，而无需昂贵的样本重新生成。大量实验表明，MV-GRPO 比最先进的方法具有更优越的对齐性能。
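
To make the idea of re-estimating advantages per caption more concrete, here is a minimal sketch. It uses the usual group-relative normalization (reward minus group mean, divided by group standard deviation) and applies it once per caption "view". The function names, the population standard deviation, and the choice to keep one advantage per (caption, sample) pair are assumptions; the abstract does not specify how MV-GRPO combines the views.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_view_advantages(reward_matrix, eps=1e-8):
    """Re-estimate advantages once per caption.

    reward_matrix[k][i] is the reward of sample i scored against caption k
    (row 0 could be the original prompt, later rows hypothetical augmented captions).
    Returns a matrix of the same shape with one advantage per (caption, sample) pair.
    """
    return [group_advantages(row, eps) for row in reward_matrix]


# The same three samples scored under two captions can rank differently.
toy_rewards = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
for caption_index, row in enumerate(multi_view_advantages(toy_rewards)):
    print(caption_index, [round(a, 3) for a in row])
```

A single condition yields one advantage per sample; scoring the same samples against several captions yields a denser signal without generating new samples, which is the point the abstract makes.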

Title: VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Authors: Xiangyu Sun, Shijie Wang, Fengyi Zhang, Lin Liu, Caiyan Jia, Ziying Song, Zi Huang, Yadan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12655
Pdf URL: https://arxiv.org/pdf/2603.12655
Copy Paste: [[2603.12655]] VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model(https://arxiv.org/abs/2603.12655)
Keywords: generation
Abstract: World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
摘要：通过生成未来视频帧来预测场景演变的世界模型将其大部分容量用于光度细节，但最终的预测通常在几何上保持不一致。我们提出了 VGGT-World，这是一种几何世界模型，它完全回避视频生成，而是预测冻结几何基础模型 (GFM) 特征的时间演化。具体来说，我们将冻结 VGGT 的潜在标记重新用作世界状态，并训练一个轻量级时间流变换器来自回归预测它们的未来轨迹。在这个高维 (d=1024) 特征空间中出现了两个技术挑战：(i) 标准速度预测流匹配崩溃，以及 (ii) 自回归推出受到复合曝光偏差的影响。我们通过干净目标（z 预测）参数化来解决第一个问题，该参数化可产生更高的信噪比，而第二个问题则通过两阶段潜在流量强制课程来解决，该课程逐步在模型自身的部分降噪推出上进行调节。 KITTI、Cityscapes 和 TartanAir 上的实验表明，VGGT-World 在深度预测方面显着优于最强的基线，同时仅使用 0.43B 可训练参数，运行速度提高了 3.6-5 倍，将冻结的 GFM 特征建立为 3D 世界建模的有效且高效的预测状态。
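
The relation between a clean-target (z-prediction) output and the velocity used to integrate a flow can be checked numerically. The sketch below assumes the common linear interpolation path x_t = (1 - t) * noise + t * z, with t = 0 at noise and t = 1 at data; the abstract does not state VGGT-World's exact path or time convention, so the helper names and this convention are illustrative only.

```python
def interpolate(noise, target, t):
    """Point on the assumed linear path between noise (t=0) and clean target (t=1)."""
    return [(1.0 - t) * n + t * z for n, z in zip(noise, target)]


def velocity_from_z(x_t, z_hat, t):
    """Convert a clean-target prediction into a velocity for the linear path.

    Along x_t = (1 - t) * noise + t * z the velocity is z - noise, which equals
    (z - x_t) / (1 - t). The division is undefined at t = 1.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(z - x) / (1.0 - t) for z, x in zip(z_hat, x_t)]


noise = [0.5, -1.0]
target = [2.0, 3.0]
x_t = interpolate(noise, target, t=0.25)
# With a perfect z-prediction the recovered velocity equals target - noise.
print(velocity_from_z(x_t, target, t=0.25))
print([z - n for z, n in zip(target, noise)])
```

Under this convention a model trained to regress the clean target still yields the velocity an ODE sampler needs, so the choice of parameterization changes the regression target rather than the sampling procedure.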

Title: Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Authors: Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12669
Pdf URL: https://arxiv.org/pdf/2603.12669
Copy Paste: [[2603.12669]] Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning(https://arxiv.org/abs/2603.12669)
Keywords: generative
Abstract: With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to prune the component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of the VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at this https URL.
摘要：随着视觉语言模型 (VLM) 的数量和多样性不断增加，许多工作探索跨多个 VLM 的基于语言的集成、协作和路由技术，以改进多模型推理。相比之下，我们使用视觉和语言模式来解决多样化的模型选择。我们引入焦点误差多样性来捕获 VLM 之间的互补推理，并引入基于 CKA 的焦点多样性度量（CKA-focal）来衡量视觉嵌入中的不一致。在从候选 VLM 池构建的集合表面上，我们应用遗传算法来有效地删除那些不会增加融合性能价值的组件 VLM。我们确定了每个任务的最佳组合，并融合了模型池中每个 VLM 的输出，并表明异构模型可以动态捕获认知不确定性并减轻幻觉。我们的 V3Fusion 方法能够生成具有高性能的双焦点分集融合预测，用于视觉语言推理，即使没有达成多数共识或大多数 VLM 做出了错误的预测。大量实验在四个流行的 VLM 基准（A-OKVQA、MMMU、MMMU-Pro 和 OCR-VQA）上验证了 V3Fusion。结果表明，V3Fusion 的准确度比 MMMU 上性能最佳的 VLM 提高了 8.09%，比 MMMU-Pro 提高了 4.87%。对于生成任务，V3Fusion 的性能优于 Intern-VL2-8b 和 Qwen2.5-VL-7b，它们在 A-OKVQA 和 OCR-VQA 上均排名前 2。我们的代码和数据集可在此 https URL 获取。
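
For readers unfamiliar with CKA, the sketch below computes standard linear Centered Kernel Alignment between two embedding matrices that describe the same inputs. It illustrates only the generic similarity measure; the paper's CKA-focal metric is not reproduced here, and reading "disagreement" as `1 - CKA` is an assumption.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every entry (rows are examples)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def gram(matrix):
    """Gram matrix of inner products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def frobenius_inner(a, b):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples.

    x is n-by-p and y is n-by-q; the feature widths p and q may differ.
    Undefined (division by zero) if either representation is constant.
    """
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frobenius_inner(k, l) / math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))


emb_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [4.0, 4.0]]
emb_b = [[2.0 * v for v in row] for row in emb_a]  # same geometry, rescaled
print(linear_cka(emb_a, emb_b))
```

Linear CKA is invariant to isotropic rescaling and to orthogonal transformations of the features, so two encoders with different feature widths can still be compared on a shared set of inputs.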

Title: RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

Authors: Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12685
Pdf URL: https://arxiv.org/pdf/2603.12685
Copy Paste: [[2603.12685]] RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection(https://arxiv.org/abs/2603.12685)
Keywords: generation
Abstract: This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network for RGB-T Salient Object Detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and spatial-aware fusion (SF) module, are designed to generate the guidance maps, which are leveraged to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to optimize the detail information. In addition, the mutual interaction semantic (MIS) module is applied to the high-level features to mine location cues via a mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.
摘要：本文重点研究 RGB 图像和热图像之间显着区域的不一致。为了解决这个问题，我们提出了用于 RGB-T 显着目标检测的区域引导选择性优化网络，该网络由区域引导阶段和显着性生成阶段组成。在区域引导阶段，设计了三个具有相同编码器-解码器结构的并行分支，配备上下文交互（CI）模块和空间感知融合（SF）模块来生成引导图，用于计算相似性分数。然后，在显着性生成阶段，选择性优化（SO）模块根据先前获得的相似性值融合RGB和热特征，以减轻两种模态之间显着目标分布不一致的影响。之后，为了生成高质量的检测结果，采用多个密集连接和视觉状态空间块的密集细节增强（DDE）模块应用于低级特征以优化细节信息。此外，在高层特征中放置了相互交互语义（MIS）模块，通过相互融合策略来挖掘位置线索。我们对 RGB-T 数据集进行了广泛的实验，结果表明，所提出的 RSONet 与 27 种最先进的 SOD 方法相比，实现了具有竞争力的性能。

Title: Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Authors: Donglin Yu
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2603.12707
Pdf URL: https://arxiv.org/pdf/2603.12707
Copy Paste: [[2603.12707]] Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity(https://arxiv.org/abs/2603.12707)
Keywords: generation
Abstract: Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV caches under stage-level disaggregation) to $O(N_v \cdot d)$ bytes (MB-scale embeddings), an $O(L)$ reduction where $L$ is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4xA100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster (\$38k) improves Tokens/\$ by 37% over a homogeneous baseline (\$64k) without degrading latency.
摘要：多模态大语言模型 (MLLM) 推理分为两个阶段，硬件需求相反：视觉编码受计算限制，而语言生成受内存带宽限制。我们证明，在标准转换器 KV 缓存下，模态边界（视觉编码器和语言模型之间）最大限度地减少了所有分区点之间的跨设备传输，从而保留了基于标准阶段的执行。这里的分区将传输复杂度从 $O(L * s_ctx)$ 字节（阶段级分解下的 GB 级 KV 缓存）降低到 $O(N_v * d)$ 字节（MB 级嵌入），减少了 O(L)，其中 L 是变压器深度。结果适用于注意力机制（MHA/GQA）、动态视觉分辨率和模型尺度，并且随着模型的加深，优势也会增强。直接的含义是，现有的级级分解系统仅限于高带宽互连（例如 NVLink），而模态级分解可通过商用 PCIe 实现跨层异构服务。封闭式成本模型表明，异构部署在可分阶段工作负载下是成本最优的（预测节省 31.4%；观察到节省 40.6%）。我们构建了 HeteroServe，一个具有模态级分区和跨层调度的阶段感知运行时，并在 LLaVA-1.5-7B 和 Qwen2.5-VL 上针对 vLLM v0.3.0 对其进行评估。在相同的 4xA100 硬件上，引擎优化可将吞吐量提高高达 54%。在固定预算下，异构集群 ($38k) 比同质基线 (\$64k) 提高了 37%，且不会降低延迟。
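
A back-of-the-envelope calculation shows why the two partition points differ by orders of magnitude in transfer size. The shapes below are illustrative assumptions rather than figures from the abstract: a 7B-class decoder with 32 layers and 32 KV heads of dimension 128 (hidden size 4096), 576 vision tokens per image, a 2048-token context, and fp16 values. The helper names are hypothetical.

```python
def kv_cache_bytes(num_layers, context_tokens, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: keys and values for every layer and every context token."""
    return 2 * num_layers * context_tokens * num_kv_heads * head_dim * bytes_per_value


def vision_embedding_bytes(num_vision_tokens, hidden_dim, bytes_per_value=2):
    """Size of the projected vision embeddings handed to the language model."""
    return num_vision_tokens * hidden_dim * bytes_per_value


# Assumed shapes (see the paragraph above); fp16 means 2 bytes per value.
kv = kv_cache_bytes(num_layers=32, context_tokens=2048, num_kv_heads=32, head_dim=128)
emb = vision_embedding_bytes(num_vision_tokens=576, hidden_dim=4096)

print(kv / 2**30, "GiB")   # prints: 1.0 GiB
print(emb / 2**20, "MiB")  # prints: 4.5 MiB
print(round(kv / emb, 1), "x smaller transfer at the modality boundary")
```

The KV cache grows with both depth and context length, while the embedding transfer depends only on the number of vision tokens and the hidden size, which is consistent with the abstract's claim that the advantage grows as models deepen. With grouped-query attention `num_kv_heads` would be smaller, shrinking the KV cache but not changing the scaling.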

Title: SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design

Authors: David van Dijk, Ivan Vrkic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.12724
Pdf URL: https://arxiv.org/pdf/2603.12724
Copy Paste: [[2603.12724]] SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design(https://arxiv.org/abs/2603.12724)
Keywords: generation
Abstract: Many of the most important problems in science and engineering are inverse problems: given a desired outcome, find a design that achieves it. Evaluating whether a candidate meets the spec is often routine; a binding energy can be computed, a reactor yield simulated, a pharmacokinetic profile predicted. But searching a combinatorial design space for inputs that satisfy those targets is fundamentally harder. We introduce SciDesignBench, a benchmark of 520 simulator-grounded tasks across 14 scientific domains and five settings spanning single-shot design, short-horizon feedback, long-horizon refinement, and seed-design optimization. On the 10-domain shared-core subset, the best zero-shot model reaches only 29.0% success despite substantially higher parse rates. Simulator feedback helps, but the leaderboard changes with horizon: Sonnet 4.5 is strongest in one-turn de novo design, whereas Opus 4.6 is strongest after 20 turns of simulator-grounded refinement. Providing a starting seed design reshuffles the leaderboard again, demonstrating that constrained modification requires a fundamentally different capability from unconstrained de novo generation. We then introduce RLSF, a simulator-feedback training recipe. An RLSF-tuned 8B model raises single-turn success rates by 8-17 percentage points across three domains. Together, these results position simulator-grounded inverse design as both a benchmark for scientific reasoning and a practical substrate for amortizing expensive test-time compute into model weights.
摘要：科学和工程中许多最重要的问题都是逆问题：给定期望的结果，找到实现它的设计。评估候选人是否符合规范通常是例行公事；可以计算结合能、模拟反应器产量、预测药代动力学曲线。但在组合设计空间中搜索满足这些目标的输入从根本上来说更加困难。我们推出了 SciDesignBench，它是跨 14 个科学领域的 520 个基于模拟器的任务的基准，以及涵盖单次设计、短视野反馈、长视野细化和种子设计优化的五种设置。在 10 域共享核心子集中，最佳零样本模型仅达到 29.0% 的成功率，尽管解析率要高得多。模拟器反馈会有所帮助，但排行榜会随着时间的推移而变化：Sonnet 4.5 在一回合从头设计中最强，而 Opus 4.6 在基于模拟器的 20 回合改进后最强。提供起始种子设计将再次重新洗牌排行榜，这表明受约束的修改需要与不受约束的从头生成完全不同的能力。然后我们介绍 RLSF，一种模拟器反馈训练方法。经过 RLSF 调整的 8B 模型将三个域的单转成功率提高了 8-17 个百分点。总之，这些结果将基于模拟器的逆向设计定位为科学推理的基准和将昂贵的测试时间计算分摊到模型权重的实用基础。

Title: MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

Authors: Chenyang Zhu, Hongxiang Li, Xiu Li, Long Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.12743
Pdf URL: https://arxiv.org/pdf/2603.12743
Copy Paste: [[2603.12743]] MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization(https://arxiv.org/abs/2603.12743)
Keywords: generation
Abstract: Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.
摘要：概念定制通常将稀有标记绑定到目标概念。不幸的是，这些方法通常会出现性能不稳定的问题，因为预训练数据很少包含这些稀有标记。同时，这些稀有标记无法传达目标概念的固有知识。因此，我们引入了知识感知概念定制，这是一项旨在将不同文本知识与目标视觉概念结合起来的新颖任务。该任务需要模型识别文本提示中的知识来执行高保真定制生成。同时，模型应有效地将所有文本知识与目标概念绑定。因此，我们提出了 MoKus，一种用于知识感知概念定制的新颖框架。我们的框架依赖于一个关键的观察：跨模态知识转移，其中在文本模态中修改知识在生成过程中自然地转移到视觉模态。受这一观察的启发，MoKus 包含两个阶段：（1）在视觉概念学习中，我们首先学习锚表示来存储目标概念的视觉信息。（2）在文本知识更新中，我们将知识查询的答案更新为锚表示，从而实现高保真定制生成。为了进一步全面评估我们在新任务上提出的 MoKus，我们引入了第一个知识感知概念定制基准：KnowCusBench。广泛的评估表明 MoKus 的性能优于最先进的方法。此外，跨模型知识转移使 MoKus 可以轻松扩展到其他知识感知应用程序，例如虚拟概念创建和概念擦除。我们还展示了我们的方法实现世界知识基准改进的能力。

Title: SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Authors: Zheng Gao, Yifan Yang, Xiaoyu Li, Xiaoyan Feng, Haoran Fan, Yang Song, Jiaojiao Jiang
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12749
Pdf URL: https://arxiv.org/pdf/2603.12749
Copy Paste: [[2603.12749]] SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking(https://arxiv.org/abs/2603.12749)
Keywords: generation
Abstract: Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.
摘要：对扩散模型的初始噪声进行水印已成为一种有前景的图像来源方法，但可以通过反转和再生攻击来伪造与内容无关的噪声模式。最近的语义感知水印方法通过对图像语义进行条件验证来提高鲁棒性。然而，它们对单一全局语义绑定的依赖使它们容易受到本地化但全局一致的语义编辑的影响。为了解决这个限制并提供值得信赖的语义感知水印，我们提出 Semantic Latent Injection via Compartmentalized Embedding (SLICE)。我们的框架将图像语义解耦为四个语义因素（主题、环境、动作和细节），并将它们精确地锚定到初始高斯噪声中的不同区域。这种细粒度的语义绑定支持高级水印验证，其中语义篡改是可检测和可本地化的。我们从理论上证明了为什么 SLICE 能够实现稳健可靠的篡改定位，并提供错误接受率的统计保证。实验结果表明，针对高级语义引导再生攻击，SLICE 的性能显着优于现有基线，在保持图像质量和语义保真度的同时大大降低了攻击成功率。总体而言，SLICE 提供了一种实用的、无需培训的起源解决方案，该解决方案既具有细粒度的诊断功能，又对现实的对抗性操作具有鲁棒性。

Title: Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Authors: Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou, Jinjiang Li
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2603.12773
Pdf URL: https://arxiv.org/pdf/2603.12773
Copy Paste: [[2603.12773]] Empowering Semantic-Sensitive Underwater Image Enhancement with VLM(https://arxiv.org/abs/2603.12773)
Keywords: restoration
Abstract: In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. Concretely, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.
摘要：近年来，基于学习的水下图像增强（UIE）技术迅速发展。然而，高质量增强输出和自然图像之间的分布变化可能会阻碍下游视觉任务的语义线索提取，从而限制现有增强模型的适应性。为了应对这一挑战，这项工作提出了一种新的学习机制，利用视觉语言模型 (VLM) 为 UIE 模型提供语义敏感的功能。具体来说，我们的策略首先通过 VLM 从退化图像中生成关键对象的文本描述。随后，文本-图像对齐模型将这些相关描述重新映射回图像上，以生成空间语义引导图。然后，该图通过结合了交叉注意力和显式对齐损失的双重引导机制来引导 UIE 网络。这迫使网络在图像重建过程中将其恢复能力集中在语义敏感区域，而不是追求全局统一的改进，从而确保关键对象特征的忠实恢复。实验证实，当我们的策略应用于不同的 UIE 基线时，显着提高了它们在感知质量指标上的性能，并增强了它们在检测和分割任务上的性能，验证了其有效性和适应性。

Title: Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12793
Pdf URL: https://arxiv.org/pdf/2603.12793
Copy Paste: [[2603.12793]] Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation(https://arxiv.org/abs/2603.12793)
Keywords: generation
Abstract: A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
摘要：多模态建模的最新前沿主题是在单个模型中统一视觉理解和生成。然而，这两个任务需要不匹配的解码机制和视觉表示，使得在共享特征空间内联合优化变得非常重要。在这项工作中，我们提出了 Cheers，一种统一的多模态模型，它将补丁级细节与语义表示解耦，从而稳定多模态理解的语义，并通过门控细节残差提高图像生成的保真度。 Cheers 包括三个关键组件：(i) 一个统一的视觉分词器，将图像潜在状态编码并压缩为语义标记，以实现高效的 LLM 调节；(ii) 一个基于 LLM 的 Transformer，统一用于文本生成的自回归解码和用于图像生成的扩散解码；(iii) 一个级联流匹配头，首先解码视觉语义，然后从视觉分词器注入语义门控细节残差，以细化高频内容。对流行基准的实验表明，Cheers 在视觉理解和生成方面都匹配或超过了先进的 UMM。 Cheers 还实现了 4 倍令牌压缩，从而实现更高效的高分辨率图像编码和生成。值得注意的是，Cheers 在流行的基准 GenEval 和 MMBench 上优于 Tar-1.5B，同时只需要 20% 的训练成本，这表明统一多模态建模有效且高效（即 4 倍令牌压缩）。我们将发布所有代码和数据以供未来研究。

Title: OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

Authors: Shijie Zhao, Xuanyu Zhang, Bin Chen, Weiqi Li, Qunliang Xing, Kexin Zhang, Yan Wang, Junlin Li, Li Zhang, Jian Zhang, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12811
Pdf URL: https://arxiv.org/pdf/2603.12811
Copy Paste: [[2603.12811]] OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution(https://arxiv.org/abs/2603.12811)
Keywords: super-resolution, generative
Abstract: Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception-fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.
摘要：由于感知保真度权衡和多样化、未知的退化，将生成的现实世界图像超分辨率模型与人类视觉偏好保持一致具有挑战性。先前的方法依赖于离线偏好优化和静态度量聚合，这些方法通常是不可解释的，并且在强条件下容易出现伪多样性。我们提出了 OARS，一种基于 COMPASS 的流程感知在线对齐框架，COMPASS 是一种基于 MLLM 的奖励，通过对保真度保存和感知增益与输入质量自适应权衡联合建模来评估 LR 到 SR 的过渡。为了训练 COMPASS，我们策划了涵盖合成和真实降级的 COMPASS-20K，并引入了一个三阶段感知注释管道，可生成经过校准的细粒度训练标签。在 COMPASS 的指导下，OARS 通过浅层 LoRA 优化执行渐进式在线对齐，从冷启动流匹配到全参考，最后是无参考强化学习，以进行策略探索。大量的实验和用户研究表明，在保持保真度的同时，可以实现一致的感知改进，在 Real-ISR 基准上实现最先进的性能。

Title: coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Authors: Chunhan Li, Qifeng Wu, Jia-Hui Pan, Ka-Hei Hui, Jingyu Hu, Yuming Jiang, Bin Sheng, Xihui Liu, Wenjuan Gong, Zhengzhe Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12829
Pdf URL: https://arxiv.org/pdf/2603.12829
Copy Paste: [[2603.12829]] coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation(https://arxiv.org/abs/2603.12829)
Keywords: generation
Abstract: Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG-Bench demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.
摘要：文本到图像的生成技术发展迅速，但现有模型仍然难以忠实地组合多个对象并在复杂场景中保留其属性。我们提出了 coDrawAgents，这是一个交互式多智能体对话框架，具有四个专业智能体：解释器、规划器、检查器和画家，它们协作改进构图生成。解释器自适应地在直接文本到图像路径和布局感知多代理进程之间做出决定。在布局感知模式下，它将提示解析为属性丰富的对象描述符，按语义显着性对它们进行排序，并将具有相同语义优先级的对象分组以进行联合生成。在解释器的指导下，规划器采用分而治之的策略，逐步为具有相同语义优先级的对象提出布局，同时根据画布不断变化的视觉上下文做出决策。 Checker 通过验证空间一致性和属性对齐，并在渲染之前优化布局，引入了显式纠错机制。最后，Painter 逐步合成图像，将新规划的对象合并到画布中，为后续迭代提供更丰富的上下文。这些代理共同解决了三个关键挑战：降低布局复杂性、在视觉环境中进行规划以及实现显式错误纠正。 GenEval 和 DPG-Bench 基准测试的大量实验表明，与现有方法相比，coDrawAgents 显着改善了文本图像对齐、空间精度和属性绑定。

Title: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

Authors: Yifan Zhan, Zhengqing Chen, Qingjie Wang, Zhuo He, Muyao Niu, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12864
Pdf URL: https://arxiv.org/pdf/2603.12864
Copy Paste: [[2603.12864]] Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation(https://arxiv.org/abs/2603.12864)
Keywords: generation, generative
Abstract: A major challenge in autonomous driving is the "long tail" of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis: systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average collision rate of 3s increases by 173%.
摘要：自动驾驶的一个主要挑战是安全关键边缘情况的“长尾”，这些边缘情况通常是由常见交通元素的不寻常组合产生的。综合这些场景至关重要，但当前的可控生成模型提供了不完整或纠缠的指导，阻碍了场景结构、物体身份和自我行为的独立操纵。我们推出了 CompoSIA，这是一种组合驾驶视频模拟器，可以理清这些交通因素，从而能够对各种对抗性驾驶场景进行细粒度控制。为了支持场景元素的可控身份替换，我们提出了一种噪声级身份注入，允许跨不同元素姿势生成与姿势无关的身份，所有这些都来自单个参考图像。此外，引入分层双分支动作控制机制，提高动作可控性。这种解开的控制使得对抗性场景合成成为可能——系统地将安全元素组合成纠缠发生器无法产生的危险配置。广泛的比较表明，与最先进的基线相比，其可控生成质量更高，身份编辑的 FVD 提高了 17%，动作控制的旋转和平移错误分别减少了 30% 和 47%。此外，下游压力测试揭示了严重的规划器失败：在编辑模式中，3 秒的平均冲突率增加了 173%。

Title: Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Authors: Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.12996
Pdf URL: https://arxiv.org/pdf/2603.12996
Copy Paste: [[2603.12996]] Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs(https://arxiv.org/abs/2603.12996)
Keywords: generation
Abstract: Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.
摘要：扩散 LLM (dLLM) 的并行解码很困难，因为每个去噪步骤仅提供令牌方式的边际分布，而同时揭开多个令牌需要考虑令牌间依赖性。我们提出了依赖感知并行解码（DAPD），这是一种简单的、免训练的解码方法，它使用自注意力来诱导屏蔽标记上的条件依赖图。在每次迭代中，该图中的边捕获强标记交互，而非边则表示弱依赖性。然后，并行解码被简化为选择图上的独立集合并并行地揭开所选标记的掩码。这避免了在没有辅助模型或重新训练的情况下共同更新强耦合令牌。 LLaDA 和 Dream 上的实验表明，DAPD 改进了现有方法的精度步骤权衡，并实现了更全局分布的并行更新，从而更好地利用 dLLM 的任意顺序生成能力。
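The selection step described in the DAPD abstract (build a dependency graph over masked tokens from self-attention, then unmask an independent set in parallel) can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold `tau`, the symmetrization by `max`, and the greedy visiting order by confidence are all assumptions made for illustration, and `select_independent_set` is a hypothetical name.

```python
def select_independent_set(attn, confidence, tau):
    """Greedily pick masked positions that are pairwise weakly dependent.

    attn[i][j]    : attention weight from masked position i to masked position j
    confidence[i] : model confidence in its top prediction at position i
    tau           : hypothetical edge threshold (an assumption, not from the paper)
    """
    n = len(confidence)

    def coupled(i, j):
        # Assumed edge rule: i and j are linked if either attends strongly to the other.
        return max(attn[i][j], attn[j][i]) > tau

    # Visit positions from most to least confident (assumed ordering).
    order = sorted(range(n), key=lambda i: -confidence[i])
    chosen = []
    for i in order:
        # Keep i only if it shares no edge with any already chosen position.
        if all(not coupled(i, j) for j in chosen):
            chosen.append(i)
    return sorted(chosen)


# Toy example: positions 0 and 1 attend strongly to each other, position 2 is isolated.
attn = [
    [0.0, 0.9, 0.1],
    [0.8, 0.0, 0.05],
    [0.1, 0.05, 0.0],
]
to_unmask = select_independent_set(attn, confidence=[0.7, 0.9, 0.6], tau=0.5)
```

In the toy example the strongly coupled pair is never unmasked in the same step, which is the property the abstract attributes to decoding on an independent set.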

Title: A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Authors: Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.12998
Pdf URL: https://arxiv.org/pdf/2603.12998
Copy Paste: [[2603.12998]] A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks(https://arxiv.org/abs/2603.12998)
Keywords: generation
Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.
摘要：虽然视觉语言模型 (VLM) 在各种下游任务中取得了显着的性能，但最近的研究表明，它们可以从训练数据中继承社会偏见，并将其进一步传播到下游应用程序中。为了解决这个问题，人们提出了各种去偏差方法，但大多数方法都是为了提高公平性，而没有从理论上保证模型的实用性得到保留。在本文中，我们引入了一种去偏方法，该方法在跨模态空间中产生 \textbf{闭合形式} 解决方案，从而在 \textbf{有界效用损失} 的情况下实现帕累托最优公平性。我们的方法是 \textbf{无训练}，需要 \textbf{无注释数据}，并且可以联合消除下游任务中的视觉和文本模式的偏差。大量实验表明，在零样本图像分类、文本到图像检索和文本到图像生成等下游任务中，我们的方法在跨不同公平性指标和数据集的组公平性和交叉公平性上消除 VLM 偏差方面优于现有方法，同时保持任务性能。

Title: SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

Authors: Sampath Rapuri, Lalithkumar Seenivasan, Dominik Schneider, Roger Soberanis-Mukul, Yufan He, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Pengfei Guo, Daguang Xu, Mathias Unberath
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2603.13024
Pdf URL: https://arxiv.org/pdf/2603.13024
Copy Paste: [[2603.13024]] SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation(https://arxiv.org/abs/2603.13024)
Keywords: generation
Abstract: A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) -- a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.
摘要：能够生成逼真的手术动作视频并精确控制工具与组织相互作用的手术世界模型可以解决手术人工智能和模拟中的基本挑战——从数据稀缺和罕见事件合成到缩小手术自动化的模拟与真实差距。然而，当前的视频生成方法（此类手术世界模型的核心）需要昂贵的注释或复杂的结构化中间体作为推理时的条件信号，限制了其可扩展性。其他方法在复杂的腹腔镜场景中表现出有限的时间一致性，并且不具备足够的真实性。我们提出手术动作世界（SAW）——通过以四个轻量级信号为条件的视频扩散实现手术动作世界建模的一步：语言提示编码工具动作上下文、参考手术场景、组织可供性掩模和 2D 工具提示轨迹。我们设计了一种条件视频扩散方法，将视频到视频的扩散重新表述为轨迹条件手术动作合成。主干扩散模型在包含 12,044 个腹腔镜剪辑的定制数据集上进行微调，具有轻量级时空条件信号，利用深度一致性损失来增强几何合理性，而无需推理深度。 SAW 在保留的测试数据上实现了最先进的时间一致性（CD-FVD：199.19 与 546.82）和强大的视觉质量。此外，我们还展示了其在以下方面的下游实用性：(a) 手术人工智能，其中使用 SAW 生成的视频增强罕见动作，从而提高了对真实测试数据的动作识别（剪切 F1 分数：20.93% 至 43.14%；剪切：0.00% 至 8.33%）；以及 (b) 手术模拟，其中从模拟器导出的轨迹渲染工具与组织交互视频，指向视觉上忠实的模拟引擎。

Title: ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2603.13033
Pdf URL: https://arxiv.org/pdf/2603.13033
Copy Paste: [[2603.13033]] ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models(https://arxiv.org/abs/2603.13033)
Keywords: generative
Abstract: A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.
摘要：视觉语言模型（VLM）的最新趋势是增强其对具体领域的空间认知。尽管取得了进展，但现有的评估在范式和覆盖范围上都受到限制，阻碍了模型的快速迭代开发。为了解决这些限制，我们提出了 ESPIRE，一种体现空间推理的诊断基准。 ESPIRE 提供了一个模拟世界，该世界以 VLM 为物理基础，并在以空间推理为中心的机器人任务上对其进行评估，从而缩小了评估与现实世界部署之间的差距。为了使 VLM 适应机器人任务，我们将每个任务分解为定位和执行，并将两者都视为生成问题，这与依赖干扰因素并放弃执行的主要判别性评估（例如，通过视觉问答）形成鲜明对比。这种分解进一步实现了从被动空间推理到行动推理的细粒度分析。我们在指令层面和环境层面系统地设计了ESPIRE，确保了空间推理场景的广泛覆盖。我们使用 ESPIRE 来诊断一系列前沿 VLM，并对其空间推理行为进行深入分析。

Title: 3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting

Authors: Jun Liu, Xiaohui Zhong, Kai Zheng, Jiarui Li, Yifei Li, Tao Zhou, Wenxu Qian, Shun Dai, Ruian Tie, Yangyang Zhao, Hao Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.13049
Pdf URL: https://arxiv.org/pdf/2603.13049
Copy Paste: [[2603.13049]] 3DTCR: A Physics-Based Generative Framework for Vortex-Following 3D Reconstruction to Improve Tropical Cyclone Intensity Forecasting(https://arxiv.org/abs/2603.13049)
Keywords: generative
Abstract: Tropical cyclone (TC) intensity forecasting remains challenging as current numerical and AI-based weather models fail to satisfactorily represent extreme TC structure and intensity. Although intensity time-series forecasting has achieved significant advances, it outputs intensity sequences rather than the three-dimensional inner-core fine-scale structure and physical mechanisms governing TC evolution. High-resolution numerical simulations can capture these features but remain computationally expensive and inefficient for large-scale operational applications. Here we present 3DTCR, a physics-based generative framework combining physical constraints with generative AI efficiency for 3D TC structure reconstruction. Trained on a six-year, 3-km-resolution moving-domain WRF dataset, 3DTCR enables region-adaptive vortex-following reconstruction using conditional Flow Matching (CFM), optimized via latent domain adaptation and two-stage transfer learning. The framework mitigates limitations imposed by low-resolution targets and over-smoothed forecasts, improving the representation of TC inner-core structure and intensity while maintaining track stability. Results demonstrate that 3DTCR outperforms the ECMWF high-resolution forecasting system (ECMWF-HRES) in TC intensity prediction at nearly all lead times up to 5 days and reduces the RMSE of maximum WS10M by 36.5% relative to its FuXi inputs. These findings highlight 3DTCR as a physics-based generative framework that efficiently resolves fine-scale structures at lower computational cost, which may offer a promising avenue for improving TC intensity forecasting.
摘要：热带气旋 (TC) 强度预报仍然具有挑战性，因为当前的数值和人工智能天气模型无法令人满意地代表极端的 TC 结构和强度。尽管强度时间序列预报取得了重大进展，但它输出的是强度序列，而不是三维内核精细尺度结构和控制TC演化的物理机制。高分辨率数值模拟可以捕获这些特征，但对于大规模操作应用来说，计算成本仍然很高且效率低下。在这里，我们提出了 3DTCR，一种基于物理的生成框架，将物理约束与生成 AI 效率相结合，用于 3D TC 结构重建。 3DTCR 在六年、3 公里分辨率的移动域 WRF 数据集上进行训练，可使用条件流匹配 (CFM) 实现区域自适应涡流跟踪重建，并通过潜在域自适应和两阶段迁移学习进行优化。该框架减轻了低分辨率目标和过度平滑预报带来的限制，改善了热带气旋内核结构和强度的表征，同时保持了轨迹稳定性。结果表明，3DTCR 在长达 5 天的几乎所有提前期的 TC 强度预测中均优于 ECMWF 高分辨率预报系统 (ECMWF-HRES)，并且相对于 FuXi 输入，最大 WS10M 的 RMSE 降低了 36.5%。这些发现凸显了 3DTCR 作为一种基于物理的生成框架，能够以较低的计算成本有效地解析精细尺度结构，这可能为改进 TC 强度预测提供有希望的途径。

Title: Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Authors: Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13054
Pdf URL: https://arxiv.org/pdf/2603.13054
Copy Paste: [[2603.13054]] Topo-R1: Detecting Topological Anomalies via Vision-Language Models(https://arxiv.org/abs/2603.13054)
Keywords: quality assessment
Abstract: Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
摘要：拓扑正确性对于血管、神经纤维和道路网络等管状结构至关重要。现有的拓扑保留方法依赖于特定于域的基本事实，这是昂贵的并且很少跨域传输。当部署到没有注释的新领域时，出现了一个关键问题：我们如何在没有地面实况监督的情况下检测拓扑异常？我们将其重新定义为拓扑异常检测，这是一种结构化视觉推理任务，需要模型对预测分割掩模中的拓扑错误进行定位和分类。视觉语言模型（VLM）是自然的候选者；然而，我们发现最先进的 VLM 几乎是随机执行的，缺乏识别密集结构中稀疏连接错误所需的细粒度、拓扑感知的感知能力。为了弥补这一差距，我们开发了一个自动化数据管理管道，该管道可以在逐渐困难的级别上综合不同的拓扑异常和可验证的注释，从而为该任务构建第一个大规模、多领域的基准。然后，我们介绍 Topo-R1，这是一个框架，通过两阶段训练赋予 VLM 拓扑感知感知：监督微调，然后使用组相对策略优化 (GRPO) 进行强化学习。我们方法的核心是拓扑感知复合奖励，它集成了用于结构化错误分类的类型感知匈牙利匹配、空间定位评分和直接惩罚连接中断的中心线骰子（clDice）奖励，从而共同激励语义精度和结构保真度。大量实验表明，Topo-R1 建立了无注释拓扑质量评估的新范例，在所有评估协议中始终优于通用 VLM 和监督基线。
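The reinforcement-learning stage of Topo-R1 uses Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several responses sampled for the same prompt by their group mean and standard deviation. The reward values are made up, the function name is hypothetical, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ); the abstract's composite reward (Hungarian matching, localization score, clDice) is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards : scalar rewards for responses sampled from the same prompt
    eps     : small constant to avoid division by zero when all rewards are equal
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical composite rewards for three sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form the baseline.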

Title: Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Authors: Yuki Hirakawa, Takashi Wada, Ryotaro Shimizu, Takuya Furusawa, Yuki Saito, Ryosuke Araki, Tianwei Chen, Fan Mo, Yoshimitsu Aoki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13057
Pdf URL: https://arxiv.org/pdf/2603.13057
Copy Paste: [[2603.13057]] Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback(https://arxiv.org/abs/2603.13057)
Keywords: quality assessment
Abstract: Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.
摘要：给定人物图像和服装图像，基于图像的虚拟试穿（VTON）会合成穿着目标服装的人的试穿图像。随着 VTON 系统在时尚电子商务等实际应用中变得越来越重要，对其输出的可靠评估已成为一项关键挑战。在现实场景中，同一个人穿着目标服装的真实图像通常无法获得，这使得基于参考的评估不切实际。此外，广泛使用的分布级指标（例如 Fréchet Inception Distance 和 Kernel Inception Distance）测量数据集级相似性，但无法反映单个生成图像的感知质量。为了解决这些限制，我们提出了虚拟试穿图像质量评估（VTON-IQA），这是一种无需参考真实图像的人性化、图像级质量评估的无参考框架。为了模拟人类感知判断，我们构建了 VTON-QBench，这是一个大规模的人工注释基准，包含由 14 个代表性 VTON 模型生成的 62,688 张试穿图像以及从 13,838 名合格注释者收集的 431,800 个质量注释。据我们所知，这是迄今为止虚拟试穿中人类主观评估的最大数据集。评估虚拟试穿质量需要验证服装保真度和个人特定细节的保留。为了明确地建模此类交互，我们引入了一个 Interleaved Cross-Attention 模块，该模块通过在后面的块中的 self-attention 和 MLP 之间插入交叉注意层来扩展标准 Transformer 块。大量实验表明，VTON-IQA 实现了可靠的人体对齐图像级质量预测。此外，我们使用VTON-IQA对14个代表性VTON模型进行了全面的基准评估。

Title: GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration

Authors: Yihao Ding, Yiran Zhang, Chris Gonzalez, Eun-Jung Holden, Wei Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13068
Pdf URL: https://arxiv.org/pdf/2603.13068
Copy Paste: [[2603.13068]] GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration(https://arxiv.org/abs/2603.13068)
Keywords: generative
Abstract: Geochemical anomaly detection plays a critical role in mineral exploration as deviations from regional geochemical baselines may indicate mineralization. Existing studies suffer from two key limitations: (1) single-region scenarios, which limit model generalizability; (2) proprietary datasets, which make result reproduction unattainable. In this work, we introduce GeoChemAD, an open-source benchmark dataset compiled from government-led geological surveys, covering multiple regions, sampling sources, and target elements. The dataset comprises eight subsets representing diverse spatial scales and sampling conditions. To establish strong baselines, we reproduce and benchmark a range of unsupervised anomaly detection methods, including statistical models, generative and transformer-based approaches. Furthermore, we propose GeoChemFormer, a transformer-based framework that leverages self-supervised pretraining to learn target-element-aware geochemical representations for spatial samples. Extensive experiments demonstrate that GeoChemFormer consistently achieves superior and robust performance across all eight subsets, outperforming existing unsupervised methods in both anomaly detection accuracy and generalization capability. The proposed dataset and framework provide a foundation for reproducible research and future development in this direction.
摘要：地球化学异常检测在矿产勘探中起着至关重要的作用，因为与区域地球化学基线的偏差可能表明有矿化。现有研究存在两个主要局限性：（1）单一区域场景限制了模型的普遍性； (2) 专有数据集，导致结果再现无法实现。在这项工作中，我们引入了 \textbf{GeoChemAD}，这是一个由政府主导的地质调查编译而成的开源基准数据集，涵盖多个区域、采样源和目标元素。该数据集包含代表不同空间尺度和采样条件的八个子集。为了建立强大的基线，我们重现并基准化了一系列无监督异常检测方法，包括统计模型、生成和基于变压器的方法。此外，我们提出了 \textbf{GeoChemFormer}，一个基于变压器的框架，利用自监督预训练来学习空间样本的目标元素感知地球化学表示。大量实验表明，GeoChemFormer 在所有八个子集中始终实现卓越且稳健的性能，在异常检测精度和泛化能力方面均优于现有的无监督方法。所提出的数据集和框架为该方向的可重复研究和未来发展奠定了基础。

Title: Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

Authors: Yihang Zhou, Chao Lin, Hideki Kikumoto, Ryozo Ooka, Sibo Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13077
Pdf URL: https://arxiv.org/pdf/2603.13077
Copy Paste: [[2603.13077]] Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods(https://arxiv.org/abs/2603.13077)
Keywords: generative
Abstract: Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
摘要：实时屋顶风速分布对于无人机和城市空中交通系统、风控系统和屋顶利用的安全运行具有重要意义。然而，屋顶流表现出很强的非线性、分离性和横向变化性，这使得稀疏传感器的流场重建变得困难。本研究使用粒子图像测速 (PIV) 获得的风洞实验数据开发了一个从观察中学习的框架，并将克里金插值与三种深度学习模型进行了比较：UNet、视觉变换器自动编码器 (ViTAE) 和条件 Wasserstein GAN (CWGAN)。我们评估两种训练策略，即单风向训练 (SDT) 和混合风向训练 (MDT)，传感器密度从 5 到 30，测试传感器位置扰动正负 1 网格下的鲁棒性，并通过适当正交分解和 QR 分解来优化传感器放置。结果表明，深度学习方法可以有效地从稀疏传感器数据中重建屋顶风场。与Kriging插值相比，深度学习模型将SSIM提高了32.7%，FAC2提高了24.2%，NMSE提高了27.8%。混合风向训练进一步提高了性能，与单向训练相比，SSIM 提升高达 173.7%，FAC2 提升 16.7%，MG 提升 98.3%。结果还表明，为了实现可靠部署，应共同考虑传感器配置、优化和训练策略。基于 QR 的优化在传感器扰动下将鲁棒性提高了 27.8%，尽管存在与度量相关的权衡。基于实验数据而非模拟数据的训练还为不同场景中的方法选择和传感器放置提供了实用指导。
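The abstract mentions optimizing sensor placement via Proper Orthogonal Decomposition with QR decomposition but gives no details. One common reading is column-pivoted QR applied to the POD mode matrix, where the pivot order ranks candidate locations. The sketch below assumes that reading, assumes the POD modes are already computed, and implements the pivoting with plain Gram-Schmidt so that it needs no external libraries; `qr_pivot_sensors` and the toy mode values are hypothetical.

```python
import math


def qr_pivot_sensors(modes, n_sensors):
    """Rank candidate sensor locations by column-pivoted Gram-Schmidt.

    modes[k][j] : value of POD mode k at candidate location j (r modes x n locations)
    n_sensors   : number of locations to select (at most r are meaningful)
    """
    r, n = len(modes), len(modes[0])
    # Column j collects the r mode values observed at location j.
    cols = [[modes[k][j] for k in range(r)] for j in range(n)]
    chosen = []
    for _ in range(min(n_sensors, r)):
        # Pivot rule: take the unchosen column with the largest residual norm.
        norms = [
            0.0 if j in chosen else math.sqrt(sum(v * v for v in cols[j]))
            for j in range(n)
        ]
        p = max(range(n), key=lambda j: norms[j])
        if norms[p] == 0.0:
            break  # remaining columns add no new information
        chosen.append(p)
        q = [v / norms[p] for v in cols[p]]
        # Remove the direction q from every remaining column.
        for j in range(n):
            if j in chosen:
                continue
            dot = sum(a * b for a, b in zip(q, cols[j]))
            cols[j] = [a - dot * b for a, b in zip(cols[j], q)]
    return chosen


# Toy example: 2 modes sampled at 4 candidate locations.
toy_modes = [
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 0.8, 0.0, 0.1],
]
sensors = qr_pivot_sensors(toy_modes, n_sensors=2)
```

Each step picks the location whose mode signature is least explained by the locations already chosen, which is why near-duplicate locations (such as location 2 in the toy data) are passed over.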

Title: V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Authors: Shenghe Zheng, Junpeng Jiang, Wenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13089
Pdf URL: https://arxiv.org/pdf/2603.13089
Copy Paste: [[2603.13089]] V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration(https://arxiv.org/abs/2603.13089)
Keywords: restoration, generative
Abstract: Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of what existing restoration methods use), pretrained video models can be induced to perform competitive image restoration, handling multiple tasks with a single model and rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.

Title: FDeID-Toolbox: Face De-Identification Toolbox

Authors: Hui Wei, Hao Yu, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13121
Pdf URL: https://arxiv.org/pdf/2603.13121
Copy Paste: [[2603.13121]] FDeID-Toolbox: Face De-Identification Toolbox(https://arxiv.org/abs/2603.13121)
Keywords: generative
Abstract: Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.

Title: Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Authors: Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.13185
Pdf URL: https://arxiv.org/pdf/2603.13185
Copy Paste: [[2603.13185]] Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos(https://arxiv.org/abs/2603.13185)
Keywords: generation
Abstract: Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations, including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design a suite of Graph RAG-based approaches to evaluate strong open-source Vision-Language Models on the WSGG task, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
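The "zero-order feature buffer" attributed to PWG is not detailed in the abstract. One plausible reading, sketched below purely as an assumption, is a zero-order hold: each object keeps its most recently observed feature while it is occluded or out of view. The function name, the dict-based frame format, and the toy features are all hypothetical and are not taken from the paper.

```python
def persist_features(frames):
    """Carry each object's last observed feature forward through time.

    frames: list of dicts mapping object id -> feature, containing only the
            objects visible in that frame.
    Returns one dict per frame covering every object seen so far.
    """
    buffer = {}  # last known feature for each object id
    world = []
    for visible in frames:
        buffer.update(visible)       # refresh objects observed in this frame
        world.append(dict(buffer))   # unobserved objects keep their held value
    return world

# Hypothetical clip: the cup is occluded in the second frame.
frames = [
    {"cup": [1.0], "person": [0.5]},
    {"person": [0.7]},
    {"cup": [3.0], "person": [0.9]},
]
for t, state in enumerate(persist_features(frames)):
    print(t, state)
```

A hold like this has no notion of motion while an object is hidden, which is consistent with the abstract's description of 4DST replacing the static buffer with learned temporal attention.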

Title: Visual-ERM: Reward Modeling for Visual Equivalence

Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13224
Pdf URL: https://arxiv.org/pdf/2603.13224
Copy Paste: [[2603.13224]] Visual-ERM: Reward Modeling for Visual Equivalence(https://arxiv.org/abs/2603.13224)
Keywords: generative
Abstract: Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

Title: PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

Authors: Yangsong Zhang, Anujith Muraleedharan, Rikhat Akizhanov, Abdul Ahad Butt, Gül Varol, Pascal Fua, Fabio Pizzati, Ivan Laptev
Subjects: cs.LG, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.13228
Pdf URL: https://arxiv.org/pdf/2603.13228
Copy Paste: [[2603.13228]] PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization(https://arxiv.org/abs/2603.13228)
Keywords: generation
Abstract: Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may exhibit substantial deviations from the original motion. To address this issue, we propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate the WBC into our training pipeline and optimize the diffusion model such that the output of the WBC becomes compliant with both physics and the original text instructions. To train PhysMoDPO, we deploy physics-based and task-specific rewards and use them to assign preferences to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and to real-world deployment on a G1 humanoid robot.
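As a rough illustration of how rewards can be turned into preferences and then into a training signal, the sketch below pairs the highest- and lowest-reward candidates and applies the generic Direct Preference Optimization loss on log-likelihoods. This is the standard DPO objective, not necessarily the diffusion-specific variant PhysMoDPO uses; the pairing rule, function names, and numeric values are assumptions made for the example.

```python
import math

def pick_preference_pair(rewards):
    """Return (chosen, rejected) indices: highest and lowest reward."""
    indices = range(len(rewards))
    chosen = max(indices, key=rewards.__getitem__)
    rejected = min(indices, key=rewards.__getitem__)
    return chosen, rejected

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (chosen w, rejected l) pair.

    logp_*     : log-likelihoods under the policy being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical combined rewards (physics + task) for three sampled trajectories.
rewards = [0.2, 0.9, 0.1]
w, l = pick_preference_pair(rewards)

# Hypothetical log-likelihoods for the same trajectories.
policy_logp = [-3.0, -1.0, -3.5]
reference_logp = [-2.5, -2.0, -2.0]
print(w, l, dpo_loss(policy_logp[w], policy_logp[l], reference_logp[w], reference_logp[l], beta=1.0))
```

The loss falls below log 2 once the policy favors the chosen trajectory more strongly than the reference model does, and it keeps shrinking as that margin grows.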