2026-03-11

Title: VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

Authors: Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08812
Pdf URL: https://arxiv.org/pdf/2603.08812
Copy Paste: [[2603.08812]] VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model(https://arxiv.org/abs/2603.08812)
Keywords: generation
Abstract: Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover a reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimizes on the VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.

Title: Are Expressive Encoders Necessary for Discrete Graph Generation?

Authors: Jay Revolinsky, Harry Shomer, Jiliang Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08825
Pdf URL: https://arxiv.org/pdf/2603.08825
Copy Paste: [[2603.08825]] Are Expressive Encoders Necessary for Discrete Graph Generation?(https://arxiv.org/abs/2603.08825)
Keywords: generation
Abstract: Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% Validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph structures. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.
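The GenGNN abstract above credits residual connections with mitigating oversmoothing. The toy sketch below illustrates that general effect only; it is not the GenGNN architecture. The function names, the scalar node features, and the three-node path graph are all assumptions made for illustration.

```python
def mean_aggregate(features, adjacency, residual):
    """One message-passing step: each node receives the mean of its neighbours.

    With residual=True the node's own value is added back (a skip connection).
    """
    updated = {}
    for node, value in features.items():
        neighbours = adjacency[node]
        message = sum(features[n] for n in neighbours) / len(neighbours) if neighbours else 0.0
        updated[node] = value + message if residual else message
    return updated


def spread(features):
    """Gap between the largest and smallest node feature (0 means fully smoothed)."""
    return max(features.values()) - min(features.values())


# Hypothetical path graph 0 - 1 - 2 with scalar features.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 1.0, 1: 0.0, 2: -1.0}

plain = mean_aggregate(features, adjacency, residual=False)
skip = mean_aggregate(features, adjacency, residual=True)
```

On this toy graph the plain layer makes all three node features identical after a single step, while the residual variant keeps them distinct. Real graphs and learned weights behave less cleanly, so this is only a minimal picture of why a skip connection can help.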

Title: HECTOR: Hybrid Editable Compositional Object References for Video Generation

Authors: Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08850
Pdf URL: https://arxiv.org/pdf/2603.08850
Copy Paste: [[2603.08850]] HECTOR: Hybrid Editable Compositional Object References for Video Generation(https://arxiv.org/abs/2603.08850)
Keywords: generation, generative
Abstract: Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

Title: Towards Visual Query Segmentation in the Wild

Authors: Bing Fan, Minghao Li, Hanzhi Zhang, Shaohua Dong, Naga Prudhvi Mareedu, Weishi Shi, Yunhe Feng, Yan Huang, Heng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08898
Pdf URL: https://arxiv.org/pdf/2603.08898
Copy Paste: [[2603.08898]] Towards Visual Query Segmentation in the Wild(https://arxiv.org/abs/2603.08898)
Keywords: generation
Abstract: In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.

Title: A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems

Authors: Mohammad Hossein Safarpour, Seyed Mohammad Alavi, Mohammad Izadikhah, Hossein Dibachi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08900
Pdf URL: https://arxiv.org/pdf/2603.08900
Copy Paste: [[2603.08900]] A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems(https://arxiv.org/abs/2603.08900)
Keywords: generation
Abstract: Considering the high volume, wide variety, and rapid speed of data generation, investigating feature selection methods for big data presents various applications and advantages. By removing irrelevant and redundant features, feature selection reduces data dimensions, thereby facilitating optimal decision-making within decision systems. One of the key tools for feature selection in hybrid information systems is fuzzy rough set theory. However, this theory faces two significant challenges: First, obtaining fuzzy equivalence relations through intersection operations in high-dimensional spaces can be both time-consuming and memory-intensive. Second, this method may produce noisy data, complicating the feature selection process. The purpose and innovation of this paper are to address these issues. We propose a new feature selection model that calculates the combined distance between objects and subsequently uses this information to derive the fuzzy equivalence relation. Rather than directly solving the feature selection problem, this approach reformulates it into an optimization problem that can be tackled using appropriate meta-heuristic algorithms. We have named this new approach FSbuHD. The FSbuHD model operates in two modes - normal and optimistic - based on the selection of one of the two introduced fuzzy equivalence relations. The model is then tested on standard datasets from the UCI repository and compared with other algorithms. The results of this research demonstrate that FSbuHD is one of the most efficient and effective methods for feature selection when compared to previous methods and algorithms.
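The FSbuHD abstract above describes computing a combined distance between objects in a hybrid (numeric plus categorical) information system and deriving a fuzzy relation from it. The sketch below shows one common way this could look; the abstract does not give the paper's formulas, so the range-normalised numeric distance, the 0/1 categorical mismatch, the averaging, and the `1 - distance` relation are all assumptions, as are the function and attribute names.

```python
def hybrid_distance(a, b, numeric_ranges):
    """Average per-attribute distance in [0, 1] between two objects (dicts).

    Numeric attributes (listed in numeric_ranges as (low, high)) use a
    range-normalised absolute difference; all others use a 0/1 mismatch.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            low, high = numeric_ranges[key]
            total += abs(a[key] - b[key]) / (high - low) if high > low else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def fuzzy_relation(objects, numeric_ranges):
    """Similarity matrix R[i][j] = 1 - distance(objects[i], objects[j])."""
    return [[1.0 - hybrid_distance(x, y, numeric_ranges) for y in objects] for x in objects]


# Hypothetical hybrid data: one numeric and one categorical attribute.
objects = [
    {"age": 20, "color": "red"},
    {"age": 40, "color": "red"},
    {"age": 60, "color": "blue"},
]
relation = fuzzy_relation(objects, {"age": (20, 60)})
```

The relation built this way is reflexive and symmetric. Whether it also satisfies a transitivity condition depends on the distance and t-norm chosen, which is one of the details this sketch leaves open.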

Title: MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

Authors: Xinqi Fan, Jingting Li, John See, Moi Hoon Yap, Su-Jing Wang, Adrian K. Davison
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2603.08927
Pdf URL: https://arxiv.org/pdf/2603.08927
Copy Paste: [[2603.08927]] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering(https://arxiv.org/abs/2603.08927)
Keywords: generation
Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at this https URL.

Title: TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers

Authors: Yihua Liu, Fanjiang Ye, Bowen Lin, Rongyu Fang, Chengming Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08928
Pdf URL: https://arxiv.org/pdf/2603.08928
Copy Paste: [[2603.08928]] TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers(https://arxiv.org/abs/2603.08928)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) face challenges when generating images at resolutions higher than the training resolution, suffering in particular from structural degradation caused by attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.
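The TIDE abstract above refers to attention dilution at higher resolutions and to sharpening attention with a temperature. The sketch below illustrates only those two generic effects with a plain temperature-scaled softmax; it does not reproduce TIDE's text anchoring or its step-aware schedule. The function names and the single "relevant token" setup are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def peak_attention(num_tokens, temperature=1.0, signal=2.0):
    """Attention weight on one relevant token among num_tokens - 1 distractors."""
    logits = [signal] + [0.0] * (num_tokens - 1)
    return softmax(logits, temperature)[0]


low_res = peak_attention(16)                      # fewer tokens
high_res = peak_attention(64)                     # more tokens: weight is diluted
sharpened = peak_attention(64, temperature=0.5)   # lower temperature restores focus
```

With more tokens competing, the weight on the relevant token drops even though its logit is unchanged; lowering the temperature raises it again. A fixed temperature is the simplest choice here and is not meant to represent the dynamic control the abstract describes.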

Title: Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

Authors: Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario, Lars Lundqvist, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08930
Pdf URL: https://arxiv.org/pdf/2603.08930
Copy Paste: [[2603.08930]] Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning(https://arxiv.org/abs/2603.08930)
Keywords: generation
Abstract: This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs -- Gemma 3 and Qwen3-VL -- to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
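The abstract above lists "JSON integrity" as one evaluation category for VLM-generated simulation configurations. A check of that kind might look like the sketch below. The field names `plant_count` and `sun_azimuth` are borrowed from the parameters the abstract mentions, but the schema, the function name, and the error messages are hypothetical and not taken from the paper.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"plant_count": int, "sun_azimuth": (int, float)}


def check_json_integrity(raw_text):
    """Return (ok, problems) for a model response expected to be a JSON object."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return False, ["top-level value is not an object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in config:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(config[name], bool) or not isinstance(config[name], expected):
            problems.append(f"wrong type for field: {name}")
    return not problems, problems


good = check_json_integrity('{"plant_count": 12, "sun_azimuth": 135.5}')
truncated = check_json_integrity('{"plant_count": 12')
mistyped = check_json_integrity('{"plant_count": "12", "sun_azimuth": 90}')
```

Separating "does it parse" from "does it have the expected fields and types" lets truncated output and structurally valid but wrong output be counted as different failure modes.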

Title: SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

Authors: Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08982
Pdf URL: https://arxiv.org/pdf/2603.08982
Copy Paste: [[2603.08982]] SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing(https://arxiv.org/abs/2603.08982)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
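The SVG-EAR abstract above describes approximating skipped attention blocks with cluster centroids. The sketch below shows one plausible reading for a single query: a skipped block of n keys and values is replaced by n copies of its mean key and mean value. The exact compensation formula and the error-aware routing probe are not given in the abstract, so this formulation and all names here are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def attention(query, blocks, skipped=()):
    """Softmax attention for one query over blocks of (keys, values).

    Blocks whose index is in `skipped` are not computed exactly; each is
    approximated by len(keys) copies of its centroid key and centroid value.
    """
    dim = len(blocks[0][1][0])
    numerator = [0.0] * dim
    denominator = 0.0
    for index, (keys, values) in enumerate(blocks):
        if index in skipped:
            weight = len(keys) * math.exp(dot(query, centroid(keys)))
            terms = [(weight, centroid(values))]
        else:
            terms = [(math.exp(dot(query, k)), v) for k, v in zip(keys, values)]
        for weight, value in terms:
            denominator += weight
            numerator = [acc + weight * x for acc, x in zip(numerator, value)]
    return [acc / denominator for acc in numerator]


# Hypothetical toy data: block 0 has similar keys, block 1 has identical keys.
blocks = [
    ([[1.0, 0.0], [0.9, 0.1]], [[1.0], [2.0]]),
    ([[0.0, 1.0], [0.0, 1.0]], [[3.0], [5.0]]),
]
query = [1.0, 0.0]
exact = attention(query, blocks)
skip_identical = attention(query, blocks, skipped={1})
skip_similar = attention(query, blocks, skipped={0})
```

When the keys in a block are identical the centroid approximation reproduces the exact output, and when they are merely similar it leaves a small error. That error is the quantity a routing step would need to estimate when deciding which blocks to compute exactly. The toy omits the usual max-subtraction for numerical stability.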

Title: Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning

Authors: Bolutife Atoki, Iuliia Tkachenko, Bertrand Kerautret, Carlos Crispim-Junior
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.08998
Pdf URL: https://arxiv.org/pdf/2603.08998
Copy Paste: [[2603.08998]] Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning(https://arxiv.org/abs/2603.08998)
Keywords: generative
Abstract: Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.

Title: The Coupling Within: Flow Matching via Distilled Normalizing Flows

Authors: David Berthelot, Tianrong Chen, Jiatao Gu, Marco Cuturi, Laurent Dinh, Bhavik Chandna, Michal Klein, Josh Susskind, Shuangfei Zhai
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.09014
Pdf URL: https://arxiv.org/pdf/2603.09014
Copy Paste: [[2603.09014]] The Coupling Within: Flow Matching via Distilled Normalizing Flows(https://arxiv.org/abs/2603.09014)
Keywords: generation
Abstract: Flow models have rapidly become the go-to method for training and deploying large-scale generators, owing their success to inference-time flexibility via adjustable integration steps. A crucial ingredient in flow training is the choice of coupling measure for sampling noise/data pairs that define the flow matching (FM) regression loss. While FM training usually defaults to independent coupling, recent works show that adaptive couplings informed by noise/data distributions (e.g., via optimal transport, OT) improve both model training and inference. We radicalize this insight by shifting the paradigm: rather than computing adaptive couplings directly, we use distilled couplings from a different, pretrained model capable of placing noise and data spaces in bijection -- a property intrinsic to normalizing flows (NF) through their maximum likelihood and invertibility requirements. Leveraging recent advances in NF image generation via auto-regressive (AR) blocks, we propose Normalized Flow Matching (NFM), a new method that distills the quasi-deterministic coupling of pretrained NF models to train student flow models. These students achieve the best of both worlds: significantly outperforming flow models trained with independent or even OT couplings, while also improving on the teacher AR-NF model.
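The flow matching abstract above turns on how noise/data pairs are coupled before the FM regression loss is computed. The sketch below shows the standard linear-interpolation FM target in one dimension and compares two couplings: pairing samples in the order they arrive (standing in for independent coupling) and pairing them by sorted order, which in one dimension is the optimal-transport matching for squared cost. It does not implement NFM's distilled coupling from a pretrained normalizing flow; the function names and sample values are assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def fm_loss(predict, pairs, times):
    """Mean squared error between predicted velocity and the target x1 - x0."""
    errors = [
        (predict(interpolate(x0, x1, t), t) - (x1 - x0)) ** 2
        for (x0, x1), t in zip(pairs, times)
    ]
    return sum(errors) / len(errors)


def coupling_cost(pairs):
    """Total squared displacement implied by a noise/data pairing."""
    return sum((x1 - x0) ** 2 for x0, x1 in pairs)


def sorted_coupling(noise, data):
    """Monotone pairing; optimal for squared cost in one dimension."""
    return list(zip(sorted(noise), sorted(data)))


# Hypothetical 1D samples.
noise = [0.9, -1.2, 0.1]
data = [2.0, 3.5, 2.6]
independent = list(zip(noise, data))
monotone = sorted_coupling(noise, data)
```

A coupling with lower transport cost gives straighter, less conflicting regression targets, which is the general motivation for moving beyond independent pairing. The loss function is unchanged across couplings; only the pairs fed into it differ.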

Title: Spectral-Structured Diffusion for Single-Image Rain Removal

Authors: Yucheng Xing, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09054
Pdf URL: https://arxiv.org/pdf/2603.09054
Copy Paste: [[2603.09054]] Spectral-Structured Diffusion for Single-Image Rain Removal(https://arxiv.org/abs/2603.09054)
Keywords: restoration
Abstract: Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
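The SpectralDiff abstract above mentions using the convolution theorem to replace convolutions with element-wise products. The sketch below checks the theorem itself on a short 1D signal: circular convolution in the signal domain matches an element-wise product in the DFT domain. It is a numerical illustration of the identity, not the paper's full-product U-Net; the naive O(n^2) DFT and the sample values are chosen for clarity.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def circular_convolution(x, kernel):
    """Direct circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[j] * kernel[(i - j) % n] for j in range(n)) for i in range(n)]


def spectral_product(x, kernel):
    """Same result via the convolution theorem: multiply spectra, then invert."""
    product = [a * b for a, b in zip(dft(x), dft(kernel))]
    return [value.real for value in dft(product, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.25]
direct = circular_convolution(x, kernel)
via_spectrum = spectral_product(x, kernel)
```

The two results agree up to floating-point error. In practice an FFT would replace the naive transform, which is where the efficiency gain of working in the spectral domain comes from.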

Title: GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models

Authors: Md Selim Sarowar, Omer Tariq, Sungho Kim
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.09079
Pdf URL: https://arxiv.org/pdf/2603.09079
Copy Paste: [[2603.09079]] GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models(https://arxiv.org/abs/2603.09079)
Keywords: generation
Abstract: VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing it uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision-demanding tasks.

Title: OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing

Authors: Lixiang Lin, Siyuan Jin, Jinshan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09084
Pdf URL: https://arxiv.org/pdf/2603.09084
Copy Paste: [[2603.09084]] OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing(https://arxiv.org/abs/2603.09084)
Keywords: generation
Abstract: Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at this https URL.

Title: Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

Authors: Zixuan Wang, Yixin Hu, Haolan Wang, Feng Chen, Yan Liu, Wen Li, Yinjie Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09094
Pdf URL: https://arxiv.org/pdf/2603.09094
Copy Paste: [[2603.09094]] Chain of Event-Centric Causal Thought for Physically Plausible Video Generation(https://arxiv.org/abs/2603.09094)
Keywords: generation
Abstract: Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
摘要：物理上合理的视频生成（PPVG）已成为模拟现实世界物理现象的一种有前途的途径。 PPVG 需要理解常识知识，这对于视频扩散模型来说仍然是一个挑战。当前的方法利用大型语言模型的常识推理能力将物理概念嵌入到提示中。然而，由于缺乏用于建模因果进展的调节机制，生成模型通常将物理现象呈现为由提示定义的单个时刻。在本文中，我们将 PPVG 视为生成一系列因果关联且动态演变的事件。为了实现这一范式，我们设计了两个关键模块：（1）物理驱动的事件链推理。该模块利用思维链推理，将提示中描述的物理现象分解为多个基本事件单元。为了减轻因果模糊性，我们嵌入物理公式作为约束，以在推理过程中施加确定性因果依赖性。 (2) 转换感知的跨模式提示（TCP）。为了保持事件之间的连续性，该模块将因果事件单元转换为时间对齐的视觉语言提示。它总结离散事件描述以获得因果一致的叙述，同时通过交互式编辑逐步合成单个事件的视觉关键帧。 PhyGenBench 和 VideoPhy 基准测试的综合实验表明，我们的框架在跨不同物理领域生成物理上合理的视频方面实现了卓越的性能。我们的代码很快就会发布。

Title: Training-free Motion Factorization for Compositional Video Generation

Authors: Zixuan Wang, Ziqin Zhou, Feng Chen, Duo Peng, Yixin Hu, Changsheng Li, Yinjie Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09104
Pdf URL: https://arxiv.org/pdf/2603.09104
Copy Paste: [[2603.09104]] Training-free Motion Factorization for Compositional Video Generation(https://arxiv.org/abs/2603.09104)
Keywords: generation
Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning-before-generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic and can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
摘要：合成视频生成旨在合成具有不同外观和运动的多个实例，广泛适用于现实场景。然而，当前的方法主要集中于绑定语义，忽略了理解提示中指定的不同运动类别。在本文中，我们提出了一种运动分解框架，将复杂运动分解为三个主要类别：静止、刚性运动和非刚性运动。具体来说，我们的框架遵循生成前规划范例。（1）在规划过程中，我们对运动图上的运动规律进行推理，以获得每个实例的形状和位置的逐帧变化。这通过将用户提示组织成实例及其交互的结构化表示来减轻用户提示中的语义歧义。 (2)在生成过程中，我们以一种解开的方式调节不同运动类别的合成。根据运动线索，引导分支稳定静止区域的外观，保留刚体几何形状，并规范局部非刚性变形。至关重要的是，我们的两个模块与模型无关，可以无缝地合并到各种扩散模型架构中。大量的实验表明，我们的框架在现实世界基准的运动合成中取得了令人印象深刻的性能。我们的代码很快就会发布。

Title: Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Authors: Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09117
Pdf URL: https://arxiv.org/pdf/2603.09117
Copy Paste: [[2603.09117]] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards(https://arxiv.org/abs/2603.09117)
Keywords: generation
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies are devoted to directly incorporating a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
摘要：可验证奖励的强化学习 (RLVR) 显着增强了大型语言模型 (LLM) 的推理能力，但严重受到校准退化的影响，即模型对错误答案过于自信。先前的研究致力于将校准目标直接纳入现有的优化目标中。然而，我们的理论分析表明，最大化策略准确性和最小化校准误差的优化之间存在根本的梯度冲突。基于这一见解，我们提出了 DCPO，这是一个简单而有效的框架，可以系统地解耦推理和校准目标。大量的实验表明，我们的 DCPO 不仅保持了与 GRPO 相当的精度，而且还实现了最佳的校准性能，并大大缓解了过度自信的问题。我们的研究为更可靠的法学硕士部署提供了宝贵的见解和实用的解决方案。
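
To make "calibration error" concrete, here is a minimal sketch of expected calibration error (ECE), a common calibration metric. The abstract does not say which metric DCPO is evaluated with, so the choice of ECE, the equal-width binning, and the function name are all assumptions made for illustration.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins (an assumed, commonly used metric)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Over-confident model: claims 95% confidence but is right 1 time in 4.
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, False])
# Well-calibrated model: claims 75% confidence and is right 3 times in 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

In this toy case the over-confident model has a gap of about 0.70 between stated confidence and accuracy, while the calibrated one has a gap of zero.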

Title: QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model

Authors: Junjie Yin, Jiaju Li, Hanfa Xing
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09125
Pdf URL: https://arxiv.org/pdf/2603.09125
Copy Paste: [[2603.09125]] QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model(https://arxiv.org/abs/2603.09125)
Keywords: restoration, super-resolution, generation
Abstract: Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at this https URL.
摘要：基于扩散的图像超分辨率（ISR）已显示出强大的潜力，但在退化未知且空间不均匀的现实场景中仍然举步维艰，通常会导致细节丢失或视觉伪影。为了应对这一挑战，我们提出了一种新颖的超分辨率扩散模型 QUSR，它将质量感知先验（QAP）与不确定性引导噪声生成（UNG）模块集成在一起。 UNG 模块自适应地调整噪声注入强度，对高不确定性区域（例如边缘和纹理）应用更强的扰动以重建复杂细节，同时最大限度地减少低不确定性区域（例如平坦区域）中的噪声以保留原始信息。同时，QAP 利用先进的多模态大语言模型 (MLLM) 生成可靠的质量描述，为恢复过程提供有效且可解释的质量先验。实验结果证实QUSR可以在现实场景中产生高保真度和高真实感的图像。源代码可从此 https URL 获取。
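
The idea behind uncertainty-guided noise injection can be sketched as a per-pixel noise scale that grows with an uncertainty map. This is a rough illustration, not QUSR's implementation: the linear mapping, the `sigma_min`/`sigma_max` parameters, and both function names are hypothetical, and the uncertainty map is assumed to lie in [0, 1].

```python
import random
from typing import List

Grid = List[List[float]]


def noise_scale_map(uncertainty: Grid,
                    sigma_min: float = 0.1,
                    sigma_max: float = 1.0) -> Grid:
    """Map per-pixel uncertainty in [0, 1] to a noise standard deviation.

    The linear interpolation is an assumption made for this sketch.
    """
    return [
        [sigma_min + (sigma_max - sigma_min) * min(max(u, 0.0), 1.0) for u in row]
        for row in uncertainty
    ]


def inject_noise(image: Grid, uncertainty: Grid, rng: random.Random,
                 sigma_min: float = 0.1, sigma_max: float = 1.0) -> Grid:
    """Add Gaussian noise whose strength follows the uncertainty map."""
    scales = noise_scale_map(uncertainty, sigma_min, sigma_max)
    return [
        [pixel + scale * rng.gauss(0.0, 1.0) for pixel, scale in zip(img_row, s_row)]
        for img_row, s_row in zip(image, scales)
    ]


# Flat area (0.0), moderate texture (0.5), strong edge (1.0).
uncertainty = [[0.0, 0.5, 1.0]]
scales = noise_scale_map(uncertainty)
noisy = inject_noise([[0.5, 0.5, 0.5]], uncertainty, random.Random(0))
```

Low-uncertainty pixels receive noise close to `sigma_min`, so their original content is largely preserved, while high-uncertainty pixels are perturbed more strongly.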

Title: Rotation Equivariant Mamba for Vision Tasks

Authors: Zhongchen Zhao, Qi Xie, Keyu Huang, Lei Zhang, Deyu Meng, Zongben Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09138
Pdf URL: https://arxiv.org/pdf/2603.09138
Copy Paste: [[2603.09138]] Rotation Equivariant Mamba for Vision Tasks(https://arxiv.org/abs/2603.09138)
Keywords: super-resolution
Abstract: Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at this https URL.
摘要：旋转等变性构成了视觉数据最普遍和最重要的结构先验之一，但在当前基于 Mamba 的视觉架构中仍然明显缺乏它。尽管 Mamba 在自然语言处理方面取得了成功，并且在计算机视觉领域的应用日益广泛，但现有的视觉 Mamba 模型未能在设计中考虑旋转对称性。这种遗漏使得它们本质上对图像旋转敏感，从而限制了它们的鲁棒性和跨任务泛化。为了解决这个限制，我们建议将旋转对称（图像中普遍且基本的几何先验）纳入基于 Mamba 的架构中。具体来说，我们引入了 EQ-VMamba，这是第一个用于视觉任务的旋转等变视觉 Mamba 架构。 EQ-VMamba 的核心组件包括精心设计的旋转等变交叉扫描策略和组 Mamba 块。此外，我们对内在等方差误差进行了严格的理论分析，证明所提出的架构在整个网络中强制执行端到端旋转等方差。跨多个基准的广泛实验（包括高级图像分类任务、中级语义分割任务和低级图像超分辨率任务）表明，与非等变基线相比，EQ-VMamba 实现了卓越或有竞争力的性能，同时需要的参数减少了大约 50%。这些结果表明，嵌入旋转等方差不仅有效增强了视觉 Mamba 模型针对旋转变换的鲁棒性，而且还通过显着提高参数效率来增强整体性能。代码可从此 https URL 获取。

Title: Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G

Authors: Loc X. Nguyen, Ji Su Yoon, Huy Q. Le, Yu Qiao, Avi Deb Raha, Eui-Nam Huh, Nguyen H. Tran, Choong Seon Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09141
Pdf URL: https://arxiv.org/pdf/2603.09141
Copy Paste: [[2603.09141]] Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G(https://arxiv.org/abs/2603.09141)
Keywords: generation
Abstract: The shift toward user-customized on-device learning places new demands on wireless systems: models must be trained on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints. To address this, we propose an Agentic AI as the control layer for managing federated learning (FL) over 6G networks, which translates high-level task goals into actions that are aware of network conditions. Rather than simply viewing FL as a learning challenge, our system sees it as a combined task of learning and network management. A set of specialized agents focused on retrieval, planning, coding, and evaluation utilizes monitoring tools and optimization methods to handle client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation. The use of closed-loop evaluation and memory allows the system to consistently refine its decisions, taking into account varying signal-to-noise ratios, bandwidth conditions, and device capabilities. Finally, our case study has demonstrated the effectiveness of the Agentic AI system's use of tools for achieving high performance.
摘要：向用户定制的设备上学习的转变对无线系统提出了新的要求：模型必须在多样化的分布式数据上进行训练，同时满足严格的延迟、带宽和可靠性约束。为了解决这个问题，我们提出了一个 Agentic AI 作为控制层，用于管理 6G 网络上的联邦学习 (FL)，它将高级任务目标转化为了解网络条件的行动。我们的系统并没有简单地将 FL 视为一项学习挑战，而是将其视为学习和网络管理的综合任务。一组专注于检索、规划、编码和评估的专业代理利用监控工具和优化方法来处理客户选择、激励结构、调度、资源分配、自适应本地培训和代码生成。闭环评估和内存的使用使系统能够不断完善其决策，同时考虑到不同的信噪比、带宽条件和设备功能。最后，我们的案例研究证明了 Agentic AI 系统使用工具来实现高性能的有效性。

Title: RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Authors: Tzu-Heng Huang, Sirajul Salekin, Javier Movellan, Frederic Sala, Manjot Bilkhu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09160
Pdf URL: https://arxiv.org/pdf/2603.09160
Copy Paste: [[2603.09160]] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning(https://arxiv.org/abs/2603.09160)
Keywords: generation, quality assessment
Abstract: Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
摘要：密集图像字幕对于视觉语言预训练和文本到图像生成中的跨模式对齐至关重要，但扩展专家质量的注释成本高昂。虽然通过强大的视觉语言模型 (VLM) 进行合成字幕是一种实用的替代方案，但监督蒸馏通常会产生有限的输出多样性和弱泛化性。强化学习（RL）可以克服这些限制，但迄今为止它的成功主要集中在依赖确定性检查器的可验证领域——这是开放式字幕所不具备的奢侈品。我们使用 RubiCap 解决了这个瓶颈，这是一种新颖的 RL 框架，可以从 LLM 编写的规则中导出细粒度的、特定于样本的奖励信号。 RubiCap 首先组建了一个由候选人标题组成的多元化委员会，然后聘请法学硕士标题作者来提取共识优势并诊断当前政策中的缺陷。这些见解被转化为明确的评估标准，使法学硕士法官能够分解整体质量评估，并用结构化、多方面的评估取代粗略的标量奖励。在广泛的基准测试中，RubiCap 在 CapArena 上实现了最高的获胜率，优于监督蒸馏、先前的 RL 方法、人类专家注释和 GPT-4V 增强输出。在 CaptionQA 上，它展示了卓越的单词效率：我们的 7B 模型与 Qwen2.5-VL-32B-Instruct 匹配，而我们的 3B 模型超越了其 7B 模型。值得注意的是，使用紧凑型 RubiCap-3B 作为字幕生成器可以产生比使用专有模型字幕进行训练的 VLM 更强的预训练 VLM。

Title: Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL

Authors: Siyang Cai, Cangyuan Li, Yinhe Han, Ying Wang
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2603.09161
Pdf URL: https://arxiv.org/pdf/2603.09161
Copy Paste: [[2603.09161]] Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL(https://arxiv.org/abs/2603.09161)
Keywords: generation
Abstract: Learning effective netlist representations is fundamentally constrained by the scarcity of labeled datasets, as real designs are protected by Intellectual Property (IP) and costly to annotate. Existing work therefore focuses on small-scale circuits with clean labels, limiting scalability to realistic designs. Meanwhile, Large Language Models (LLMs) can generate Register-Transfer-Level (RTL) at scale, but their functional incorrectness has hindered their use in circuit analysis. In this work, we make a key observation: even when LLM-Generated RTL is functionally imperfect, the synthesized netlists still preserve structural patterns that are strongly indicative of the intended functionality. Building on this insight, we propose a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-Generated RTL as training data for netlist representation learning, forming an end-to-end pipeline from automated code generation to downstream tasks. We conduct evaluations on circuit functional understanding tasks, including sub-circuit boundary identification and component classification, across benchmarks of increasing scales, extending the task scope from operator-level to IP-level. The evaluations demonstrate that models trained on our noisy synthetic corpus generalize well to real-world netlists, matching or even surpassing methods trained on scarce high-quality data and effectively breaking the data bottleneck in circuit representation learning.
摘要：学习有效的网表表示从根本上受到标记数据集稀缺的限制，因为真正的设计受知识产权 (IP) 保护并且注释成本高昂。因此，现有的工作重点是具有干净标签的小规模电路，限制了实际设计的可扩展性。与此同时，大型语言模型（LLM）可以大规模生成寄存器传输级（RTL），但其功能不正确阻碍了它们在电路分析中的使用。在这项工作中，我们做了一个关键的观察：即使 LLM 生成的 RTL 在功能上不完善，合成的网表仍然保留了强烈指示预期功能的结构模式。基于这一见解，我们提出了一种经济有效的数据增强和训练框架，系统地利用不完善的 LLM 生成的 RTL 作为网表表示学习的训练数据，形成从自动代码生成到下游任务的端到端管道。我们对电路功能理解任务进行评估，包括子电路边界识别和组件分类，跨尺度不断增加的基准，将任务范围从算子级扩展到IP级。评估表明，在我们的噪声合成语料库上训练的模型可以很好地推广到现实世界的网表，匹配甚至超越在稀缺高质量数据上训练的方法，并有效打破电路表示学习中的数据瓶颈。

Title: Progressive Split Mamba: Effective State Space Modelling for Image Restoration

Authors: Mohammed Hassanin, Nour Moustafa, Weijian Deng, Ibrahim Radwan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09171
Pdf URL: https://arxiv.org/pdf/2603.09171
Copy Paste: [[2603.09171]] Progressive Split Mamba: Effective State Space Modelling for Image Restoration(https://arxiv.org/abs/2603.09171)
Keywords: restoration, super-resolution
Abstract: Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity. To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
摘要：图像恢复需要同时保留细粒度的局部结构并保持远程空间相干性。虽然卷积网络与有限的接受域作斗争，并且 Transformer 会产生二次复杂性以引起全球关注，但最近的状态空间模型 (SSM)（例如 Mamba）为远程依赖建模提供了一种有吸引力的线性时间替代方案。然而，天真地将 Mamba 扩展到 2D 图像暴露了两个内在的缺点。首先，将 2D 特征图展平为 1D 序列会破坏空间拓扑，导致局部性失真，从而阻碍精确的结构恢复。其次，SSM 的稳定性驱动的循环动态会引起长程衰减，逐渐减弱遥远空间位置的信息并削弱全局一致性。这些影响共同限制了高保真恢复中状态空间建模的有效性。我们提出了渐进式 Split-Mamba (PS-Mamba)，这是一种拓扑感知的分层状态空间框架，旨在协调局部性保留与高效的全局传播。 PS-Mamba 不是按顺序展平整个特征图，而是执行几何一致的分区，在状态空间处理之前保持邻域完整性。渐进式分割层次结构（半分、象限、八分圆）可实现结构化多尺度建模，同时保留线性复杂性。为了抵消长程衰减，我们引入了对称的跨尺度捷径，可以跨层次直接传输低频全局上下文，从而稳定大空间范围内的信息流。关于超分辨率、去噪和 JPEG 伪影减少的大量实验表明，与最近基于 Mamba 和基于注意力的模型相比，有了明显的改进。
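
One way to picture the progressive split hierarchy (halves, quadrants, octants) is as successively finer block partitions, with each block flattened on its own so that neighbouring pixels stay together in the sequence. The sketch below is illustrative only: the grid shapes (1x2, 2x2, 2x4), the row-major scan inside each block, and the helper name `split_blocks` are assumptions, not details given in the abstract.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def split_blocks(height: int, width: int, rows: int, cols: int) -> List[List[Coord]]:
    """Partition an H x W grid into rows x cols blocks.

    Each block is returned as its (y, x) positions in row-major order,
    i.e. the 1D sequence a state-space layer would scan for that block.
    """
    if height % rows or width % cols:
        raise ValueError("grid must divide evenly into blocks")
    block_h, block_w = height // rows, width // cols
    blocks = []
    for by in range(rows):
        for bx in range(cols):
            blocks.append([
                (y, x)
                for y in range(by * block_h, (by + 1) * block_h)
                for x in range(bx * block_w, (bx + 1) * block_w)
            ])
    return blocks


# Assumed block layouts for the three hierarchy levels.
LEVELS: Dict[str, Tuple[int, int]] = {
    "halves": (1, 2),
    "quadrants": (2, 2),
    "octants": (2, 4),
}

hierarchy = {
    name: split_blocks(4, 8, rows, cols) for name, (rows, cols) in LEVELS.items()
}
```

Every level visits each position exactly once, so the total sequence length, and hence the linear cost of scanning, is unchanged by the split.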

Title: Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning

Authors: Lina Berrayana, Ahmed Heakl, Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09184
Pdf URL: https://arxiv.org/pdf/2603.09184
Copy Paste: [[2603.09184]] Latent-DARM: Bridging Discrete Diffusion And Autoregressive Models For Reasoning(https://arxiv.org/abs/2603.09184)
Keywords: generation
Abstract: Most multi-agent systems rely exclusively on autoregressive language models (ARMs) that are based on sequential generation. Although effective for fluent text, ARMs limit global reasoning and plan revision. On the other hand, Discrete Diffusion Language Models (DDLMs) enable non-sequential, globally revisable generation and have shown strong planning capabilities, but their limited text fluency hinders direct collaboration with ARMs. We introduce Latent-DARM, a latent-space communication framework bridging DDLM (planners) and ARM (executors), maximizing collaborative benefits. Across mathematical, scientific, and commonsense reasoning benchmarks, Latent-DARM outperforms text-based interfaces on average, improving accuracy from 27.0% to 36.0% on DART-5 and from 0.0% to 14.0% on AIME2024. Latent-DARM approaches the results of state-of-the-art reasoning models while using less than 2.2% of its token budget. This work advances multi-agent collaboration among agents with heterogeneous models.
摘要：大多数多智能体系统完全依赖于基于顺序生成的自回归语言模型 (ARM)。尽管 ARM 对流畅的文本有效，但它限制了全局推理和计划修订。另一方面，离散扩散语言模型 (DDLM) 可实现非顺序、全局可修改的生成，并显示出强大的规划能力，但其有限的文本流畅性阻碍了与 ARM 的直接协作。我们引入 Latent-DARM，这是一种桥接 DDLM（规划器）和 ARM（执行器）的潜在空间通信框架，可最大限度地提高协作效益。在数学、科学和常识推理基准测试中，Latent-DARM 的平均性能优于基于文本的界面，DART-5 上的准确率从 27.0% 提高到 36.0%，AIME2024 上的准确率从 0.0% 提高到 14.0%。 Latent-DARM 接近最先进的推理模型的结果，同时使用不到 2.2% 的代币预算。这项工作促进了具有异构模型的代理之间的多代理协作。

Title: TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy

Authors: Yaoyu Liu, Minghui Zhang, Xin You, Hanxiao Zhang, Yun Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09217
Pdf URL: https://arxiv.org/pdf/2603.09217
Copy Paste: [[2603.09217]] TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy(https://arxiv.org/abs/2603.09217)
Keywords: generation
Abstract: Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate our superiority. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $\beta_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
摘要：由于其复杂的拓扑结构和对数据集变化的敏感性，对类似医疗血管的解剖结构进行建模具有挑战性。因此，特定于任务的模型经常会出现拓扑不一致的问题，包括人为断开连接和虚假合并。受到多模态大语言模型（MLLM）零样本泛化前景的推动，我们提出了 TubeMLLM，这是一种统一的基础模型，它将结构化理解与医疗血管解剖学的可控生成结合起来。通过显式自然语言提示整合拓扑先验，并将其与共享注意力架构中的视觉表示对齐，TubeMLLM 显着增强了拓扑感知感知。此外，我们构建了 TubeMData，这是一个先锋多模态基准，包含全面的以拓扑为中心的任务，并引入自适应损失加权策略来强调训练期间的拓扑关键区域。对十五个不同数据集的广泛实验证明了我们的优势。从数量上讲，TubeMLLM 实现了最先进的分布性能，大大减少了彩色眼底摄影的全局拓扑差异（与基线相比，将 $\beta_{0}$ 数字误差从 37.42 减少到 8.58）。值得注意的是，TubeMLLM 在看不见的 X 射线血管造影上表现出出色的零射击跨模态传输能力，实现了 67.50% 的 Dice 分数，同时将 $\beta_{0}$ 误差显着降低至 1.21。 TubeMLLM 还保持了针对模糊、噪声和低分辨率等退化的鲁棒性。此外，在拓扑感知理解任务中，该模型在评估掩模拓扑质量方面达到了 97.38% 的准确率，显着优于标准视觉语言基线。
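
The two metrics quoted in this abstract can be illustrated on a toy mask. The sketch below computes the $\beta_{0}$ error as the absolute difference in the number of connected components and the Dice score as the usual overlap ratio. The 8-connectivity choice and the function names are assumptions; the abstract does not state how the paper computes either quantity.

```python
from collections import deque
from typing import List

Mask = List[List[int]]  # 1 = vessel, 0 = background


def count_components(mask: Mask) -> int:
    """Number of connected foreground components (Betti-0), 8-connectivity assumed."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def beta0_error(pred: Mask, target: Mask) -> int:
    """Absolute difference in connected-component counts."""
    return abs(count_components(pred) - count_components(target))


def dice(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|)."""
    overlap = sum(p and t for p_row, t_row in zip(pred, target)
                  for p, t in zip(p_row, t_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 2 * overlap / total if total else 1.0


target = [[1, 1, 1, 1, 1]]      # one continuous vessel
prediction = [[1, 1, 0, 1, 1]]  # same vessel with an artificial disconnection
error = beta0_error(prediction, target)
overlap_score = dice(prediction, target)
```

A single missing pixel barely lowers the Dice score here (8/9) but splits the vessel in two, which is why a topology metric is reported alongside the overlap metric.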

Title: RAE-NWM: Navigation World Model in Dense Visual Representation Space

Authors: Mingkun Zhang, Wangtian Shen, Fan Zhang, Haijian Qin, Zihao Pei, Ziyang Meng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2603.09241
Pdf URL: https://arxiv.org/pdf/2603.09241
Copy Paste: [[2603.09241]] RAE-NWM: Navigation World Model in Dense Visual Representation Space(https://arxiv.org/abs/2603.09241)
Keywords: generation
Abstract: Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
摘要：视觉导航要求智能体通过感知和规划在复杂环境中实现目标。世界模型通过模拟动作条件状态转换来预测未来的观察来解决此任务。当前的导航世界模型通常在变分自动编码器的压缩潜在空间内学习状态演化，其中空间压缩通常会丢弃细粒度的结构信息并阻碍精确控制。为了更好地理解不同表示的传播特性，我们进行了线性动力学探测，并观察到密集的 DINOv2 特征对动作条件转换表现出更强的线性可预测性。受这一观察的启发，我们提出了基于表示自动编码器的导航世界模型（RAE-NWM），它在密集的视觉表示空间中对导航动态进行建模。我们采用带有解耦扩散变压器头的条件扩散变压器 (CDiT-DH) 来模拟连续过渡，并引入一个单独的时间驱动选通模块进行动态调节，以在生成过程中调节动作注入强度。广泛的评估表明，对该领域的连续推出进行建模可以提高结构稳定性和行动准确性，有利于下游规划和导航。

Title: When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

Authors: Chao Shuai, Zhenguang Liu, Shaojing Fan, Bin Gong, Weichen Lian, Xiuli Bi, Zhongjie Ba, Kui Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09242
Pdf URL: https://arxiv.org/pdf/2603.09242
Copy Paste: [[2603.09242]] When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2603.09242)
Keywords: generation, generative
Abstract: AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, e.g., CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed semantic fallback, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose Geometric Semantic Decoupling (GSD), a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4% video-level AUC (+1.2%) in cross-dataset evaluation, improving robustness to unseen manipulations (+3.0% on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+0.9%) and GenImage (+1.7%).
摘要：随着生成式人工智能的快速发展，人工智能生成的图像检测变得越来越重要。然而，基于视觉基础模型（VFM，例如 CLIP）构建的检测器通常很难推广到使用看不见的生成管道创建的图像。我们首次确定了一种关键的故障机制，称为语义回退，其中基于 VFM 的检测器依赖于占主导地位的预训练语义先验（例如身份），而不是分布变化下的伪造特定痕迹。为了解决这个问题，我们提出了 Geometric Semantic Decoupling (GSD)，这是一个无参数模块，它通过利用冻结的 VFM 作为语义指南，并使用可训练的 VFM 作为伪影检测器，从学习的表示中显式地删除语义组件。 GSD 根据批量统计数据估计语义方向，并通过几何约束将它们投影出来，迫使工件检测器依赖于语义不变的取证证据。大量的实验表明，我们的方法始终优于最先进的方法，在跨数据集评估中实现了 94.4% 的视频级 AUC (+1.2%)，提高了对未见操作的鲁棒性（DF40 上的 +3.0%），并推广到一般场景的合成图像的检测，包括 UniversalFakeDetect (+0.9%) 和 GenImage (+1.7%)。
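
The step of projecting semantic directions out of a representation can be sketched with plain vector algebra. This is a simplified illustration under stated assumptions: it removes a single direction, and it takes that direction to be the normalized batch mean of the frozen model's features. The abstract only says directions are estimated from "batch-wise statistics", so the specific statistic and both function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_direction(batch: List[Vector]) -> Vector:
    """Unit vector along the batch mean of the frozen (semantic) features.

    Using the batch mean is an assumption made for this sketch.
    """
    dim = len(batch[0])
    mean = [sum(vec[i] for vec in batch) / len(batch) for i in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    if norm == 0.0:
        raise ValueError("batch mean is zero; no direction to remove")
    return [m / norm for m in mean]


def project_out(feature: Vector, direction: Vector) -> Vector:
    """Remove the component of `feature` along the unit vector `direction`."""
    coeff = sum(f * d for f, d in zip(feature, direction))
    return [f - coeff * d for f, d in zip(feature, direction)]


# Frozen-model features that all point along the first axis (the "semantic" axis).
frozen_batch = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
semantic_dir = mean_direction(frozen_batch)

# A trainable-model feature mixing semantic content with other evidence.
cleaned = project_out([3.0, 1.0, -2.0], semantic_dir)
```

After projection the feature has no component along the estimated semantic direction, so whatever a downstream classifier uses must come from the remaining coordinates. The operation has no learnable weights, which is consistent with the abstract's description of the module as parameter-free.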

Title: ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph

Authors: Junhao Cai, Deyu Zeng, Junhao Pang, Lini Li, Zongze Wu, Xiaopin Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09266
Pdf URL: https://arxiv.org/pdf/2603.09266
Copy Paste: [[2603.09266]] ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph(https://arxiv.org/abs/2603.09266)
Keywords: generation
Abstract: Current text-to-3D generation methods excel in natural scenes but struggle with industrial applications due to two critical limitations: domain adaptation challenges where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies where pairwise consistency constraints fail to capture higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer addressing both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: improved semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
摘要：目前的文本到 3D 生成方法在自然场景中表现出色，但由于两个关键限制而难以适应工业应用：传统 LoRA 融合导致跨类别知识干扰的领域适应挑战，以及成对一致性约束无法捕获精密制造所必需的高阶结构依赖性的几何推理缺陷。我们提出了一个名为 ForgeDreamer 的新颖框架，通过两项关键创新来应对这两个挑战。首先，我们引入了多专家 LoRA 集成机制，将多个特定类别的 LoRA 模型整合为统一的表示，实现卓越的跨类别泛化，同时消除知识干扰。其次，在增强的语义理解的基础上，我们开发了一种跨视图超图几何增强方法，可以同时捕获跨越多个视点的结构依赖性。这些组件协同工作，改进了语义理解，实现更有效的几何推理，而超图建模则确保了制造级的一致性。与最先进的方法相比，对自定义工业数据集的大量实验证明了卓越的语义泛化和增强的几何保真度。我们的代码和数据在附录所附的补充材料中提供，以供审查之用。

Title: From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Authors: Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09283
Pdf URL: https://arxiv.org/pdf/2603.09283
Copy Paste: [[2603.09283]] From Ideal to Real: Stable Video Object Removal under Imperfect Conditions(https://arxiv.org/abs/2603.09283)
Keywords: generation
Abstract: Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, in which Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
摘要：在存在阴影、突然运动和有缺陷的遮罩等现实世界缺陷的情况下，从视频中删除对象仍然很困难。在这些挑战下，现有的基于扩散的视频修复模型通常难以保持时间稳定性和视觉一致性。我们提出了稳定视频对象去除（SVOR），这是一个强大的框架，通过三个关键设计实现无阴影、无闪烁和掩模缺陷容忍去除：（1）用于稳定擦除的掩模联合（MUSE），一种在时间掩模下采样期间应用的窗口联合策略，以保留每个窗口内观察到的所有目标区域，有效处理突然运动并减少丢失的去除； (2) 去噪感知分割（DA-Seg），解耦侧分支上的轻量级分割头，配备去噪感知 AdaLN，并通过掩模降级进行训练，以在不影响内容生成的情况下提供内部扩散感知定位先验； (3) 课程两阶段训练：第一阶段使用在线随机掩模对未配对的真实背景视频进行自监督预训练，以学习现实背景和时间先验，第二阶段使用掩模退化和副作用加权损失对合成对进行细化，联合去除对象及其相关的阴影/反射，同时提高跨域鲁棒性。大量实验表明，SVOR 在多个数据集和降级掩模基准上取得了最先进的结果，将视频对象去除从理想设置推进到现实世界的应用。
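Sketch (illustrative, not from the paper): the windowed-union idea that the abstract attributes to MUSE can be pictured as follows. When masks are downsampled in time, each output mask is the logical OR of every mask in its window rather than a single sampled frame, so a fast-moving object is not dropped. The function name, the window size, and the list-of-lists mask layout are assumptions made for this example only.

```python
from typing import List

Mask = List[List[int]]  # one binary mask: rows of 0/1 values


def union_downsample(masks: List[Mask], window: int) -> List[Mask]:
    """Temporally downsample binary masks by OR-ing each window of frames."""
    if window <= 0:
        raise ValueError("window must be positive")
    out: List[Mask] = []
    for start in range(0, len(masks), window):
        group = masks[start:start + window]
        height, width = len(group[0]), len(group[0][0])
        merged = [[0] * width for _ in range(height)]
        for mask in group:
            for r in range(height):
                for c in range(width):
                    if mask[r][c]:
                        merged[r][c] = 1  # keep any pixel seen in the window
        out.append(merged)
    return out


# Four 1x4 frames in which the object moves one pixel per frame.
frames = [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]], [[0, 0, 0, 1]]]
downsampled = union_downsample(frames, window=2)
strided = frames[::2]  # naive stride-2 sampling, shown for comparison
```

In this toy case the strided masks miss the object positions in frames 2 and 4, while the union masks cover every position observed in each window.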

Title: CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation

Authors: Shengqi Dang, Jiaying Lei, Yi He, Ziqing Qian, Nan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09286
Pdf URL: https://arxiv.org/pdf/2603.09286
Copy Paste: [[2603.09286]] CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation(https://arxiv.org/abs/2603.09286)
Keywords: generation, generative
Abstract: Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.
摘要：除了传达语义信息之外，图像还可以表现出认知属性，从而引发观看者的特定认知过程，例如记忆编码或情绪反应。尽管现代文本到图像模型擅长生成语义连贯的内容，但它们控制图像认知属性（例如效价、可记忆性）的能力仍然有限，往往无法与特定的心理意图保持一致。为了弥补这一差距，我们引入了 CogBlender，这是一个框架，可以在文本到图像的生成过程中对认知属性进行连续和多维的干预。我们的方法建立在认知空间（代表认知属性的空间）和语义流形（代表视觉语义的流形）之间的映射之上。我们定义了一组认知锚点，作为认知空间的边界点。然后，我们通过从不同锚点的速度场进行插值来重新表述流匹配过程中的速度场。因此，生成过程由速度场驱动，并由多维认知评分动态引导，从而实现精确、细粒度和持续的干预。我们在四个代表性认知维度上验证了 CogBlender 的有效性：效价、唤醒度、主导度和图像记忆力。大量的实验证明我们的方法实现了有效的认知干预。我们的工作为认知驱动的创意设计提供了有效的范例。

Title: IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework

Authors: Feiyu Wang, Jiayuan Yang, Zhiyuan Zhao, Da Zhang, Bingyu Li, Peng Liu, Junyu Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09312
Pdf URL: https://arxiv.org/pdf/2603.09312
Copy Paste: [[2603.09312]] IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework(https://arxiv.org/abs/2603.09312)
Keywords: generation
Abstract: Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
摘要：可扩展矢量图形 (SVG) 由于其固有的可扩展性和可编辑性而成为数字设计的核心。尽管视觉语言模型 (VLM) 在内容生成方面取得了显着进步，但现有的文本到 SVG 生成方法仍受到核心挑战的限制：自回归训练过程不包含最终渲染图像的视觉感知，这从根本上限制了生成质量。为了解决这个限制，我们提出了一个内省 SVG 生成框架 (IntroSVG)。该框架的核心是实例化一个统一的 VLM，该 VLM 在闭环中运行，承担生成器和批评者的双重角色。具体来说，通过监督微调 (SFT)，模型学习起草 SVG 并提供有关其渲染输出的反馈；此外，我们系统地将早期故障转化为高质量的纠错训练数据，从而增强模型的鲁棒性。随后，我们利用大容量教师 VLM 构建偏好数据集，并通过直接偏好优化 (DPO) 进一步调整生成器的策略。在推理过程中，优化的生成器和批评者在迭代的“生成-审查-细化”循环中协作运行，从不完美的中间草稿开始，自主提高输出质量。实验结果表明，我们的方法在几个关键评估指标上实现了最先进的性能，生成了具有更复杂结构、更强语义对齐和更高可编辑性的 SVG。这些结果证实了将显式视觉反馈纳入生成循环的有效性。

Title: Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework

Authors: Engin Deniz Erkan, Elif Surer, Ulas Yaman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09353
Pdf URL: https://arxiv.org/pdf/2603.09353
Copy Paste: [[2603.09353]] Interactive 3D visualization of surface roughness predictions in additive manufacturing: A data-driven framework(https://arxiv.org/abs/2603.09353)
Keywords: generative
Abstract: Surface roughness in Material Extrusion Additive Manufacturing varies across a part and is difficult to anticipate during process planning because it depends on both printing parameters and local surface inclination, which governs the staircase effect. A data-driven framework is presented to predict the arithmetic mean roughness (Ra) prior to fabrication using process parameters and surface angle. A structured experimental dataset was created using a three-level Box-Behnken design: 87 specimens were printed, each with multiple planar faces spanning different inclination angles, yielding 1566 Ra measurements acquired with a contact profilometer. A multilayer perceptron regressor was trained to capture nonlinear relationships between manufacturing conditions, inclination, and Ra. To mitigate limited experimental data, a conditional generative adversarial network was used to generate additional condition-specific tabular samples, thereby improving predictive performance. Model performance was assessed on a hold-out test set. A web-based decision-support interface was also developed to enable interactive process planning by loading a 3D model, specifying printing parameters, and adjusting the part's orientation. The system computes face-wise inclination from the model geometry and visualizes predicted Ra as an interactive colormap over the surface, enabling rapid identification of regions prone to high roughness and immediate comparison of parameter and orientation choices.
摘要：材料挤出增材制造中的表面粗糙度因零件而异，并且在工艺规划期间很难预测，因为它取决于打印参数和局部表面倾斜度，而局部表面倾斜度控制着阶梯效应。提出了一个数据驱动框架，用于在制造之前使用工艺参数和表面角度预测算术平均粗糙度 (Ra)。使用三级 Box-Behnken 设计创建了结构化实验数据集：打印了 87 个样本，每个样本都有跨越不同倾角的多个平面，产生使用接触式轮廓仪获取的 1566 个 Ra 测量值。训练多层感知器回归器来捕获制造条件、倾角和 Ra 之间的非线性关系。为了减轻有限的实验数据，使用条件生成对抗网络来生成额外的特定条件的表格样本，从而提高预测性能。模型性能在保留测试集上进行评估。还开发了基于 Web 的决策支持界面，通过加载 3D 模型、指定打印参数和调整零件方向来实现交互式流程规划。该系统根据模型几何形状计算面向倾斜度，并将预测的 Ra 可视化为表面上的交互式色彩图，从而能够快速识别容易出现高粗糙度的区域，并立即比较参数和方向选择。
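Sketch (illustrative, not from the paper): the abstract says the interface computes face-wise inclination from the model geometry. One common way to do this for a triangulated model is to take the angle between each triangle's normal and the build direction. The function name, the choice of +z as build direction, and the convention of measuring the angle from the build axis (rather than from the build plate) are assumptions made here.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def face_inclination_deg(a: Vec3, b: Vec3, c: Vec3,
                         build_dir: Vec3 = (0.0, 0.0, 1.0)) -> float:
    """Angle in degrees between a triangle's normal and the build direction."""
    u = (b[0] - a[0], b[1] - a[1], b[2] - a[2])
    v = (c[0] - a[0], c[1] - a[1], c[2] - a[2])
    # Normal via the cross product u x v (orientation follows vertex order).
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    n_len = math.sqrt(sum(x * x for x in n))
    d_len = math.sqrt(sum(x * x for x in build_dir))
    if n_len == 0.0 or d_len == 0.0:
        raise ValueError("degenerate triangle or zero build direction")
    cos_angle = sum(x * y for x, y in zip(n, build_dir)) / (n_len * d_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


flat = face_inclination_deg((0, 0, 0), (1, 0, 0), (0, 1, 0))  # upward-facing face
wall = face_inclination_deg((0, 0, 0), (1, 0, 0), (0, 0, 1))  # vertical face
```

A per-face angle of this kind, together with the printing parameters, would be the input to a roughness regressor such as the one the abstract describes.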

Title: Reviving ConvNeXt for Efficient Convolutional Diffusion Models

Authors: Taesung Kwon, Lorenzo Bianchi, Lennart Wittke, Felix Watine, Fabio Carrara, Jong Chul Ye, Romann Weber, Vinicius Azevedo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.09408
Pdf URL: https://arxiv.org/pdf/2603.09408
Copy Paste: [[2603.09408]] Reviving ConvNeXt for Efficient Convolutional Diffusion Models(https://arxiv.org/abs/2603.09408)
Keywords: generative
Abstract: Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7$\times$ and 7.5$\times$ fewer training steps at 256$\times$256 and 512$\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
摘要：最近的扩散模型越来越有利于 Transformer 主干，这是受到完全注意力架构卓越的可扩展性的推动。然而，局部性偏差、参数效率和硬件友好性（将卷积网络确立为高效视觉骨干的属性）在现代生成模型中的探索有限。在这里，我们介绍完全卷积扩散模型（FCDM），该模型具有与 ConvNeXt 类似的主干，但专为条件扩散建模而设计。我们发现，仅使用 DiT-XL/2 的 50% 的 FLOP，FCDM-XL 在 256$\times$256 和 512$\times$512 分辨率下分别减少了 7$\times$ 和 7.5$\times$ 的训练步骤，从而实现了有竞争力的性能。值得注意的是，FCDM-XL 可以在 4-GPU 系统上进行训练，凸显了我们架构卓越的训练效率。我们的结果表明，现代卷积设计为缩放扩散模型提供了一种有竞争力且高效的替代方案，使 ConvNeXt 重新成为高效生成建模的简单而强大的构建块。

Title: Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

Authors: Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09480
Pdf URL: https://arxiv.org/pdf/2603.09480
Copy Paste: [[2603.09480]] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity(https://arxiv.org/abs/2603.09480)
Keywords: generation
Abstract: Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at this https URL.
摘要：视觉语言模型 (VLM) 面临着由于视觉标记生成过多而导致的计算效率显着低下的问题。虽然之前的工作表明很大一部分视觉标记是多余的，但现有的压缩方法很难平衡重要性保存和信息多样性。为了解决这个问题，我们提出了 PruneSID，一种免训练的协同重要性多样性方法，具有两阶段管道：（1）主语义成分分析（PSCA），用于将标记聚类成语义一致的组，确保全面的概念覆盖；（2）组内非极大值抑制（NMS），用于修剪冗余标记，同时保留每个组内的关键代表性标记。此外，PruneSID还采用了信息感知的动态压缩比机制，可根据图像复杂度优化令牌压缩率，从而能够在不同场景中更有效地保存平均信息。大量实验证明了最先进的性能，在 LLaVA-1.5 上实现了 96.3% 的准确率，仅保留了 11.1% 的令牌，在 LLaVA-NeXT 上以极端压缩率 (5.6%) 实现了 92.8% 的准确率，比之前的方法高出 2.5%，预填充速度比原始模型快 7.8 倍。我们的框架泛化于不同的 VLM 以及图像和视频模式，展示了强大的跨模式多功能性。代码可在此 https URL 处获取。
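Sketch (illustrative, not from the paper): the abstract names intra-group Non-Maximum Suppression as the pruning step but does not give its details. A generic NMS over tokens could look like the code below: visit tokens in order of decreasing importance and keep a token only if it is not too similar to any token already kept. The importance scores, the use of cosine similarity, and the 0.9 threshold are all assumptions of this example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nms_tokens(tokens: List[Sequence[float]], scores: Sequence[float],
               sim_threshold: float = 0.9) -> List[int]:
    """Return indices of kept tokens, highest score first."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Suppress token i if it is near-duplicate of a higher-scored kept token.
        if all(cosine(tokens[i], tokens[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept


# One hypothetical group: tokens 0 and 1 are near-duplicates, token 2 is distinct.
group = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = nms_tokens(group, scores=[0.9, 0.8, 0.3])
```

Here the lower-scored near-duplicate (index 1) is suppressed while the distinct token (index 2) survives despite its low score, which is the importance-versus-diversity balance the abstract refers to.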

Title: Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

Authors: Ali Zia, Muhammad Umer Ramzan, Usman Ali, Muhammad Faheem, Abdelwahed Khamis, Shahnawaz Qureshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09484
Pdf URL: https://arxiv.org/pdf/2603.09484
Copy Paste: [[2603.09484]] Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion(https://arxiv.org/abs/2603.09484)
Keywords: restoration, generation
Abstract: Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
摘要：将手绘草图转化为逼真的图像仍然是图像合成中的一个基本挑战，特别是由于草图的抽象、稀疏和风格多样化的性质。现有的方法，包括基于 GAN 和基于扩散的模型，通常很难重建细粒度的细节、保持空间对齐或适应不同的草图域。在本文中，我们提出了一种用于从草图到图像生成的组件感知、自我完善框架，通过新颖的两阶段架构解决这些挑战。基于自注意力的自动编码器网络（SA2N）首先从组件草图区域捕获局部语义和结构特征，而坐标保留门控融合（CGF）模块将这些特征集成到连贯的空间布局中。最后，基于修改后的 StyleGAN2 主干构建的空间自适应细化修正器 (SARR)，通过空间上下文引导的迭代细化增强了真实性和一致性。在面部（CelebAMask-HQ、CUFSF）和非面部（Sketchy、ChairsV2、ShoesV2）数据集上进行的广泛实验证明了我们方法的稳健性和通用性。所提出的框架始终优于最先进的 GAN 和扩散模型，在图像保真度、语义准确性和感知质量方面取得了显着的进步。在 CelebAMask-HQ 上，我们的模型比之前的方法提高了 21% (FID)、58% (IS)、41% (KID) 和 20% (SSIM)。这些结果以及跨不同领域的更高效率和视觉一致性，使我们的方法成为取证、数字艺术修复和一般基于草图的图像合成应用的有力候选者。

Title: Streaming Autoregressive Video Generation via Diagonal Distillation

Authors: Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09488
Pdf URL: https://arxiv.org/pdf/2603.09488
Copy Paste: [[2603.09488]] Streaming Autoregressive Video Generation via Diagonal Distillation(https://arxiv.org/abs/2603.09488)
Keywords: generation
Abstract: Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
摘要：大型预训练扩散模型显着提高了生成视频的质量，但它们在实时流媒体中的使用仍然有限。自回归模型为顺序帧合成提供了一个自然的框架，但需要大量计算才能实现高保真度。扩散蒸馏可以将这些模型压缩为有效的几个步骤变体，但现有的视频蒸馏方法很大程度上采用了忽略时间依赖性的图像特定方法。这些技术通常在图像生成方面表现出色，但在视频合成方面表现不佳，表现出运动一致性降低、长序列上的错误累积以及延迟质量权衡。我们确定了导致这些限制的两个因素：在步骤减少期间对时间上下文的利用不足以及在下一个块预测中对后续噪声水平的隐式预测（即曝光偏差）。为了解决这些问题，我们提出了对角蒸馏，它与现有方法正交，可以更好地利用视频块和去噪步骤中的时间信息。我们方法的核心是不对称生成策略：早期步骤较多，后期步骤较少。这种设计允许后面的块从彻底处理的早期块继承丰富的外观信息，同时使用部分去噪的块作为后续合成的条件输入。通过将块生成期间后续噪声水平的隐式预测与实际推理条件对齐，我们的方法减轻了错误传播并减少了长程序列中的过饱和。我们进一步结合隐式光流建模，以在严格的步长约束下保持运动质量。我们的方法在 2.61 秒内生成一个 5 秒的视频（高达 31 FPS），比未蒸馏的模型实现了 277.3 倍的加速。

Title: Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Authors: Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09512
Pdf URL: https://arxiv.org/pdf/2603.09512
Copy Paste: [[2603.09512]] Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning(https://arxiv.org/abs/2603.09512)
Keywords: generation
Abstract: A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
摘要：可靠的驾驶助手应该根据观察到的信息得出的基于时间的推理提供一致的响应。在这项工作中，我们研究了视觉语言模型（VLM）在用作驾驶助手时是否能够做出一致的响应并理解当前的观察如何影响未来的结果，或者它们的输出是否仅仅反映了训练期间记忆的模式而没有基于时间的推理。虽然最近的努力已将 VLM 集成到自动驾驶中，但之前的研究通常强调场景理解和指令生成，隐含地假设强大的视觉解释自然能够实现一致的未来推理，从而确保可靠的决策，我们对此进行了严格的审查。我们重点关注在这种情况下限制 VLM 可靠性的两个主要挑战：响应不一致，即较小的输入扰动会产生不同的答案，或者在某些情况下，响应会退化为近乎随机的猜测；以及有限的时间推理，即模型无法根据当前观察来推理和对齐顺序事件，通常会导致不正确甚至矛盾的响应。此外，我们发现具有较强视觉理解能力的模型不一定在需要时间推理的任务上表现最好，这表明倾向于过度依赖预训练模式而不是对时间动态进行建模。为了解决这些问题，我们采用现有的评估方法并引入FutureVQA，这是一个专门用于评估未来场景推理的人工注释基准数据集。此外，我们提出了一种简单而有效的自我监督调整方法，具有思想链推理，可以在不需要时间标签的情况下提高一致性和时间推理。

Title: Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Authors: Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09527
Pdf URL: https://arxiv.org/pdf/2603.09527
Copy Paste: [[2603.09527]] Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation(https://arxiv.org/abs/2603.09527)
Keywords: generation
Abstract: Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter- and data-efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that utilizes the fine-tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at this https URL.
摘要：推测性解码可加速 LLM 推理，但当目标模型针对特定领域进行微调时，性能会下降。一个简单的解决方案是为每个目标模型重新训练草稿模型，这是昂贵且低效的。为了解决这个问题，我们引入了一个名为 Efficient Draft Adaptation（简称 EDA）的参数和数据高效框架，用于有效地适应草稿模型。 EDA引入了三项创新：（1）解耦架构，利用共享和私有组件分别对共享和特定目标输出分布进行建模，通过仅更新轻量级私有组件来实现参数高效的自适应；（2）数据再生策略，利用微调的目标模型重新生成训练数据，从而改善训练和推测解码之间的一致性，从而获得更高的平均接受长度；（3）优先考虑高价值数据以实现高效自适应的样本选择机制。我们的实验表明，与完全再训练相比，EDA 有效地恢复了微调模型的推测性能，实现了优异的平均接受长度，同时显着降低了训练成本。代码可从此 https URL 获取。
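Sketch (illustrative, not from the paper): the abstract reports results in terms of average acceptance length. Under a simplified greedy verification rule, the accepted length of one speculation round is the number of leading draft tokens that match what the target model would have produced. The function names and the token sequences are hypothetical; sampling-based verification accepts tokens probabilistically instead, and some reports also count the extra token the target model contributes in each round.

```python
from typing import List, Sequence, Tuple


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading draft tokens that agree with the target's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch ends the accepted prefix
        n += 1
    return n


def average_acceptance_length(
        rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Mean accepted prefix length over a list of (draft, target) rounds."""
    if not rounds:
        raise ValueError("need at least one round")
    return sum(accepted_prefix_len(d, t) for d, t in rounds) / len(rounds)


rounds = [
    ([1, 2, 3, 4], [1, 2, 9, 4]),  # mismatch at position 2: 2 accepted
    ([5, 6], [5, 6]),              # full match: 2 accepted
    ([7, 8, 9], [0, 8, 9]),        # mismatch at position 0: 0 accepted
]
avg_len = average_acceptance_length(rounds)
```

A draft model that drifts from a fine-tuned target mismatches earlier, which lowers this average and with it the speedup from speculation.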

Title: Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09538
Pdf URL: https://arxiv.org/pdf/2603.09538
Copy Paste: [[2603.09538]] Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization(https://arxiv.org/abs/2603.09538)
Keywords: generation
Abstract: Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
摘要：统一的视觉语言模型在多模态理解和生成方面取得了重大进展，但在产生多模态交错输出方面却存在很大差距，而多模态交错输出对于视觉讲故事和逐步视觉推理等任务来说是至关重要的能力。在这项工作中，我们提出了一种基于强化学习的后训练策略，以在现有统一模型中解锁此功能，而不依赖于大规模多模态交错数据集。我们从预热阶段开始，使用混合数据集，该数据集包含精心策划的交错序列和用于多模态理解和文本到图像生成的有限数据，这使模型暴露于交错生成模式，同时保留其预训练的功能。为了进一步细化交错生成，我们提出了一个统一的策略优化框架，将组相对策略优化（GRPO）扩展到多模式设置。我们的方法在单个解码轨迹内联合建模文本和图像生成，并通过我们新颖的混合奖励（涵盖文本相关性、视觉文本对齐和结构保真度）对其进行优化。此外，我们还结合流程级奖励来提供逐步指导，从而提高复杂多模式任务的培训效率。 MMIE 和 InterleavedBench 上的实验表明，我们的方法显着提高了多模态交错生成的质量和连贯性。

Title: Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

Authors: Cosmo Santoni
Subjects: cs.LG, cs.AI, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2603.09555
Pdf URL: https://arxiv.org/pdf/2603.09555
Copy Paste: [[2603.09555]] Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference(https://arxiv.org/abs/2603.09555)
Keywords: generation
Abstract: State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2's state space duality algorithm -- diagonal state structure, chunkable recurrence, and einsum-dominated compute with static control flow -- maps cleanly onto what XLA's fusion and tiling passes actually optimise, making custom kernels optional rather than required. We implement the full inference path (prefill, cached autoregressive decoding) as shaped standard primitives under XLA, without hand-written kernels, and realise the architecture's theoretical $O(1)$ state management as a compiled on-device cache requiring no host synchronisation during generation. The implementation runs unmodified on CPU, NVIDIA GPU, and Google Cloud TPU from a single JAX source. On TPU v6e across five model scales (130M--2.7B parameters), XLA-generated code reaches approximately 140 TFLOPS on single-stream prefill (15% MFU) and up to 64% bandwidth utilisation on decode. Greedy decoding matches the PyTorch/CUDA reference token-for-token across 64 steps, with hidden-state agreement within float32 rounding tolerance. The pattern transfers to any SSM recurrence satisfying the same structural conditions, on any platform with a mature XLA backend. The implementation is publicly available at this https URL and merged into the Bonsai JAX model library.
摘要：状态空间模型版本通常与融合的 CUDA 和 Triton 内核耦合，继承了对 NVIDIA 硬件的硬依赖。我们展示了 Mamba-2 的状态空间对偶算法（对角状态结构、可分块递归以及具有静态控制流的 einsum 主导计算）清楚地映射到 XLA 的融合和平铺过程实际优化的内容，使自定义内核成为可选而不是必需的。我们将完整的推理路径（预填充、缓存自回归解码）实现为 XLA 下的成形标准原语，无需手写内核，并将架构的理论 $O(1)$ 状态管理实现为已编译的设备上缓存，在生成过程中无需主机同步。该实现无需修改即可从单个 JAX 源在 CPU、NVIDIA GPU 和 Google Cloud TPU 上运行。在跨五个模型规模（130M--2.7B 参数）的 TPU v6e 上，XLA 生成的代码在单流预填充上达到约 140 TFLOPS（15% MFU），解码时带宽利用率高达 64%。贪婪解码在 64 个步骤中逐个匹配 PyTorch/CUDA 参考令牌，并在 float32 舍入容差范围内实现隐藏状态一致性。该模式可转移到具有成熟 XLA 后端的任何平台上满足相同结构条件的任何 SSM 递归。该实现可通过此 https URL 公开获得，并合并到 Bonsai JAX 模型库中。
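Sketch (illustrative, not from the paper): the $O(1)$ caching claim rests on the fact that a state-space recurrence carries a fixed-size state from one token to the next, so the decode cache does not grow with sequence length. The pure-Python toy below shows that property for a recurrence of the form h <- a*h + b*x, y = <c, h> with a scalar decay a. It is not the paper's JAX/XLA implementation; the function and parameter names are hypothetical, and the parameters are held constant here for brevity, whereas selective SSMs make them depend on the input.

```python
from typing import List, Tuple


def decode_step(state: List[float], a: float, b: List[float],
                c: List[float], x: float) -> Tuple[List[float], float]:
    """One recurrent step: h <- a*h + b*x, y = <c, h>. State size is fixed."""
    new_state = [a * h + bi * x for h, bi in zip(state, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y


state = [0.0, 0.0]  # the whole "cache": two floats, regardless of length
outputs = []
for x in [1.0, 2.0, 3.0]:
    state, y = decode_step(state, a=0.5, b=[1.0, 2.0], c=[1.0, 1.0], x=x)
    outputs.append(y)
```

Each step reads and overwrites the same two-element state, in contrast to an attention key-value cache that gains an entry per generated token.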

Title: ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis

Authors: KunHo Heo, SuYeon Kim, Yonghyun Gwon, Youngbin Kim, MyeongAh Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09611
Pdf URL: https://arxiv.org/pdf/2603.09611
Copy Paste: [[2603.09611]] ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis(https://arxiv.org/abs/2603.09611)
Keywords: generation
Abstract: Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
摘要：文本到动作合成旨在从文本描述生成自然且富有表现力的人类动作。虽然现有的方法主要侧重于从文本描述生成整体动作，但它们很难准确反映涉及特定身体部位的动作。最近的部分运动生成方法试图解决这个问题，但面临两个关键限制：（i）它们缺乏将文本语义与各个身体部位对齐的明确机制，以及（ii）由于集成独立生成的部分运动，它们经常生成不连贯的全身运动。为了克服这些问题并解决现有方法中的基本权衡问题，我们提出了 ParTY，这是一种新颖的框架，可以增强局部表现力，同时生成连贯的全身运动。 ParTY包括：（1）Part-Guided Network，首先生成部分运动以获得部分引导，然后用它生成整体运动； (2) 部分感知文本基础，它可以对文本嵌入进行不同的转换，并将其与每个身体部位适当地对齐； (3)整体-部分融合，自适应地融合整体运动和部分运动。广泛的实验，包括部分级和连贯性级评估，表明 ParTY 比以前的方法取得了实质性改进。

Title: Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution

Authors: Shuting Liu, Lei Zhang, Wei Huang, Zhao Zhang, Zizhou Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09621
Pdf URL: https://arxiv.org/pdf/2603.09621
Copy Paste: [[2603.09621]] Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution(https://arxiv.org/abs/2603.09621)
Keywords: super-resolution
Abstract: High-resolution Magnetic Resonance Imaging (MRI) is vital for clinical diagnosis but limited by long acquisition times and motion artifacts. Super-resolution (SR) reconstructs low-resolution scans into high-resolution images, yet existing methods are mutually constrained: paired-data methods achieve efficiency only by relying on costly aligned datasets, while implicit neural representation approaches avoid such data needs at the expense of heavy computation. We propose a zero-shot MRI SR framework using explicit Gaussian representation to balance data requirements and efficiency. MRI-tailored Gaussian parameters embed tissue physical properties, reducing learnable parameters while preserving MR signal fidelity. A physics-grounded volume rendering strategy models MRI signal formation via normalized Gaussian aggregation. Additionally, a brick-based order-independent rasterization scheme enables highly parallel 3D computation, lowering training and inference costs. Experiments on two public MRI datasets show superior reconstruction quality and efficiency, demonstrating the method's potential for clinical MRI SR.
摘要：高分辨率磁共振成像 (MRI) 对于临床诊断至关重要，但受到采集时间长和运动伪影的限制。超分辨率（SR）将低分辨率扫描重建为高分辨率图像，但现有方法相互制约：配对数据方法只能通过依赖昂贵的对齐数据集来实现效率，而隐式神经表示方法则以大量计算为代价来避免此类数据需求。我们提出了一种使用显式高斯表示的零样本 MRI SR 框架来平衡数据需求和效率。 MRI 定制的高斯参数嵌入组织物理特性，减少可学习参数，同时保持 MR 信号保真度。基于物理的体绘制策略通过归一化高斯聚合对 MRI 信号形成进行建模。此外，基于砖的与顺序无关的光栅化方案可实现高度并行的 3D 计算，从而降低训练和推理成本。对两个公共 MRI 数据集的实验显示出卓越的重建质量和效率，证明了该方法在临床 MRI SR 中的潜力。
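
Illustrative sketch (not from the paper): the abstract above mentions "normalized Gaussian aggregation" but does not give the formula. The snippet below shows one plausible reading in 1D, where the signal at a query point is a weight-normalized sum of per-Gaussian intensities. The function name, the 1D setting, and the (mean, sigma, intensity) tuple layout are all assumptions made for illustration.

```python
import math


def aggregate_signal(x, gaussians, eps=1e-8):
    """Weight-normalized aggregation at query point x.

    gaussians: iterable of (mean, sigma, intensity) tuples (hypothetical layout).
    Each Gaussian contributes weight w = exp(-(x - mean)^2 / (2 * sigma^2)),
    and the result is sum(w * intensity) / sum(w).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for mean, sigma, intensity in gaussians:
        w = math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))
        weighted_sum += w * intensity
        weight_total += w
    # eps guards against division by zero far away from every Gaussian.
    return weighted_sum / (weight_total + eps)
```

Because the weights are normalized, a point midway between two equally wide Gaussians receives the average of their intensities regardless of how far away they are.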

Title: Decoder-Free Distillation for Quantized Image Restoration

Authors: S. M. A. Sharif, Abdur Rehman, Seongwan Kim, Jaeho Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09624
Pdf URL: https://arxiv.org/pdf/2603.09624
Copy Paste: [[2603.09624]] Decoder-Free Distillation for Quantized Image Restoration(https://arxiv.org/abs/2603.09624)
Keywords: restoration
Abstract: Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization "tug-of-war" between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP.
摘要：量化感知训练 (QAT) 与知识蒸馏 (KD) 相结合，为压缩边缘部署模型带来了巨大希望。然而，用于从退化图像中恢复视觉质量的精确敏感图像恢复 (IR) 联合优化在很大程度上仍未得到充分探索。直接将 QAT-KD 应用于低级视觉揭示了三个关键瓶颈：师生能力不匹配、解码器蒸馏过程中的空间误差放大以及量化噪声引起的重建和蒸馏损失之间的优化“拉锯战”。为了解决这些问题，我们引入了量化感知蒸馏恢复 (QDR)，这是一种边缘部署 IR 的框架。 QDR通过FP32自蒸馏消除容量不匹配，并通过无解码器蒸馏（DFD）防止误差放大，严格纠正网络瓶颈处的量化误差。为了稳定优化拉锯战，我们提出了一种可学习幅度重新加权（LMR），可以动态平衡竞争梯度。最后，我们设计了一个边缘友好模型（EFM），具有轻量级可学习退化门控（LDG）来动态调节空间退化定位。跨四个 IR 任务的大量实验表明，我们的 Int8 模型恢复了 FP32 性能的 96.5%，在 NVIDIA Jetson Orin 上实现了 442 帧每秒 (FPS)，并将下游对象检测提高了 16.3 mAP
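
Illustrative sketch (not from the paper): the abstract above refers to Int8 quantization and the noise it introduces, without specifying the quantizer. The snippet below shows generic symmetric per-tensor int8 "fake quantization" (quantize, then dequantize) so the rounding error is visible. The function name and the per-tensor symmetric scheme are assumptions; QDR's actual quantizer may differ.

```python
def fake_quantize_int8(values):
    """Quantize floats to symmetric int8 levels and map them back to floats.

    Assumes one shared scale for the whole list (per-tensor quantization).
    """
    if not values:
        return []
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 127.0  # one int8 step, expressed in float units
    dequantized = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # integer level in [-127, 127]
        dequantized.append(q * scale)
    return dequantized
```

Each output differs from its input by at most about half a quantization step (`scale / 2`), which is the kind of perturbation a QAT setup has to absorb.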

Title: Grounding Synthetic Data Generation With Vision and Language Models

Authors: Ümit Mert Çağlar, Alptekin Temizel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09625
Pdf URL: https://arxiv.org/pdf/2603.09625
Copy Paste: [[2603.09625]] Grounding Synthetic Data Generation With Vision and Language Models(https://arxiv.org/abs/2603.09625)
Keywords: generation, generative
Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at this http URL and the code base at this http URL.
摘要：深度学习模型受益于数据多样性和数量的增加，激励合成数据增强以改进现有数据集。然而，现有的合成数据评估指标通常会计算潜在特征相似性，这很难解释，并且并不总是与对下游任务的贡献相关。我们提出了一个基于视觉语言的框架，用于遥感中可解释的合成数据增强和评估。我们的方法将生成模型、语义分割和图像字幕与视觉和语言模型结合起来。基于这个框架，我们引入了 ARAS400k：一个大规模遥感数据集，通过用于分割和字幕的合成数据增强，包含 100k 真实图像和 300k 合成图像，每个图像都配有分割图和描述。 ARAS400k 通过分析语义构成、最大限度地减少标题冗余以及验证视觉结构和语言描述之间的跨模式一致性，实现合成数据的自动评估。实验结果表明，虽然专门在合成数据上训练的模型达到了有竞争力的性能水平，但使用增强数据（真实图像和合成图像的组合）训练的模型始终优于真实数据基线。因此，这项工作为遥感任务建立了可扩展的基准，特别是在语义分割和图像字幕方面。数据集可从此 http URL 获取，代码库可在此 http URL 获取。

Title: X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models

Authors: Yueen Ma, Irwin King
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.09632
Pdf URL: https://arxiv.org/pdf/2603.09632
Copy Paste: [[2603.09632]] X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models(https://arxiv.org/abs/2603.09632)
Keywords: generation
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
摘要：3D 高斯分布 (3DGS) 已成为新颖视图合成的强大技术，随后扩展到众多空间人工智能应用中。然而，大多数现有的 3DGS 方法都是孤立的，专注于特定领域，例如在线 SLAM、语义丰富或未摆出图像的 3DGS。在本文中，我们介绍了 X-GS，这是一个可扩展的开放框架，它统一了广泛的技术，以实现基于 3DGS 的实时在线 SLAM，并丰富了语义，从而弥补了与下游多模态模型的差距。 X-GS 的核心是一个名为 X-GS-Perceiver 的高效管道，能够将未姿态的 RGB（或可选 RGB-D）视频流作为输入来共同优化几何和姿态，并将高维语义特征从视觉基础模型提取到 3D 高斯。我们通过新颖的在线矢量量化 (VQ) 模块、GPU 加速的网格采样方案和高度并行化的管道设计来实现实时性能。然后，X-GS-Thinker 组件中的视觉语言模型可以利用语义 3D 高斯函数，从而实现目标检测、零镜头字幕生成和潜在的具体化任务等下游任务。真实数据集的实验结果展示了 X-GS 框架的功效、效率和新解锁的多模式功能。

Title: Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs

Authors: Ali Sadeghkhani, A. Assadi, B. Bennett, A. Rabbani
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2603.09651
Pdf URL: https://arxiv.org/pdf/2603.09651
Copy Paste: [[2603.09651]] Well Log-Guided Synthesis of Subsurface Images from Sparse Petrography Data Using cGANs(https://arxiv.org/abs/2603.09651)
Keywords: generative
Abstract: Pore-scale imaging of subsurface formations is costly and limited to discrete depths, creating significant gaps in reservoir characterization. To address this, we present a conditional Generative Adversarial Network (cGAN) framework for synthesizing realistic thin section images of carbonate rock formations, conditioned on porosity values derived from well logs. Trained on 5,000 sub-images extracted from 15 petrography samples over a depth interval of 1992-2000 m, the model generates geologically consistent images across a wide porosity range (0.004-0.745), achieving 81% accuracy within a 10% margin of target porosity values. The successful integration of well log data with the trained generator enables continuous pore-scale visualization along the wellbore, bridging gaps between discrete core sampling points and providing valuable insights for reservoir characterization and energy transition applications such as carbon capture and underground hydrogen storage.
摘要：地下地层的孔隙尺度成像成本高昂，并且仅限于离散深度，从而在储层表征中造成巨大差距。为了解决这个问题，我们提出了一个条件生成对抗网络（cGAN）框架，用于合成碳酸盐岩地层的真实薄片图像，以测井得出的孔隙度值为条件。该模型使用从 1992-2000m 深度间隔的 15 个岩相学样本中提取的 5,000 个子图像进行训练，该模型在较宽的孔隙度范围 (0.004-0.745) 内生成地质一致的图像，在目标孔隙度值的 10% 裕度内实现 81% 的精度。测井数据与训练有素的生成器的成功集成可以实现沿井眼的连续孔隙尺度可视化，弥合离散岩心采样点之间的差距，并为储层表征和能源转换应用（例如碳捕获和地下储氢）提供有价值的见解。
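
Illustrative sketch (not from the paper): the abstract above reports "81% accuracy within a 10% margin of target porosity values". The snippet below shows one way such a figure could be computed, assuming the margin is relative to each target value (the abstract does not say whether it is relative or absolute). The function name and the sample numbers are hypothetical.

```python
def accuracy_within_margin(targets, measured, margin=0.10):
    """Fraction of samples whose measured porosity lies within a relative
    margin of the target porosity (assumption: margin is relative)."""
    if not targets or len(targets) != len(measured):
        raise ValueError("targets and measured must be non-empty and equal length")
    hits = sum(
        1 for t, m in zip(targets, measured) if abs(m - t) <= margin * abs(t)
    )
    return hits / len(targets)


# Hypothetical target vs. measured porosities for four generated images.
example = accuracy_within_margin([0.10, 0.20, 0.40, 0.50], [0.105, 0.25, 0.41, 0.44])
```

With the hypothetical values above, two of the four images fall inside the margin.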

Title: When to Lock Attention: Training-Free KV Control in Video Diffusion

Authors: Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang
Subjects: cs.CV, cs.AI, cs.ET, eess.IV
Abstract URL: https://arxiv.org/abs/2603.09657
Pdf URL: https://arxiv.org/pdf/2603.09657
Copy Paste: [[2603.09657]] When to Lock Attention: Training-Free KV Control in Video Diffusion(https://arxiv.org/abs/2603.09657)
Keywords: generation
Abstract: Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
摘要：在增强前景质量的同时保持背景一致性仍然是视频编辑的核心挑战。注入全图像信息通常会导致背景伪影，而严格的背景锁定严重限制了模型的前景生成能力。为了解决这个问题，我们提出了 KV-Lock，这是一种为基于 DiT 的视频扩散模型量身定制的免训练框架。我们的核心见解是幻觉度量（去噪预测的方差）直接量化世代多样性，这与无分类器指导（CFG）规模有着内在的联系。在此基础上，KV-Lock 利用扩散幻觉检测来动态调度两个关键组件：缓存的背景键值 (KV) 和新生成的 KV 之间的融合比率，以及 CFG 规模。当检测到幻觉风险时，KV-Lock 会加强背景 KV 锁定，同时放大前景生成的条件指导，从而减少伪影并提高生成保真度。作为免训练、即插即用的模块，KV-Lock 可以轻松集成到任何预先训练的基于 DiT 的模型中。大量的实验验证了我们的方法在提高前景质量和跨各种视频编辑任务中的高背景保真度方面优于现有方法。
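
Illustrative sketch (not from the paper): the abstract above describes using a hallucination metric (variance of the denoising prediction) to schedule both the fusion ratio between cached and new KVs and the CFG scale. The snippet below shows that control flow with a simple linear mapping. The linear schedule, the thresholds, the CFG range, and all function names are assumptions; the abstract does not state the actual scheduling rule.

```python
from statistics import pvariance


def schedule(predictions, var_lo=0.0, var_hi=1.0, cfg_min=3.0, cfg_max=9.0):
    """Map prediction variance to (lock_ratio, cfg_scale).

    Higher variance -> stronger background locking and stronger guidance.
    All thresholds and ranges are hypothetical defaults.
    """
    score = pvariance(predictions)  # hallucination proxy
    t = (score - var_lo) / (var_hi - var_lo)
    t = max(0.0, min(1.0, t))  # clamp to [0, 1]
    lock_ratio = t
    cfg_scale = cfg_min + t * (cfg_max - cfg_min)
    return lock_ratio, cfg_scale


def fuse_kv(cached_kv, new_kv, lock_ratio):
    """Blend cached background KVs with newly generated KVs elementwise."""
    return [lock_ratio * c + (1.0 - lock_ratio) * n for c, n in zip(cached_kv, new_kv)]
```

A lock ratio of 1.0 reuses the cached background KVs unchanged, while 0.0 keeps only the newly generated ones.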

Title: ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Authors: Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.09692
Pdf URL: https://arxiv.org/pdf/2603.09692
Copy Paste: [[2603.09692]] ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning(https://arxiv.org/abs/2603.09692)
Keywords: generation
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at this https URL and our preference datasets at this https URL.
摘要：人类反馈强化学习 (RLHF) 已成为调整大型语言模型 (LLM) 的标准，但其功效受到获取偏好数据成本高昂的瓶颈，尤其是在资源匮乏和专家领域。为了解决这个问题，我们引入了 ACTIVEULTRAFEEDBACK，这是一种模块化的主动学习管道，它利用不确定性估计来动态识别最具信息性的注释响应。我们的流程有助于对标准响应选择方法以及双反向汤普森采样 (DRTS) 和 DELTAUCB 进行系统评估，这两种新颖的方法优先考虑具有较大预测质量差距的响应对，利用最近的结果表明此类对为微调提供了良好的信号。我们的实验表明，ACTIVEULTRAFEEDBACK 产生高质量的数据集，从而显着提高下游性能，特别是仅用相对于静态基线六分之一的注释数据即可实现可比较或更好的结果。我们的管道可通过此 https URL 获取，我们的偏好数据集可通过此 https URL 获取。
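
Illustrative sketch (not from the paper): the abstract above says the proposed selection methods prioritize response pairs with large predicted quality gaps. The snippet below shows only the simplest greedy version of that idea, picking the pair with the largest gap from point estimates. It is not DRTS or DELTAUCB, which also use uncertainty estimates; the function name and the score dictionary are hypothetical.

```python
def pick_largest_gap_pair(predicted_scores):
    """Return (chosen_id, rejected_id) with the largest predicted quality gap.

    predicted_scores: dict mapping response id -> predicted quality score.
    The largest gap is always between the best and the worst response.
    """
    if len(predicted_scores) < 2:
        raise ValueError("need at least two responses to form a pair")
    chosen = max(predicted_scores, key=predicted_scores.get)
    rejected = min(predicted_scores, key=predicted_scores.get)
    return chosen, rejected
```

For example, with predicted scores `{"r1": 0.2, "r2": 0.9, "r3": 0.5}`, the pair sent for annotation would be `r2` (preferred) against `r1`.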

Title: TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR

Authors: Fayaz Ali Dharejo, Sharif S. M. A., Aiman Khalil, Nachiket Chaudhary, Rizwan Ali Naqvi, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09702
Pdf URL: https://arxiv.org/pdf/2603.09702
Copy Paste: [[2603.09702]] TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR(https://arxiv.org/abs/2603.09702)
Keywords: super-resolution
Abstract: Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency domain imbalances. We propose TriFusionSR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware crossmodal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
摘要：多模态医学图像融合通过聚合互补的结构和功能信息促进综合诊断，但其有效性受到分辨率下降和模态差异的限制。现有方法通常在不同的阶段执行图像融合和超分辨率 (SR)，从而导致伪影和感知质量下降。由于明显的频域不平衡，这些局限性在将解剖模式（例如 MRI、CT）与功能扫描（例如 PET、SPECT）相结合的三模式设置中进一步放大。我们提出了 TriFusionSR，一种用于联合三模态融合和 SR 的小波引导条件扩散框架。该框架使用 2D 离散小波变换将多模态特征显式分解为频带，从而实现频率感知的跨模态交互。我们进一步引入了用于潜在系数校准的修正小波特征（RWF）策略，然后是具有门控通道空间注意力的自适应空间频率融合（ASFF）模块，以实现结构驱动的多模态细化。大量实验证明了最先进的性能，在多个上采样尺度上实现了 4.8-12.4% 的 PSNR 改进以及 RMSE 和 LPIPS 的大幅降低。
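
Illustrative sketch (not from the paper): the abstract above decomposes features into frequency bands with the 2D Discrete Wavelet Transform. The snippet below shows a one-level 2D DWT using the Haar wavelet on a plain nested list. The choice of Haar and the function name are assumptions; the abstract does not name the wavelet family used.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar DWT of an even-sized 2D list.

    Returns four half-resolution bands:
      approx      - local average (low frequency)
      row_detail  - top rows minus bottom rows within each 2x2 block
      col_detail  - left columns minus right columns within each 2x2 block
      diag_detail - diagonal difference within each 2x2 block
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((tl + tr + bl + br) / 2.0)
            r_row.append((tl + tr - bl - br) / 2.0)
            c_row.append((tl - tr + bl - br) / 2.0)
            d_row.append((tl - tr - bl + br) / 2.0)
        approx.append(a_row)
        row_detail.append(r_row)
        col_detail.append(c_row)
        diag_detail.append(d_row)
    return approx, row_detail, col_detail, diag_detail
```

The transform is orthonormal, so the sum of squared coefficients across the four bands equals the sum of squared pixel values of the input.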

Title: FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

Authors: Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09721
Pdf URL: https://arxiv.org/pdf/2603.09721
Copy Paste: [[2603.09721]] FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation(https://arxiv.org/abs/2603.09721)
Keywords: generation
Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
摘要：由于难以有效地建模复杂的时空动力学，高保真视频生成对于扩散模型仍然具有挑战性。最近的视频扩散方法通常将视频表示为一系列时空标记，可以使用扩散变压器（DiT）对其进行建模。然而，这种方法面临着强大但昂贵的 Full 3D Attention 和高效但时间有限的 Local Factorized Attention 之间的权衡。为了解决这种权衡，我们提出了矩阵注意力，这是一种帧级时间注意力机制，它将整个帧作为矩阵处理，并通过矩阵本机操作生成查询、键和值矩阵。通过跨帧而不是令牌进行关注，矩阵注意力有效地保留了全局时空结构并适应重大运动。我们构建了基于 MatrixAttention 的 DiT 架构 FrameDiT-G，并进一步介绍了 FrameDiT-H，它将 Matrix Attention 与 Local Factorized Attention 相结合，以捕获大运动和小运动。大量实验表明，FrameDiT-H 在多个视频生成基准中实现了最先进的结果，提供了改进的时间一致性和视频质量，同时保持了与局部因子化注意力相当的效率。

Title: LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control

Authors: Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09759
Pdf URL: https://arxiv.org/pdf/2603.09759
Copy Paste: [[2603.09759]] LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control(https://arxiv.org/abs/2603.09759)
Keywords: generation
Abstract: Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
摘要：文本到图像生成方面的最新进展令人瞩目，但生成和谐地整合视觉和文本元素的多语言设计徽标仍然是一项具有挑战性的任务。现有方法在应用创意风格时经常会扭曲字符几何形状，并且在无需额外培训的情况下很难支持多语言文本生成。为了应对这些挑战，我们提出了 LogoDiffuser，这是一种无需训练的方法，可以使用多模态扩散转换器合成多语言徽标设计。我们不使用文本提示，而是以图像形式输入目标字符，从而无论语言如何，都能实现强大的字符结构控制。我们首先分析联合注意力机制来识别核心标记，这些标记对文本结构有强烈的反应。根据这一观察，我们的方法通过注入信息最丰富的注意力图来整合角色结构和视觉设计。此外，我们对注意力图进行分层聚合，以减轻跨层的注意力转移并获得一致的核心标记。大量的实验和用户研究表明，我们的方法在多语言徽标生成方面实现了最先进的性能。

Title: ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation

Authors: Liudi Yang, George Eskandar, Fengyi Shen, Mohammad Altillawi, Yang Bai, Chi Zhang, Ziyuan Liu, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09819
Pdf URL: https://arxiv.org/pdf/2603.09819
Copy Paste: [[2603.09819]] ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation(https://arxiv.org/abs/2603.09819)
Keywords: generation
Abstract: We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
摘要：我们解决了在大视点变化下仅从两个输入图像合成新颖视图的挑战。现有的基于回归的方法缺乏重建未见区域的能力，而相机引导的扩散模型通常由于点云投影的噪声或相机姿态调节不足而偏离预期轨迹。为了解决这些问题，我们提出了 ConfCtrl，这是一种具有置信度的视频插值框架，使扩散模型能够遵循规定的相机姿势，同时完成看不见的区域。 ConfCtrl 通过将置信加权投影潜在点云与噪声相结合作为调节输入来初始化扩散过程。然后，它应用卡尔曼启发的预测更新机制，将投影点云视为噪声测量，并使用学习的残差校正来平衡姿势驱动的预测与噪声几何观测。这使得模型能够依赖可靠的预测，同时降低不确定区域的权重，从而产生稳定的、几何感知的生成。对多个数据集的实验表明，ConfCtrl 产生几何一致且视觉上合理的新颖视图，在大视点变化下有效地重建遮挡区域。
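
Illustrative sketch (not from the paper): the abstract above describes a "Kalman-inspired predict-update mechanism" that treats the projected point cloud as a noisy measurement. For reference, the snippet below shows the textbook scalar Kalman measurement update that such a mechanism is modeled on. It does not include the learned residual corrections mentioned in the abstract, and the function and argument names are hypothetical.

```python
def kalman_update(pred_mean, pred_var, meas, meas_var):
    """Textbook scalar Kalman measurement update.

    pred_mean, pred_var: prior estimate and its variance (e.g. pose-driven prediction)
    meas, meas_var:      noisy observation and its variance (e.g. projected point)
    Returns the posterior (mean, variance).
    """
    gain = pred_var / (pred_var + meas_var)  # trust in the measurement, in [0, 1]
    mean = pred_mean + gain * (meas - pred_mean)
    var = (1.0 - gain) * pred_var
    return mean, var
```

A low-confidence measurement (large `meas_var`) drives the gain toward zero, so the estimate stays close to the prediction, which mirrors the abstract's idea of down-weighting uncertain regions.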

Title: CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning

Authors: Aleksei Rozanov, Arvind Renganathan, Yimeng Zhang, Vipin Kumar
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2603.09868
Pdf URL: https://arxiv.org/pdf/2603.09868
Copy Paste: [[2603.09868]] CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning(https://arxiv.org/abs/2603.09868)
Keywords: generation
Abstract: Accurately quantifying terrestrial carbon exchange is essential for climate policy and carbon accounting, yet models must generalize to ecosystems underrepresented in sparse eddy covariance observations. Despite this challenge being a natural instance of zero-shot spatial transfer learning for time series regression, no standardized benchmark exists to rigorously evaluate model performance across geographically distinct locations with different climate regimes and vegetation types. We introduce CarbonBench, the first benchmark for zero-shot spatial transfer in carbon flux upscaling. CarbonBench comprises over 1.3 million daily observations from 567 flux tower sites globally (2000-2024). It provides: (1) stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, separating spatial transfer from temporal autocorrelation; (2) a harmonized set of remote sensing and meteorological features to enable flexible architecture design; and (3) baselines ranging from tree-based methods to domain-generalization architectures. By bridging machine learning methodologies and Earth system science, CarbonBench aims to enable systematic comparison of transfer learning methods, serves as a testbed for regression under distribution shift, and contributes to the next-generation climate modeling efforts.
摘要：准确量化陆地碳交换对于气候政策和碳核算至关重要，但模型必须推广到稀疏涡协方差观测中代表性不足的生态系统。尽管这一挑战是时间序列回归的零样本空间迁移学习的自然实例，但不存在标准化基准来严格评估具有不同气候状况和植被类型的不同地理位置的模型性能。我们推出了 CarbonBench，这是碳通量升级中零样本空间转移的第一个基准。 CarbonBench 包含来自全球 567 个通量塔站点的超过 130 万个每日观测数据（2000 年至 2024 年）。它提供：（1）分层评估协议，明确测试未见过的植被类型和气候状况的泛化，将空间转移与时间自相关分开； (2) 一套协调一致的遥感和气象特征，以实现灵活的架构设计； (3) 基线范围从基于树的方法到领域泛化架构。通过连接机器学习方法和地球系统科学，CarbonBench 旨在实现迁移学习方法的系统比较，作为分布变化下回归的测试平台，并为下一代气候建模工作做出贡献。
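
Illustrative sketch (not from the paper): the abstract above describes stratified evaluation protocols that test generalization to unseen vegetation types and climate regimes. The snippet below shows a minimal leave-group-out split at the site level, so that no site of a held-out vegetation type appears in training. The function name, the (site_id, vegetation_type) layout, and the labels are assumptions, not the benchmark's actual API or protocol.

```python
def split_by_held_out_group(sites, held_out_types):
    """Split sites so that held-out vegetation types appear only in the test set.

    sites: list of (site_id, vegetation_type) tuples (hypothetical layout).
    Returns (train_site_ids, test_site_ids).
    """
    held = set(held_out_types)
    train = [site_id for site_id, veg in sites if veg not in held]
    test = [site_id for site_id, veg in sites if veg in held]
    return train, test


# Hypothetical example: hold out every "wetland" site for zero-shot evaluation.
sites = [("A", "forest"), ("B", "grassland"), ("C", "wetland"), ("D", "forest")]
train_ids, test_ids = split_by_held_out_group(sites, ["wetland"])
```

Splitting by site rather than by individual daily observation keeps temporally adjacent samples from the same tower out of both sets at once, which is one simple way to separate spatial transfer from temporal autocorrelation.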

Title: InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Authors: Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09877
Pdf URL: https://arxiv.org/pdf/2603.09877
Copy Paste: [[2603.09877]] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing(https://arxiv.org/abs/2603.09877)
Keywords: generation
Abstract: Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
摘要：集成理解、推理、生成和编辑的统一多模态模型 (UMM) 面临着保持强大的语义理解和获得强大的生成能力之间固有的权衡。在本报告中，我们介绍了 InternVL-U，这是一种轻量级 4B 参数 UMM，可在统一框架内实现这些功能的民主化。在统一上下文建模和具有解耦视觉表示的特定模态模块化设计原则的指导下，InternVL-U 将最先进的多模态大语言模型 (MLLM) 与基于 MMDiT 的专门视觉生成头集成在一起。为了进一步弥合审美生成和高级智能之间的差距，我们构建了一个针对高语义密度任务（例如文本渲染和科学推理）的综合数据合成管道，在以推理为中心的范式下，利用思想链（CoT）更好地将抽象的用户意图与细粒度的视觉生成细节结合起来。大量实验表明，InternVL-U 实现了卓越的性能-效率平衡。尽管仅使用 4B 参数，但它在各种生成和编辑任务上始终优于具有超过 3 倍更大尺度的统一基线模型，例如 BAGEL (14B)，同时保留强大的多模态理解和推理能力。

Title: DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

Authors: Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09883
Pdf URL: https://arxiv.org/pdf/2603.09883
Copy Paste: [[2603.09883]] DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary(https://arxiv.org/abs/2603.09883)
Keywords: generation
Abstract: Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at this https URL.
摘要：以人为中心的视频生成技术发展迅速，但现有方法难以生成可控且物理一致的人机交互 (HOI) 视频。现有的作品依赖于密集的控制信号、模板视频或精心设计的文本提示，这限制了对新物体的灵活性和泛化。我们引入了一个框架，即 DISPLAY，由稀疏运动指导引导，仅由腕关节坐标和形状不可知的对象边界框组成。这种轻量级的指导减轻了人类和物体表示之间的不平衡，并实现了直观的用户控制。为了增强这种稀疏条件下的保真度，我们提出了一种对象应力注意力机制，可以提高对象的鲁棒性。为了解决高质量 HOI 数据的稀缺问题，我们进一步开发了具有专用数据管理管道的多任务辅助训练策略，使模型能够从可靠的 HOI 样本和辅助任务中受益。综合实验表明，我们的方法可以在不同的任务中实现高保真、可控的 HOI 生成。项目页面可以在此 https URL 中找到。

Title: Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Authors: Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09896
Pdf URL: https://arxiv.org/pdf/2603.09896
Copy Paste: [[2603.09896]] Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports(https://arxiv.org/abs/2603.09896)
Keywords: generation
Abstract: Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.

Title: WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.09921
Pdf URL: https://arxiv.org/pdf/2603.09921
Copy Paste: [[2603.09921]] WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition(https://arxiv.org/abs/2603.09921)
Keywords: generative
Abstract: Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at this https URL
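The contrastive paradigm the abstract refers to can be illustrated with a minimal sketch: an image embedding is scored against candidate entity embeddings by cosine similarity, and an InfoNCE-style loss rewards a high score for the correct entity. This is a generic illustration, not WikiCLIP's actual loss, adaptor (VGKA), or hard-negative synthesis; the function names, the toy 2-D vectors, and the temperature value are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(image_emb, entity_embs):
    """Return the index of the entity most similar to the image."""
    return max(range(len(entity_embs)),
               key=lambda i: cosine(image_emb, entity_embs[i]))


def info_nce_loss(image_emb, entity_embs, positive_index, temperature=0.07):
    """Generic InfoNCE loss: -log softmax of the positive's similarity.

    The temperature default is a hypothetical choice for illustration.
    """
    logits = [cosine(image_emb, e) / temperature for e in entity_embs]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[positive_index]


# Toy embeddings (hypothetical): entity 0 is the correct match.
image = [1.0, 0.0]
entities = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0]]
best = retrieve(image, entities)
base_loss = info_nce_loss(image, entities, positive_index=0)

# A visually similar but distinct negative makes the task harder.
hard_negative = [0.95, 0.05]
hard_loss = info_nce_loss(image, entities + [hard_negative], positive_index=0)
```

Inference in this setup is a single similarity lookup over precomputed entity embeddings. That is the usual reason contrastive retrieval is cheaper than generating an entity name token by token.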

Title: On the Structural Failure of Chamfer Distance in 3D Shape Optimization

Authors: Chang-Yong Song, David Hyde
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2603.09925
Pdf URL: https://arxiv.org/pdf/2603.09925
Copy Paste: [[2603.09925]] On the Structural Failure of Chamfer Distance in 3D Shape Optimization(https://arxiv.org/abs/2603.09925)
Keywords: generation
Abstract: Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5× improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
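To make the many-to-one behavior concrete, here is a minimal sketch of a symmetric Chamfer distance together with a count of how many source points select each target point as their nearest neighbor. Conventions vary (squared vs. unsquared distances, sums vs. means); this sketch assumes squared distances averaged per direction and treats source-to-target as the "forward" term, which may differ from the paper's definition. All names and the toy points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_index(p, points):
    """Index of the point in `points` closest to `p`."""
    return min(range(len(points)), key=lambda j: sq_dist(p, points[j]))


def chamfer_distance(source, target):
    """Symmetric Chamfer distance: mean nearest-neighbor cost in both directions."""
    forward = sum(sq_dist(p, target[nearest_index(p, target)])
                  for p in source) / len(source)
    backward = sum(sq_dist(q, source[nearest_index(q, source)])
                   for q in target) / len(target)
    return forward + backward


def forward_assignment_counts(source, target):
    """How many source points pick each target point as nearest neighbor."""
    counts = [0] * len(target)
    for p in source:
        counts[nearest_index(p, target)] += 1
    return counts


# Toy 2D example: three clustered source points, three spread-out targets.
source = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
counts = forward_assignment_counts(source, target)
distance = chamfer_distance(source, target)
```

In this toy case every source point is pulled toward the same target, so a per-point forward gradient gives none of them a reason to cover the other two targets. This is only an intuition for the collapse the abstract describes, not a reproduction of the paper's analysis.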

Title: Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation

Authors: Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.09931
Pdf URL: https://arxiv.org/pdf/2603.09931
Copy Paste: [[2603.09931]] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation(https://arxiv.org/abs/2603.09931)
Keywords: generation
Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at this https URL

Title: Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective

Authors: Erkan Turan, Maks Ovsjanikov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.09936
Pdf URL: https://arxiv.org/pdf/2603.09936
Copy Paste: [[2603.09936]] Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective(https://arxiv.org/abs/2603.09936)
Keywords: generation, generative
Abstract: Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: \emph{under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions}. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to \emph{Landau damping} in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
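The exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ stated in the abstract is simple to evaluate. The sketch below computes the bandwidth at a given time and inverts the schedule to find when a target bandwidth is reached. The function names and numeric values are hypothetical, and nothing here reproduces the paper's convergence analysis.

```python
import math


def annealed_bandwidth(t, sigma0, rate):
    """Kernel bandwidth sigma(t) = sigma0 * exp(-rate * t)."""
    return sigma0 * math.exp(-rate * t)


def time_to_reach(sigma_target, sigma0, rate):
    """Invert the schedule: t = ln(sigma0 / sigma_target) / rate."""
    return math.log(sigma0 / sigma_target) / rate


# Hypothetical values: start at bandwidth 1.0 and shrink with rate 0.5.
sigma0, rate = 1.0, 0.5
t_hit = time_to_reach(0.1, sigma0, rate)
sigma_at_hit = annealed_bandwidth(t_hit, sigma0, rate)
```

Under this schedule the time needed to shrink the bandwidth by a given factor grows only logarithmically in that factor. That matches the flavor of the $O(\log K_{\max})$ claim, although the link between bandwidth and resolvable frequency is the paper's result and is not derived here.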

Title: Towards a Neural Debugger for Python

Authors: Maximilian Beck, Jonas Gehring, Jannik Kossen, Gabriel Synnaeve
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2603.09951
Pdf URL: https://arxiv.org/pdf/2603.09951
Copy Paste: [[2603.09951]] Towards a Neural Debugger for Python(https://arxiv.org/abs/2603.09951)
Keywords: generation
Abstract: Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers -- obtained via fine-tuning large LLMs or pre-training smaller models from scratch -- can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
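For readers unfamiliar with execution traces, the sketch below records a line-by-line trace of a small function using the standard-library `sys.settrace` hook. It logs the relative line number and a snapshot of the local variables at each step. This only illustrates the kind of data a conventional tracer exposes; the trace format, the helper `trace_lines`, and the example function are assumptions and are not taken from the paper.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record (relative line, locals) for each executed line."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the target function's own frame.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Locals are captured before the line executes.
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        elif event == "return":
            events.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous tracer
    return result, events


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, events = trace_lines(running_sum, 2)
for event in events:
    print(event)
```

Debugger operations such as stepping over a call or running to a breakpoint amount to skipping some of these events and reporting the state at a later one. The abstract describes conditioning predictions on actions of that kind.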