2026-03-19

Title: Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards

Authors: Kaito Baba, Satoshi Kodera
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16876
Pdf URL: https://arxiv.org/pdf/2603.16876
Copy Paste: [[2603.16876]] Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards(https://arxiv.org/abs/2603.16876)
Keywords: generation
Abstract: We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.

Title: Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation

Authors: Rena Suzuki, Masato Kikuchi, Tadachika Ozono
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16931
Pdf URL: https://arxiv.org/pdf/2603.16931
Copy Paste: [[2603.16931]] Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation(https://arxiv.org/abs/2603.16931)
Keywords: generation
Abstract: While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process -- particularly applying visual effects to ground spoken content to slide objects -- remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.
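
Illustrative sketch (not from the paper): the abstract reports an F1-score for grounding script sentences to slide objects. One plausible way to score such a task is F1 over predicted (sentence, object) pairs; the pair representation, the function name grounding_f1, and the toy data below are assumptions made for illustration.

```python
def grounding_f1(predicted, gold):
    """F1 over (sentence_id, object_id) grounding pairs.

    Hypothetical scoring scheme: a predicted pair counts as correct
    only if the same pair appears in the gold annotation.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotation: sentence 2 refers to two text objects on the slide.
gold = {(0, "title"), (1, "bullet_1"), (2, "bullet_2"), (2, "bullet_3")}
# Toy prediction: one object is attached to the wrong sentence.
pred = {(0, "title"), (1, "bullet_1"), (2, "bullet_2"), (3, "bullet_3")}

score = grounding_f1(pred, gold)
print(round(score, 3))
```

Scoring pairs rather than whole sentences gives partial credit when a sentence grounds to several objects and only some are found.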

Title: AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding

Authors: Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16934
Pdf URL: https://arxiv.org/pdf/2603.16934
Copy Paste: [[2603.16934]] AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding(https://arxiv.org/abs/2603.16934)
Keywords: generative
Abstract: The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat's superior performance over other open-source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at this https URL.

Title: TDMM-LM: Bridging Facial Understanding and Animation via Language Models

Authors: Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yolo Yunlong Tang, Zichong Xu, Susan Liang, Jing Bi, Jason J Corso, Chenliang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16936
Pdf URL: https://arxiv.org/pdf/2603.16936
Copy Paste: [[2603.16936]] TDMM-LM: Bridging Facial Understanding and Animation via Language Models(https://arxiv.org/abs/2603.16936)
Keywords: generative
Abstract: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.

Title: KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition

Authors: Yuhan Chen, Yicui Shi, Guofa Li, Liping Zhang, Jie Li, Jiaxin Gao, Wenbo Chu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16943
Pdf URL: https://arxiv.org/pdf/2603.16943
Copy Paste: [[2603.16943]] KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition(https://arxiv.org/abs/2603.16943)
Keywords: generative
Abstract: Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
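
Illustrative sketch (not from the paper): the abstract describes building anisotropic covariances from joint velocities and deriving an adjacency prior from the Bhattacharyya distance between joint Gaussians. The 2D setting, the covariance form base_var * I + stretch * v v^T, and the use of exp(-distance) as an affinity are assumptions chosen to keep the example small; only the Bhattacharyya distance formula itself is standard.

```python
import math


def velocity_covariance(velocity, base_var=1.0, stretch=1.0):
    """2x2 covariance elongated along the velocity direction.

    Assumed form: base_var * I + stretch * v v^T (always positive definite).
    """
    vx, vy = velocity
    return [[base_var + stretch * vx * vx, stretch * vx * vy],
            [stretch * vx * vy, base_var + stretch * vy * vy]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 2D Gaussians."""
    avg = [[(cov1[i][j] + cov2[i][j]) / 2 for j in range(2)] for i in range(2)]
    d = det2(avg)
    inv = [[avg[1][1] / d, -avg[0][1] / d],
           [-avg[1][0] / d, avg[0][0] / d]]
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    maha = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return maha / 8 + 0.5 * math.log(d / math.sqrt(det2(cov1) * det2(cov2)))


def adjacency(positions, velocities):
    """Soft adjacency: exp(-distance), i.e. the Bhattacharyya coefficient."""
    covs = [velocity_covariance(v) for v in velocities]
    n = len(positions)
    return [[math.exp(-bhattacharyya(positions[i], covs[i], positions[j], covs[j]))
             for j in range(n)] for i in range(n)]


# Three hypothetical joints: two close together, one far away.
positions = [(0.0, 0.0), (0.5, 0.0), (4.0, 3.0)]
velocities = [(1.0, 0.0), (1.0, 0.2), (0.0, -1.5)]
A = adjacency(positions, velocities)
for row in A:
    print([round(a, 3) for a in row])
```

Because exp(-distance) is 1 for identical distributions and decays toward 0 as they separate, nearby joints with similar motion get strong edges without relying on the fixed skeleton graph.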

Title: Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

Authors: Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Tiankun Yang, Chenxi Bao, Haopeng Jin, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Haijin Liang, Jin Ma, Xinming Wang, Ruiwen Tao, Hongzhu Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16944
Pdf URL: https://arxiv.org/pdf/2603.16944
Copy Paste: [[2603.16944]] Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models(https://arxiv.org/abs/2603.16944)
Keywords: generation
Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a failure mode that is critical in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.

Title: PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Authors: Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16958
Pdf URL: https://arxiv.org/pdf/2603.16958
Copy Paste: [[2603.16958]] PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models(https://arxiv.org/abs/2603.16958)
Keywords: generation
Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.

Title: Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

Authors: Abinav Rao, Sujan Rachuri
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.17044
Pdf URL: https://arxiv.org/pdf/2603.17044
Copy Paste: [[2603.17044]] Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models(https://arxiv.org/abs/2603.17044)
Keywords: generation
Abstract: Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
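
Illustrative sketch (not from the paper): the abstract notes that the generation DPO loss converges to ln(2) and that understanding and generation gradients are near-orthogonal with a large magnitude imbalance. The standard DPO loss equals ln(2) exactly when the policy's preference margin matches the reference model's, which is one way to read that observation. All log-probabilities and gradient vectors below are made-up numbers, and the helper names are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))


# Policy has not moved relative to the reference: zero margin, loss = ln(2).
stuck_loss = dpo_loss(-50.0, -50.0, -50.0, -50.0)
# Policy prefers the chosen sample more than the reference does: loss < ln(2).
aligned_loss = dpo_loss(-40.0, -60.0, -50.0, -50.0)
print(stuck_loss, math.log(2), aligned_loss)

# Toy gradients: orthogonal directions, generation 12x larger in magnitude.
grad_understanding = [1.0, 0.0]
grad_generation = [0.0, 12.0]
print(cosine(grad_understanding, grad_generation),
      norm(grad_generation) / norm(grad_understanding))
```

A loss pinned at ln(2) therefore means the optimizer is not separating chosen from rejected samples at all, which is consistent with the negative result the abstract reports.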

Title: SCE-LITE-HQ: Smooth visual counterfactual explanations with generative foundation models

Authors: Ahmed Zeid, Sidney Bender
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.17048
Pdf URL: https://arxiv.org/pdf/2603.17048
Copy Paste: [[2603.17048]] SCE-LITE-HQ: Smooth visual counterfactual explanations with generative foundation models(https://arxiv.org/abs/2603.17048)
Keywords: generation, generative
Abstract: Modern neural networks achieve strong performance but remain difficult to interpret in high-dimensional visual domains. Counterfactual explanations (CFEs) provide a principled approach to interpreting black-box predictions by identifying minimal input changes that alter model outputs. However, existing CFE methods often rely on dataset-specific generative models and incur substantial computational cost, limiting their scalability to high-resolution data. We propose SCE-LITE-HQ, a scalable framework for counterfactual generation that leverages pretrained generative foundation models without task-specific retraining. The method operates in the latent space of the generator, incorporates smoothed gradients to improve optimization stability, and applies mask-based diversification to promote realistic and structurally diverse counterfactuals. We evaluate SCE-LITE-HQ on natural and medical datasets using a desiderata-driven evaluation protocol. Results show that SCE-LITE-HQ produces valid, realistic, and diverse counterfactuals competitive with or outperforming existing baselines, while avoiding the overhead of training dedicated generative models.

Title: Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Authors: Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17051
Pdf URL: https://arxiv.org/pdf/2603.17051
Copy Paste: [[2603.17051]] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models(https://arxiv.org/abs/2603.17051)
Keywords: generation
Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.

Title: Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Yudi Wu, Zhouhan Lin, Dianbo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17052
Pdf URL: https://arxiv.org/pdf/2603.17052
Copy Paste: [[2603.17052]] Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization(https://arxiv.org/abs/2603.17052)
Keywords: generative
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we systematically investigate the issue of collapses in vector quantization, where collapsed representations are observed across discrete codebook tokens and continuous latent embeddings. By leveraging both synthetic and real datasets, we identify the severity of each type of collapse and its triggering conditions. Our analysis reveals that random initialization and limited encoder capacity result in token collapse and embedding collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
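
Illustrative sketch (not from the paper): token collapse in vector quantization means most inputs map to a few codebook entries. A common way to expose this is codebook perplexity, the exponential of the entropy of code usage; whether the paper uses this metric is not stated in the abstract, so treat the metric, the function names, and the toy data as assumptions.

```python
import math


def quantize(vectors, codebook):
    """Assign each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]


def codebook_perplexity(assignments, codebook_size):
    """exp(entropy) of code usage: 1.0 is full collapse, codebook_size is uniform use."""
    counts = [0] * codebook_size
    for k in assignments:
        counts[k] += 1
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c)
    return math.exp(entropy)


codebook = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0), (5.0, -5.0)]

# A weak encoder whose outputs cluster near the origin: every input hits code 0.
collapsed = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
# A spread-out encoder output: each input lands on a different code.
spread = [(0.2, 0.1), (4.8, 5.1), (-5.2, 4.9), (5.1, -4.7)]

collapsed_ppl = codebook_perplexity(quantize(collapsed, codebook), len(codebook))
spread_ppl = codebook_perplexity(quantize(spread, codebook), len(codebook))
print(collapsed_ppl, spread_ppl)
```

The collapsed case mirrors the abstract's point that limited encoder capacity or an unlucky initialization can leave most of the codebook unused.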

Title: PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

Authors: Yijian Wang, Qingsen Yan, Jiantao Zhou, Duwei Dai, Wei Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17055
Pdf URL: https://arxiv.org/pdf/2603.17055
Copy Paste: [[2603.17055]] PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning(https://arxiv.org/abs/2603.17055)
Keywords: restoration, generation
Abstract: Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent's ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority in addressing complex IR tasks. Our project page is PaAgent (this https URL).

Title: CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning

Authors: Weikun K. Zhang, Rohan Pandey, Bhaumik Mehta, Kaijie Jin, Naomi Morato, Archit Ganapule, Michael Ruofan Zeng, Jarod Alper
Subjects: cs.LG, cs.AI, cs.CC
Abstract URL: https://arxiv.org/abs/2603.17075
Pdf URL: https://arxiv.org/pdf/2603.17075
Copy Paste: [[2603.17075]] CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning(https://arxiv.org/abs/2603.17075)
Keywords: generation
Abstract: Motivated by auto-proof generation and Valiant's VP vs. VNP conjecture, we study the problem of discovering efficient arithmetic circuits to compute polynomials, using addition and multiplication gates. We formulate this problem as a single-player game, where an RL agent attempts to build the circuit within a fixed number of operations. We implement an AlphaZero-style training loop and compare two approaches: Proximal Policy Optimization with Monte Carlo Tree Search (PPO+MCTS) and Soft Actor-Critic (SAC). SAC achieves the highest success rates on two-variable targets, while PPO+MCTS scales to three variables and demonstrates steady improvement on harder instances. These results suggest that polynomial circuit synthesis is a compact, verifiable setting for studying self-improving search policies.
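
Illustrative sketch (not from the paper): the abstract frames circuit discovery as a single-player game in which an agent adds addition and multiplication gates within a fixed budget. The environment below is a hypothetical minimal version of that idea; the polynomial encoding (exponent tuples mapped to integer coefficients), the sparse 0/1 reward, and the class name CircuitGame are assumptions.

```python
def add(p, q):
    """Sum of two polynomials stored as {exponent_tuple: coefficient}."""
    out = dict(p)
    for mono, c in q.items():
        out[mono] = out.get(mono, 0) + c
    return {m: c for m, c in out.items() if c != 0}


def mul(p, q):
    """Product of two polynomials stored as {exponent_tuple: coefficient}."""
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = tuple(a + b for a, b in zip(m1, m2))
            out[mono] = out.get(mono, 0) + c1 * c2
    return {m: c for m, c in out.items() if c != 0}


class CircuitGame:
    """Two-variable game: start from x, y, 1 and add gates until the target appears."""

    def __init__(self, target, max_gates):
        self.nodes = [{(1, 0): 1}, {(0, 1): 1}, {(0, 0): 1}]  # x, y, constant 1
        self.target = target
        self.gates_left = max_gates

    def step(self, op, i, j):
        """Apply gate op ('add' or 'mul') to nodes i and j; return (reward, done)."""
        if self.gates_left <= 0:
            raise RuntimeError("gate budget exhausted")
        gate = add if op == "add" else mul
        self.nodes.append(gate(self.nodes[i], self.nodes[j]))
        self.gates_left -= 1
        solved = self.nodes[-1] == self.target
        return (1.0 if solved else 0.0), (solved or self.gates_left == 0)


# Target x^2 + 2xy + y^2, reachable in two gates as (x + y) * (x + y).
target = {(2, 0): 1, (1, 1): 2, (0, 2): 1}
game = CircuitGame(target, max_gates=3)
first = game.step("add", 0, 1)   # node 3 = x + y
second = game.step("mul", 3, 3)  # node 4 = (x + y)^2
print(first, second)
```

Success is checked by exact polynomial equality, which is what makes the setting verifiable: a learned policy cannot be rewarded for a circuit that only approximately computes the target.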

Title: SENSE: Efficient EEG-to-Text via Privacy-Preserving Semantic Retrieval

Authors: Akshaj Murhekar, Christina Liu, Abhijit Mishra, Shounak Roychowdhury, Jacek Gwizdka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17109
Pdf URL: https://arxiv.org/pdf/2603.17109
Copy Paste: [[2603.17109]] SENSE: Efficient EEG-to-Text via Privacy-Preserving Semantic Retrieval(https://arxiv.org/abs/2603.17109)
Keywords: generation, generative
Abstract: Decoding brain activity into natural language is a major challenge in AI with important applications in assistive communication, neurotechnology, and human-computer interaction. Most existing Brain-Computer Interface (BCI) approaches rely on memory-intensive fine-tuning of Large Language Models (LLMs) or encoder-decoder models on raw EEG signals, resulting in expensive training pipelines, limited accessibility, and potential exposure of sensitive neural data. We introduce SENSE (SEmantic Neural Sparse Extraction), a lightweight and privacy-preserving framework that translates non-invasive electroencephalography (EEG) into text without LLM fine-tuning. SENSE decouples decoding into two stages: on-device semantic retrieval and prompt-based language generation. EEG signals are locally mapped to a discrete textual space to extract a non-sensitive Bag-of-Words (BoW), which conditions an off-the-shelf LLM to synthesize fluent text in a zero-shot manner. The EEG-to-keyword module contains only ~6M parameters and runs fully on-device, ensuring raw neural signals remain local while only abstract semantic cues interact with language models. Evaluated on a 128-channel EEG dataset across six subjects, SENSE matches or surpasses the generative quality of fully fine-tuned baselines such as Thought2Text while substantially reducing computational overhead. By localizing neural decoding and sharing only derived textual cues, SENSE provides a scalable and privacy-aware retrieval-augmented architecture for next-generation BCIs.

Title: Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation

Authors: Marceau Lafargue-Hauret, Raghav Mehta, Fabio De Sousa Ribeiro, Mélanie Roschewitz, Ben Glocker
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.17110
Pdf URL: https://arxiv.org/pdf/2603.17110
Copy Paste: [[2603.17110]] Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation(https://arxiv.org/abs/2603.17110)
Keywords: generation
Abstract: Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map), is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving ~94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.
摘要：图像分割依赖于大型带注释的数据集，这些数据集昂贵且生成缓慢。银标（人工智能生成）标签更容易获得，但有引入偏见的风险。仅需要图像的自监督学习已成为预训练的关键。最近的工作将对比学习与反事实生成相结合，改进了分类的表示学习，但不容易扩展到像素级任务。我们提出了一种通过双视图（DVD-CL）和多视图（MVD-CL）方法将反事实生成与密集对比学习相结合的管道，以及利用可用银标准注释的监督变体。还引入了一种新的可视化算法，即颜色编码高分辨率叠加图（CHRO-map）。实验表明，无注释 DVD-CL 优于其他密集对比学习方法，而使用银标准标签的监督变体优于直接对银标准标记数据进行的训练，在具有挑战性的数据上实现了 $\sim$94% DSC。这些结果强调，通过反事实和银标准注释增强的像素级对比学习可以提高采集和病理变化的鲁棒性。

Title: MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Authors: Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17117
Pdf URL: https://arxiv.org/pdf/2603.17117
Copy Paste: [[2603.17117]] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models(https://arxiv.org/abs/2603.17117)
Keywords: generation
Abstract: Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
摘要：视频传播模型正在超越简短、合理的剪辑，转向世界模拟器，在摄像机运动、重访和干预下必须保持一致。然而，空间记忆仍然是一个关键瓶颈：显式 3D 结构可以提高基于重投影的一致性，但难以描绘移动物体，而隐式记忆即使在正确的姿势下也经常产生不准确的相机运动。我们提出了马赛克记忆 (MosaicMem)，这是一种混合空间记忆，可将补丁提升为 3D，以实现可靠的定位和有针对性的检索，同时利用模型的本机调节来保留后续生成。 MosaicMem 通过补丁和组合接口在查询视图中组合空间对齐的补丁，保留应该持续存在的内容，同时允许模型修复应该发展的内容。通过 PRoPE 相机调节和两种新的记忆对齐方法，实验表明，与隐式记忆相比，姿势依从性得到了改善，动态建模也比显式基线更强。 MosaicMem 进一步实现了分钟级导航、基于内存的场景编辑和自回归推出。

Title: SMAL-pets: SMAL Based Avatars of Pets from Single Image

Authors: Piotr Borycki, Joanna Waczyńska, Yizhe Zhu, Yongqiang Gao, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17131
Pdf URL: https://arxiv.org/pdf/2603.17131
Copy Paste: [[2603.17131]] SMAL-pets: SMAL Based Avatars of Pets from Single Image(https://arxiv.org/abs/2603.17131)
Keywords: generative
Abstract: Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar's appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.
摘要：创建高保真、可动画的 3D 狗头像仍然是计算机视觉领域的一项艰巨挑战。与人类数字替身不同，动物重建面临着专门应用的大规模带注释数据集的严重短缺。此外，物种、品种和杂交之间巨大的形态多样性，在大小、比例和特征上存在显着差异，使现有模型的推广变得复杂。当前的重建方法通常很难捕捉真实的毛发纹理。此外，确保这些化身完全可编辑并能够执行复杂、自然的动作通常需要劳动密集型的手动网格操作和专家装配。本文介绍了 SMAL-pets，这是一个综合框架，可以从单个输入图像生成高质量、可编辑的动物头像。我们的方法通过利用混合架构弥合了重建和生成建模之间的差距。我们的方法将 3D 高斯溅射与 SMAL 参数化模型集成，以提供视觉上高保真度和解剖学基础的表示。我们引入了一个多模式编辑套件，使用户能够通过直接的文本提示来完善化身的外观并执行复杂的动画。 SMAL-pets 允许用户通过自然语言控制模型的美学和行为方面，为动画和虚拟现实提供了灵活、强大的工具。

Title: Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

Authors: Parsa Mirtaheri, Mikhail Belkin
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.17199
Pdf URL: https://arxiv.org/pdf/2603.17199
Copy Paste: [[2603.17199]] Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing(https://arxiv.org/abs/2603.17199)
Keywords: generation
Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets, demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model's residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as an LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.
摘要：大型语言模型 (LLM) 可能会产生无法准确反映驱动其答案的实际因素的思维链 (CoT)。在注入有利于特定选项的提示的多项选择设置中，模型可能会将其最终答案转向暗示的选项，并生成一个 CoT，在不确认提示的情况下合理化响应 - 这是动机推理的一个实例。我们在多个 LLM 系列和数据集中研究了这种现象，证明即使在无法通过 CoT 轻松确定的情况下，也可以通过探测内部激活来识别动机推理。使用在模型的残差流上训练的监督探针，我们表明（i）在生成任何 CoT 令牌之前应用的预生成探针，可以预测动机推理以及访问完整 CoT 跟踪的基于 LLM 的 CoT 监视器，以及（ii）在 CoT 生成之后应用的后生成探针，其性能优于相同的监视器。总之，这些结果表明，从内部表征中检测到动机推理比通过 CoT 监控更可靠。此外，生成前探测可以尽早标记动机行为，从而有可能避免不必要的生成。
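
Illustrative sketch (not the authors' code): the abstract describes supervised probes trained on residual-stream activations. A minimal way to picture such a probe is a logistic-regression classifier over activation vectors. Everything below is an assumption made for illustration: the synthetic "activations", the 4-dimensional size, the function names, and the training hyperparameters. Real probes would be fit on activations extracted from an actual model.

```python
import math
import random
from typing import List, Tuple


def make_synthetic_activations(n: int, dim: int, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical stand-in for residual-stream activations.

    Label 1 ("motivated") shifts the vector along a fixed direction,
    label 0 shifts it the opposite way; Gaussian noise is added on top.
    """
    rng = random.Random(seed)
    direction = [1.0, 0.5] + [0.0] * (dim - 2)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * d + rng.gauss(0.0, 0.5) for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs: List[List[float]], ys: List[int], lr: float = 0.5, epochs: int = 200):
    """Fit a linear probe (weights w, bias b) with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: List[float], b: float, xs: List[List[float]], ys: List[int]) -> float:
    """Fraction of examples where the sign of the probe logit matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        predicted_positive = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        correct += int(predicted_positive == (y == 1))
    return correct / len(xs)


if __name__ == "__main__":
    xs, ys = make_synthetic_activations(200, 4)
    w, b = train_probe(xs, ys)
    print("train accuracy:", probe_accuracy(w, b, xs, ys))
```

In the setting the abstract describes, a pre-generation probe would read the activation before any CoT token is produced and a post-generation probe would read it after the CoT; the classifier itself could stay this simple.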

Title: GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Authors: Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17240
Pdf URL: https://arxiv.org/pdf/2603.17240
Copy Paste: [[2603.17240]] GigaWorld-Policy: An Efficient Action-Centered World-Action Model(https://arxiv.org/abs/2603.17240)
Keywords: generation
Abstract: World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
摘要：从预先训练的视频生成主干初始化的世界动作模型（WAM）已经证明了机器人策略学习的巨大潜力。然而，现有方法面临两个阻碍性能和部署的关键瓶颈。首先，对未来视觉动态和相应动作的联合推理会产生大量的推理开销。其次，联合建模通常将视觉和运动表示纠缠在一起，使得运动预测的准确性在很大程度上取决于未来视频预测的质量。为了解决这些问题，我们引入了 GigaWorld-Policy，这是一种以动作为中心的 WAM，它可以学习 2D 像素动作动态，同时实现高效的动作解码，并具有可选的视频生成功能。具体来说，我们将策略训练制定为两个耦合的组件：模型根据当前观察预测未来的动作序列，并同时根据预测的动作和相同的观察生成未来的视频。该策略受到动作预测和视频生成的监督，提供更丰富的学习信号并通过视觉动态约束鼓励物理上合理的动作。通过防止未来视频令牌影响动作令牌的因果设计，在推理时可以选择显式的未来视频生成，从而在部署期间实现更快的动作预测。为了支持这种范例，我们策划了一个多样化的大规模机器人数据集来预训练以动作为中心的视频生成模型，然后将其用作机器人策略学习的骨干。现实世界机器人平台上的实验结果表明，GigaWorld-Policy 的运行速度比领先的 WAM 基线 Motus 快 9 倍，同时将任务成功率提高 7%。此外，与pi-0.5相比，GigaWorld-Policy在RoboTwin 2.0上的性能提高了95%。
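
Illustrative sketch (not the authors' code): the abstract mentions a causal design that prevents future-video tokens from influencing action tokens, which is what makes video generation optional at inference. One way to picture this is a block attention mask. The token ordering (observation, then action, then future video), the function names, and the exact visibility rules are assumptions for this sketch only.

```python
from typing import List


def build_block_mask(n_obs: int, n_action: int, n_video: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order: observation tokens, action tokens, future-video tokens.
    """
    kinds = ["obs"] * n_obs + ["act"] * n_action + ["vid"] * n_video
    # Key kinds that each query kind is allowed to read.
    allowed = {
        "obs": {"obs"},
        "act": {"obs", "act"},         # actions never read future-video tokens
        "vid": {"obs", "act", "vid"},  # video is conditioned on observation and actions
    }
    return [[k in allowed[q] for k in kinds] for q in kinds]


def action_rows_ignore_video(mask: List[List[bool]], n_obs: int, n_action: int) -> bool:
    """Check that no action query can see any future-video key."""
    video_start = n_obs + n_action
    return all(not any(row[video_start:]) for row in mask[n_obs:video_start])


if __name__ == "__main__":
    mask = build_block_mask(n_obs=2, n_action=3, n_video=4)
    for row in mask:
        print("".join("x" if visible else "." for visible in row))
    print("actions independent of video:", action_rows_ignore_video(mask, 2, 3))
```

Because the action rows never look at video columns under this mask, the video tokens can simply be dropped at deployment without changing the action outputs.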

Title: Variational Rectification Inference for Learning with Noisy Labels

Authors: Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu, Fan Liu, Hehe Fan, Yilong Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17255
Pdf URL: https://arxiv.org/pdf/2603.17255
Copy Paste: [[2603.17255]] Variational Rectification Inference for Learning with Noisy Labels(https://arxiv.org/abs/2603.17255)
Keywords: generation
Abstract: Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise in deep models, effective strategies (e.g., re-weighting or loss rectification) have been broadly applied in prevailing approaches, and these are generally learned under the meta-learning scenario. Despite the robustness to noise achieved by probabilistic meta-learning models, they usually suffer from model collapse that degrades generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification of loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayesian model by treating the rectifying vector as a latent variable, which can rectify the loss of a noisy sample with extra randomness regularization and is, therefore, more robust to label noise. To infer the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within the bi-level optimization framework. Besides, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.
摘要：标签噪声在现实世界的数据集中被广泛观察到。为了减轻过度拟合对深度模型标记噪声的负面影响，有效的策略（例如，重新加权或损失校正）已广泛应用于流行的方法中，这些方法通常是在元学习场景下学习的。尽管概率元学习模型实现了对噪声的鲁棒性，但它们通常会遭受模型崩溃的影响，从而降低泛化性能。在本文中，我们提出变分校正推理（VRI），将损失函数的自适应校正公式化为摊销变分推理问题，并在元学习框架下推导证据下界。具体来说，VRI通过将校正向量视为潜在变量来构建为分层贝叶斯，它可以通过额外的随机性正则化来校正噪声样本的损失，因此对标签噪声更加鲁棒。为了实现校正向量的推断，我们用摊销元网络来近似其条件后验。通过在VRI中引入变分项，可以准确估计条件后验并避免崩溃为Dirac delta函数，从而可以显着提高泛化性能。精心设计的元网络和先验网络坚持平滑假设，从而能够生成可靠的校正向量。给定一组干净的元数据，VRI 可以在双层优化编程中有效地进行元学习。此外，理论分析保证了我们的算法可以有效地学习元网络。全面的比较实验和分析验证了其对噪声标签的鲁棒学习的有效性，特别是在存在开放集噪声的情况下。

Title: Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

Authors: Jianzhang Zhang, Yijing Tian, Jiwang Qu, Chuang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17295
Pdf URL: https://arxiv.org/pdf/2603.17295
Copy Paste: [[2603.17295]] Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation(https://arxiv.org/abs/2603.17295)
Keywords: generation
Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
摘要：故事可视化需要生成顺序图像，在语义上与不断发展的叙述保持一致，同时保持角色身份和视觉风格的严格一致性。然而，现有的方法常常会遇到主题不一致和身份漂移的问题，特别是在描述复杂的交互或扩展的叙事弧时。为了应对这些挑战，我们提出了一个有凝聚力的两阶段框架，旨在生成强大且一致的故事。首先，我们引入群组共享注意力（GSA），这是一种通过在注意力层内实现无损跨样本信息流来促进内在一致性的机制。这使得模型能够在结构上编码跨帧的身份对应关系，而无需依赖外部编码器。其次，我们利用直接偏好优化（DPO）使生成的输出与人类审美和叙事标准保持一致。与依赖于相互冲突的辅助损失的传统方法不同，我们的方法通过从整体偏好数据中学习，同时增强视觉保真度和身份保留。对 ViStoryBench 基准的广泛评估表明，我们的方法建立了一种新的最先进的方法，其性能显着优于强大的基线，在角色识别 (CIDS) 方面获得 +10.0，在风格一致性 (CSD) 方面获得 +18.7，同时保留了高保真生成。

Title: WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation

Authors: Zahin Sufiyan, Shadan Golestan, Yoshihiro Mitsuka, Shotaro Miwa, Osmar Zaiane
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17301
Pdf URL: https://arxiv.org/pdf/2603.17301
Copy Paste: [[2603.17301]] WINFlowNets: Warm-up Integrated Networks Training of Generative Flow Networks for Robotics and Machine Fault Adaptation(https://arxiv.org/abs/2603.17301)
Keywords: generative
Abstract: Generative Flow Networks for continuous scenarios (CFlowNets) have shown promise in solving sequential decision-making tasks by learning stochastic policies using a flow and a retrieval network. Despite their demonstrated efficiency compared to state-of-the-art Reinforcement Learning (RL) algorithms, their practical application in robotic control tasks is constrained by the reliance on pre-training the retrieval network. This dependency poses challenges in dynamic robotic environments, where pre-training data may not be readily available or representative of the current environment. This paper introduces WINFlowNets, a novel CFlowNets framework that enables the co-training of flow and retrieval networks. WINFlowNets begins with a warm-up phase for the retrieval network to bootstrap its policy, followed by a shared training architecture and a shared replay buffer for co-training both networks. Experiments in simulated robotic environments demonstrate that WINFlowNets surpasses CFlowNets and state-of-the-art RL algorithms in terms of average reward and training stability. Furthermore, WINFlowNets exhibits strong adaptive capability in fault environments, making it suitable for tasks that demand quick adaptation with limited sample data. These findings highlight WINFlowNets' potential for deployment in dynamic and malfunction-prone robotic systems, where traditional pre-training or sample inefficient data collection may be impractical.
摘要：用于连续场景的生成流网络 (CFlowNets) 通过使用流和检索网络学习随机策略，在解决顺序决策任务方面表现出了良好的前景。尽管与最先进的强化学习（RL）算法相比，它们的效率很高，但它们在机器人控制任务中的实际应用受到对检索网络预训练的依赖的限制。这种依赖性在动态机器人环境中带来了挑战，其中预训练数据可能不容易获得或无法代表当前环境。本文介绍了 WINFlowNets，这是一种新颖的 CFlowNets 框架，可以实现流网络和检索网络的协同训练。 WINFlowNets 从检索网络的预热阶段开始，以引导其策略，然后是共享训练架构和共享重播缓冲区，用于共同训练两个网络。模拟机器人环境中的实验表明，WINFlowNets 在平均奖励和训练稳定性方面超越了 CFlowNets 和最先进的 RL 算法。此外，WINFlowNets在故障环境中表现出很强的自适应能力，适合需要在样本数据有限的情况下快速适应的任务。这些发现凸显了 WINFlowNets 在动态和容易发生故障的机器人系统中部署的潜力，在这些系统中，传统的预训练或样本低效数据收集可能不切实际。

Title: Stereo World Model: Camera-Guided Stereo Video Generation

Authors: Yang-Tian Sun, Zehuan Huang, Yifan Niu, Lin Ma, Yan-Pei Cao, Yuewen Ma, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17375
Pdf URL: https://arxiv.org/pdf/2603.17375
Copy Paste: [[2603.17375]] Stereo World Model: Camera-Guided Stereo Video Generation(https://arxiv.org/abs/2603.17375)
Keywords: generation
Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
摘要：我们提出了 StereoWorld，一种相机调节的立体世界模型，它联合学习端到端立体视频的外观和双目几何形状，这种 http URL 单目 RGB 或 RGBD 方法，StereoWorld 专门在 RGB 模态内运行，同时直接根据视差来接地几何形状。为了有效地实现一致的立体生成，我们的方法引入了两个关键设计：（1）统一的相机框架 RoPE，通过相机感知的旋转位置编码增强潜在标记，实现相对、视图和时间一致的调节，同时通过稳定的注意力初始化保留预先训练的视频先验； (2) 立体感知注意力分解，将完整的 4D 注意力分解为 3D 视图内注意力和水平行注意力，利用极线先于以低得多的计算捕获视差对齐对应关系。在各个基准测试中，StereoWorld 比强大的单目然后转换管道提高了立体一致性、视差精度和相机运动保真度，实现了 3 倍以上的生成速度，并且视点一致性额外提高了 5%。除了基准之外，StereoWorld 还可以实现端到端双目 VR 渲染，无需深度估计或修复，通过度量尺度深度基础增强具体策略学习，并与长视频蒸馏兼容以扩展交互式立体合成。
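
Illustrative sketch (not the authors' code): the abstract factors full 4D attention into 3D intra-view attention plus horizontal row attention. The back-of-the-envelope count below compares how many query-key pairs each scheme touches. It assumes two views, tokens indexed by (view, frame, row, column), and a row attention in which each token attends to every token on the same row of the same frame across both views. These are assumptions for the sketch, and real compute also depends on head dimension and implementation details that are not modeled here.

```python
from typing import Dict


def attention_pair_counts(frames: int, height: int, width: int, views: int = 2) -> Dict[str, int]:
    """Count query-key pairs for full 4D attention versus an assumed decomposition."""
    tokens = views * frames * height * width
    # Full 4D attention: every token attends to every token.
    full_4d = tokens * tokens
    # Intra-view 3D attention: every token attends to all tokens of its own view.
    intra_view_3d = tokens * (frames * height * width)
    # Row attention (assumed form): same frame and same row, across all views.
    row = tokens * (views * width)
    return {
        "full_4d": full_4d,
        "intra_view_3d": intra_view_3d,
        "row": row,
        "decomposed": intra_view_3d + row,
    }


if __name__ == "__main__":
    counts = attention_pair_counts(frames=2, height=3, width=4)
    for name, value in counts.items():
        print(f"{name}: {value}")
```

Under these assumptions the row term is small compared with the intra-view term, so the decomposed cost is dominated by attention within a single view.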

Title: SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction

Authors: Shuizhou Chen, Lang Yu, Kedu Jin, Songming Zhang, Hao Wu, Wenxuan Huang, Sheng Xu, Quan Qian, Qin Chen, Lei Bai, Siqi Sun, Zhangyang Gao
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.17380
Pdf URL: https://arxiv.org/pdf/2603.17380
Copy Paste: [[2603.17380]] SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction(https://arxiv.org/abs/2603.17380)
Keywords: generative
Abstract: Virtual cell models aim to enable in silico experimentation by predicting how cells respond to genetic, chemical, or cytokine perturbations from single-cell measurements. In practice, however, large-scale perturbation prediction remains constrained by three coupled bottlenecks: inefficient training and inference pipelines, unstable modeling in high-dimensional sparse expression space, and evaluation protocols that overemphasize reconstruction-like accuracy while underestimating biological fidelity. In this work, we present SCALE, a specialized large-scale foundation model for virtual cell perturbation prediction that addresses the above limitations jointly. First, we build a BioNeMo-based training and inference framework that substantially improves data throughput, distributed scalability, and deployment efficiency, yielding a 12.51x speedup on pretraining and 1.29x on inference over the prior SOTA pipeline under matched system settings. Second, we formulate perturbation prediction as conditional transport and implement it with a set-aware flow architecture that couples LLaMA-based cellular encoding with endpoint-oriented supervision. This design yields more stable training and stronger recovery of perturbation effects. Third, we evaluate the model on Tahoe-100M using a rigorous cell-level protocol centered on biologically meaningful metrics rather than reconstruction alone. On this benchmark, our model improves PDCorr by 12.02% and DE Overlap by 10.66% over STATE. Together, these results suggest that advancing virtual cells requires not only better generative objectives, but also the co-design of scalable infrastructure, stable transport modeling, and biologically faithful evaluation.
摘要：虚拟细胞模型旨在通过单细胞测量预测细胞如何响应遗传、化学或细胞因子扰动，从而实现计算机实验。然而，在实践中，大规模扰动预测仍然受到三个耦合瓶颈的限制：低效的训练和推理流程、高维稀疏表达空间中的不稳定建模以及过分强调类似重建的准确性而低估生物保真度的评估协议。在这项工作中，我们提出了一种用于虚拟细胞扰动预测的专用大规模基础模型 SCALE，共同解决了上述限制。首先，我们构建了一个基于 BioNeMo 的训练和推理框架，该框架可大幅提高数据吞吐量、分布式可扩展性和部署效率，在匹配的系统设置下，与之前的 SOTA 管道相比，预训练速度提高了 12.51*，推理速度提高了 1.29*。其次，我们将扰动预测制定为条件传输，并使用集合感知流架构来实现它，该架构将基于 LLaMA 的蜂窝编码与面向端点的监督相结合。这种设计产生更稳定的训练和更强的扰动效果恢复。第三，我们使用严格的细胞级协议来评估 Tahoe-100M 上的模型，该协议以具有生物学意义的指标为中心，而不仅仅是重建。在此基准测试中，我们的模型比 STATE 将 PDCorr 提高了 12.02%，将 DE Overlap 提高了 10.66%。总之，这些结果表明，推进虚拟细胞不仅需要更好的生成目标，还需要可扩展基础设施的共同设计、稳定的传输模型和生物学上忠实的评估。

Title: Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models

Authors: Rui Wu, Hong Xie, Yongjun Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17384
Pdf URL: https://arxiv.org/pdf/2603.17384
Copy Paste: [[2603.17384]] Cohomological Obstructions to Global Counterfactuals: A Sheaf-Theoretic Foundation for Generative Causal Models(https://arxiv.org/abs/2603.17384)
Keywords: generative
Abstract: Current continuous generative models (e.g., Diffusion Models, Flow Matching) implicitly assume that locally consistent causal mechanisms naturally yield globally coherent counterfactuals. In this paper, we prove that this assumption fails fundamentally when the causal graph exhibits non-trivial homology (e.g., structural conflicts or hidden confounders). We formalize structural causal models as cellular sheaves over Wasserstein spaces, providing a strict algebraic topological definition of cohomological obstructions in measure spaces. To ensure computational tractability and avoid deterministic singularities (which we define as manifold tearing), we introduce entropic regularization and derive the Entropic Wasserstein Causal Sheaf Laplacian, a novel system of coupled non-linear Fokker-Planck equations. Crucially, we prove an entropic pullback lemma for the first variation of pushforward measures. By integrating this with the Implicit Function Theorem (IFT) on Sinkhorn optimality conditions, we establish a direct algorithmic bridge to automatic differentiation (VJP), achieving O(1)-memory reverse-mode gradients strictly independent of the iteration horizon. Empirically, our framework successfully leverages thermodynamic noise to navigate topological barriers ("entropic tunneling") in high-dimensional scRNA-seq counterfactuals. Finally, we invert this theoretical framework to introduce the Topological Causal Score, demonstrating that our Sheaf Laplacian acts as a highly sensitive algebraic detector for topology-aware causal discovery.
摘要：当前的连续生成模型（例如扩散模型、流匹配）隐含地假设局部一致的因果机制自然会产生全局一致的反事实。在本文中，我们证明当因果图表现出非平凡的同源性（例如结构冲突或隐藏的混杂因素）时，这种假设从根本上失败。我们将结构因果模型形式化为 Wasserstein 空间上的细胞滑轮，为测度空间中的上同调障碍提供严格的代数拓扑定义。为了确保计算的易处理性并避免确定性奇点（我们将其定义为流形撕裂），我们引入熵正则化并推导熵瓦瑟斯坦因果束拉普拉斯算子，这是一个新颖的耦合非线性福克-普朗克方程组。至关重要的是，我们证明了前推措施第一个变体的熵回拉引理。通过将其与 Sinkhorn 最优条件上的隐式函数定理 (IFT) 相结合，我们建立了自动微分 (VJP) 的直接算法桥梁，实现了严格独立于迭代范围的 O(1) 内存反向模式梯度。根据经验，我们的框架成功地利用热力学噪声来导航高维 scRNA-seq 反事实中的拓扑障碍（“熵隧道”）。最后，我们反转这个理论框架来引入拓扑因果分数，证明我们的谢夫拉普拉斯算子可以作为高度灵敏的代数检测器，用于拓扑感知的因果发现。

Title: The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions

Authors: Rui Wu, Hong Xie, Yongjun Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17385
Pdf URL: https://arxiv.org/pdf/2603.17385
Copy Paste: [[2603.17385]] The Causal Uncertainty Principle: Manifold Tearing and the Topological Limits of Counterfactual Interventions(https://arxiv.org/abs/2603.17385)
Keywords: generative
Abstract: Judea Pearl's do-calculus provides a foundation for causal inference, but its translation to continuous generative models remains fraught with geometric challenges. We establish the fundamental limits of such interventions. We define the Counterfactual Event Horizon and prove the Manifold Tearing Theorem: deterministic flows inevitably develop finite-time singularities under extreme interventions. We establish the Causal Uncertainty Principle for the trade-off between intervention extremity and identity preservation. Finally, we introduce Geometry-Aware Causal Flow (GACF), a scalable algorithm that utilizes a topological radar to bypass manifold tearing, validated on high-dimensional scRNA-seq data.
摘要：Judea Pearl 的 do-calculus 为因果推理提供了基础，但将其转化为连续生成模型仍然充满了几何挑战。我们确定此类干预措施的基本限制。我们定义了反事实事件视界并证明了流形撕裂定理：确定性流在极端干预下不可避免地会产生有限时间奇点。我们建立了因果不确定性原则来权衡干预极端和身份保留之间的关系。最后，我们介绍了几何感知因果流（GACF），这是一种可扩展的算法，利用拓扑雷达绕过流形撕裂，并在高维 scRNA-seq 数据上进行了验证。

Title: Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

Authors: Rui Hong, Jana Kosecka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17388
Pdf URL: https://arxiv.org/pdf/2603.17388
Copy Paste: [[2603.17388]] Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis(https://arxiv.org/abs/2603.17388)
Keywords: generation, generative
Abstract: Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs continues to be very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using a Human Motion MDM-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.
摘要：根据文本输入生成自然、正确且视觉流畅的 3D 头像手语动作仍然非常具有挑战性。在这项工作中，我们使用 ASL-LEX 2.0 注释（例如手部形状、手部位置和运动）训练 3D 身体运动的生成模型，并探索语音属性调节在手语运动生成中的作用。我们首先使用具有 SMPL-X 表示的 Human Motion MDM 式扩散模型建立强大的扩散基线，该模型在光泽度辨别指标方面优于最先进的 CVAE 方法 SignAvatar。然后，我们使用不同的文本编码器（CLIP 与 T5）、调节模式（仅光泽与光泽+语音属性）和属性符号格式（符号与自然语言）系统地研究文本调节的作用。我们的分析表明，将符号 ASL-LEX 符号翻译为自然语言是有效的基于 CLIP 的属性调节的必要条件，而 T5 很大程度上不受这种翻译的影响。此外，我们性能最佳的变体（具有映射属性的 CLIP）在所有指标上都优于 SignAvatar。这些发现强调输入表示是基于文本编码器的属性调节的关键因素，并激发了结构化调节方法，其中光泽和语音属性通过独立的路径进行编码。

Title: Harnessing the Power of Foundation Models for Accurate Material Classification

Authors: Qingran Lin, Fengwei Yang, Chaolun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17390
Pdf URL: https://arxiv.org/pdf/2603.17390
Copy Paste: [[2603.17390]] Harnessing the Power of Foundation Models for Accurate Material Classification(https://arxiv.org/abs/2603.17390)
Keywords: generation
Abstract: Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific [...]. Experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.
摘要：材料分类已成为计算机视觉和图形领域的一项关键任务，支持将准确的材料属性分配给广泛的数字和现实世界应用。虽然传统上被视为图像分类任务，但由于注释数据的稀缺，该领域面临着重大挑战，限制了训练模型的准确性和通用性。视觉语言基础模型（VLM）的最新进展为解决这些问题提供了有希望的途径，但利用这些模型的现有解决方案在材料识别任务中仍然表现出不令人满意的结果。在这项工作中，我们提出了一种新颖的框架，可以有效地利用基础模型来克服数据限制并提高分类准确性。我们的方法集成了两个关键创新：（a）强大的图像生成和自动标记管道，可以使用以材料为中心的图像创建多样化且高质量的训练数据集，并通过融合文本提示中的对象语义和材料属性来自动分配标签； (b) 从 VLM 中提取信息的先验合并策略，结合联合微调方法，优化预先训练的视觉基础模型以及 VLM 派生的先验，在适应特定材料的同时保留广泛的通用性。该 http URL 实验证明了多个数据集的显着改进。我们表明，我们的合成数据集有效地捕获了现实世界材料的特征，并且视觉语言模型先验的集成显着提高了最终性能。源代码和数据集将被发布。

Title: Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

Authors: Rui Hong, Shuxue Quan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17398
Pdf URL: https://arxiv.org/pdf/2603.17398
Copy Paste: [[2603.17398]] Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion(https://arxiv.org/abs/2603.17398)
Keywords: generation
Abstract: We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy: global attention in down-sampling and middle blocks for semantic stabilization, and motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.
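The local-versus-global attention rule described in this abstract can be illustrated with a small NumPy sketch. Everything below is an assumption made for illustration: the function names, the mean-absolute-frame-difference motion estimate, the hard threshold, and the window radius are hypothetical and are not taken from the paper, which may well use a softer, learned gating.

```python
import numpy as np

def estimate_motion(frames: np.ndarray) -> float:
    """Assumed motion proxy: mean absolute difference between consecutive frames."""
    return float(np.abs(np.diff(frames, axis=0)).mean())

def temporal_mask(num_frames: int, motion: float,
                  threshold: float = 0.1, radius: int = 1) -> np.ndarray:
    """Boolean (T, T) mask; True means frame i may attend to frame j."""
    idx = np.arange(num_frames)
    if motion > threshold:
        # High motion: attend only to frames within `radius` time steps.
        return np.abs(idx[:, None] - idx[None, :]) <= radius
    # Low motion: attend to every frame to encourage scene consistency.
    return np.ones((num_frames, num_frames), dtype=bool)

def temporal_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head self-attention over the time axis of x with shape (T, D)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed frame pairs
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x
```

The diagonal of the mask is always allowed, so every softmax row has at least one finite score. In a real model the queries, keys, and values would come from learned projections rather than from `x` directly.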

Title: Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching

Authors: Yaozhong Shi, Grigorios Lavrentiadis, Konstantinos Tsalouchidis, Zachary E. Ross, David McCallen, Caifeng Zou, Kamyar Azizzadenesheli, Domniki Asimaki
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17403
Pdf URL: https://arxiv.org/pdf/2603.17403
Copy Paste: [[2603.17403]] Large-Scale 3D Ground-Motion Synthesis with Physics-Inspired Latent Operator Flow Matching(https://arxiv.org/abs/2603.17403)
Keywords: generative
Abstract: Earthquake hazard analysis and design of spatially distributed infrastructure, such as power grids and energy pipeline networks, require scenario-specific ground-motion time histories with realistic frequency content and spatiotemporal coherence. However, producing the large ensembles needed for uncertainty quantification with physics-based simulations is computationally intensive and impractical for engineering workflows. To address this challenge, we introduce Ground-Motion Flow (GMFlow), a physics-inspired latent operator flow matching framework that generates realistic, large-scale regional ground-motion time-histories conditioned on physical parameters. Validated on simulated earthquake scenarios in the San Francisco Bay Area, GMFlow generates spatially coherent ground motion across more than 9 million grid points in seconds, achieving a 10,000-fold speedup over the simulation workflow, which opens a path toward rapid and uncertainty-aware hazard assessment for distributed infrastructure. More broadly, GMFlow advances mesh-agnostic functional generative modeling and could potentially be extended to the synthesis of large-scale spatiotemporal physical fields in diverse scientific domains.

Title: Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression

Authors: Xinning Chai, Zhengxue Cheng, Xin Li, Rong Xie, Li Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17408
Pdf URL: https://arxiv.org/pdf/2603.17408
Copy Paste: [[2603.17408]] Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression(https://arxiv.org/abs/2603.17408)
Keywords: restoration, super-resolution, generative
Abstract: Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high fidelity and high realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.

Title: ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Authors: Xiangyu Kong, Xiaoyu Jin, Yihan Pan, Haoqin Sun, Hengde Zhu, Xiaoming Xu, Xiaoming Wei, Lu Liu, Siyang Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17427
Pdf URL: https://arxiv.org/pdf/2603.17427
Copy Paste: [[2603.17427]] ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation(https://arxiv.org/abs/2603.17427)
Keywords: generation
Abstract: In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head videos emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.

Title: FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Authors: Weidong Chen, Cheng Ye, Zhendong Mao, Peipei Song, Xinyan Liu, Lei Zhang, Xiaojun Chang, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17455
Pdf URL: https://arxiv.org/pdf/2603.17455
Copy Paste: [[2603.17455]] FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning(https://arxiv.org/abs/2603.17455)
Keywords: generation
Abstract: Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation makes it difficult for these methods to deal with the factual-emotional bias, which refers to the factual and emotional requirements on generation differing from sample to sample. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which collaboratively mines factual-emotional semantics through a unified architecture and provides adaptive and accurate guidance for generation, breaking through the tendency to compromise between factual and emotional descriptions when learning over all samples. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines the different components using the video content to effectively mine the factual semantics, while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and an emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment each factual semantic with emotions. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.

Title: AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization

Authors: Dailan He, Guanlin Feng, Xingtong Ge, Yi Zhang, Bingqi Ma, Guanglu Song, Yu Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17461
Pdf URL: https://arxiv.org/pdf/2603.17461
Copy Paste: [[2603.17461]] AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization(https://arxiv.org/abs/2603.17461)
Keywords: generation
Abstract: Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
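As a rough illustration of the chunk-level forking and group-relative update mentioned in this abstract, the sketch below shows the standard GRPO-style advantage (each candidate's reward normalized by its group's mean and standard deviation) together with a hypothetical forking helper. The helper name `fork_candidates`, the chunk representation, and the `eps` constant are assumptions for illustration; the abstract does not specify how AR-CoPO builds its neighborhood candidates.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: rewards standardized within one candidate group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def fork_candidates(chunks, num_candidates: int, rng: np.random.Generator):
    """Hypothetical forking step: pick a random chunk index and give every
    candidate the same prefix of chunks generated before that index."""
    fork_at = int(rng.integers(len(chunks)))
    shared_prefix = list(chunks[:fork_at])
    candidates = [list(shared_prefix) for _ in range(num_candidates)]
    return fork_at, candidates
```

Because all candidates share the prefix before the fork, differences in their sequence-level rewards can be attributed to what happens from the forked chunk onward, which is what makes a localized update plausible.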

Title: UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Authors: Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.17476
Pdf URL: https://arxiv.org/pdf/2603.17476
Copy Paste: [[2603.17476]] UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models(https://arxiv.org/abs/2603.17476)
Keywords: generation
Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. UniSAFE comprises 6,802 curated instances, and we use it to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at this https URL.

Title: Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Authors: Jiawei Zhou, Chi Zhang, Xiang Feng, Qiming Zhang, Haibo Qiu, Lihuo He, Dengpan Ye, Xinbo Gao, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17508
Pdf URL: https://arxiv.org/pdf/2603.17508
Copy Paste: [[2603.17508]] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation(https://arxiv.org/abs/2603.17508)
Keywords: generation, generative
Abstract: We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at this https URL.

Title: Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model

Authors: Luca Pellegrini
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2603.17523
Pdf URL: https://arxiv.org/pdf/2603.17523
Copy Paste: [[2603.17523]] Translation Invariance of Neural Operators for the FitzHugh-Nagumo Model(https://arxiv.org/abs/2603.17523)
Keywords: generation
Abstract: Neural Operators (NOs) are a powerful deep learning framework designed to learn the solution operators that arise from partial differential equations. This study investigates the ability of NOs to capture the stiff spatio-temporal dynamics of the FitzHugh-Nagumo model, which describes excitable cells. A key contribution of this work is evaluating translation invariance using a novel training strategy. NOs are trained using an applied current with varying spatial locations and intensities at a fixed time, and the test set introduces a more challenging out-of-distribution scenario in which the applied current is translated in both time and space. This approach significantly reduces the computational cost of dataset generation. Moreover, we benchmark seven NO architectures: Convolutional Neural Operators (CNOs), Deep Operator Networks (DONs), DONs with CNN encoder (DONs-CNN), Proper Orthogonal Decomposition DONs (POD-DONs), Fourier Neural Operators (FNOs), Tucker Tensorized FNOs (TFNOs), and Localized Neural Operators (LocalNOs). We evaluate these models based on training and test accuracy, efficiency, and inference speed. Our results reveal that CNOs perform well on translated test dynamics. However, they require higher training costs, though their performance on the training set is similar to that of the other considered architectures. In contrast, FNOs achieve the lowest training error but have the highest inference time. Regarding the translated dynamics, FNOs and their variants provide less accurate predictions. Finally, DONs and their variants demonstrate high efficiency in both training and inference; however, they do not generalize well to the test set. These findings highlight the current capabilities and limitations of NOs in capturing complex ionic model dynamics and provide a comprehensive benchmark including their application to scenarios involving translated dynamics.
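To make the translated-current test scenario concrete, here is a minimal sketch of a single-cell (zero-dimensional) FitzHugh-Nagumo model integrated with explicit Euler. It is an assumption-laden illustration, not the paper's setup: the cubic form of the nonlinearity, the parameter values `a`, `eps`, `gamma`, the time step, and the rectangular stimulus are all chosen here for convenience, and the spatial PDE studied in the paper is not reproduced. The point is only that, starting from rest, delaying the applied current delays the response by the same amount, which is the property a neural operator would need to capture on the translated test set.

```python
import numpy as np

def simulate_fhn(current, dt=0.05, a=0.1, eps=0.01, gamma=0.5):
    """Explicit-Euler integration of an assumed FitzHugh-Nagumo form:

        dv/dt = v (v - a) (1 - v) - w + I(t)
        dw/dt = eps (v - gamma w)

    The cell starts at the rest state (v, w) = (0, 0)."""
    v = np.zeros(len(current) + 1)
    w = np.zeros(len(current) + 1)
    for n, i_app in enumerate(current):
        v[n + 1] = v[n] + dt * (v[n] * (v[n] - a) * (1.0 - v[n]) - w[n] + i_app)
        w[n + 1] = w[n] + dt * eps * (v[n] - gamma * w[n])
    return v, w

def pulse(num_steps, start, width, amplitude):
    """Rectangular applied current that is nonzero on [start, start + width)."""
    i_app = np.zeros(num_steps)
    i_app[start:start + width] = amplitude
    return i_app

# Same stimulus, applied 200 steps later in the second run.
v_base, _ = simulate_fhn(pulse(2000, 100, 40, 0.5))
v_late, _ = simulate_fhn(pulse(2000, 300, 40, 0.5))
```

Comparing `v_late[200:]` with `v_base[:-200]` shows the two traces coincide up to the time shift.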

Title: ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling

Authors: Daowen Li, Ruixiao Dong, Ying Chen, Kai Li, Ding Ding, Li Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17546
Pdf URL: https://arxiv.org/pdf/2603.17546
Copy Paste: [[2603.17546]] ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling(https://arxiv.org/abs/2603.17546)
Keywords: generative
Abstract: Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.
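The coarse-to-fine residual representation described in this abstract can be sketched on a single 2D array. This is a simplified stand-in under stated assumptions: it uses average pooling and nearest-neighbour upsampling, operates on real values instead of quantized tokens, and omits the entropy coder and the autoregressive context model entirely. The function names are hypothetical.

```python
import numpy as np

def downsample(x: np.ndarray, size: int) -> np.ndarray:
    """Average-pool a square (N, N) map down to (size, size); N must be a multiple of size."""
    factor = x.shape[0] // size
    return x.reshape(size, factor, size, factor).mean(axis=(1, 3))

def upsample(x: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsample a square map to (size, size)."""
    factor = size // x.shape[0]
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def encode_residual_scales(x: np.ndarray, sizes):
    """Coarse-to-fine residual maps: each scale codes what earlier scales missed."""
    residuals = []
    recon = np.zeros_like(x, dtype=float)
    for size in sizes:
        r = downsample(x - recon, size)
        residuals.append(r)
        recon = recon + upsample(r, x.shape[0])
    return residuals

def decode(residuals, full_size: int, num_scales: int) -> np.ndarray:
    """Reconstruct from only the first `num_scales` (coarsest) residual maps."""
    recon = np.zeros((full_size, full_size))
    for r in residuals[:num_scales]:
        recon = recon + upsample(r, full_size)
    return recon
```

Transmitting fewer scales lowers the rate at the cost of detail. In this sketch the missing fine scales are simply dropped, whereas the abstract describes predicting the truncated fine-scale tokens at the decoder.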

Title: FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Authors: Hugo Caselles-Dupré (1), Mathis Koroglu (1 and 2), Guillaume Jeanneret (2), Arnaud Dapogny (2), Matthieu Cord (2) ((1) Obvious Research, Paris, France, (2) Institute of Intelligent Systems and Robotics - Sorbonne University, Paris, France)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17555
Pdf URL: https://arxiv.org/pdf/2603.17555
Copy Paste: [[2603.17555]] FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion(https://arxiv.org/abs/2603.17555)
Keywords: generation
Abstract: Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
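The closed-form fusion mentioned in this abstract can be illustrated with one plausible form of such an objective. The assumption here is a per-position weighted least-squares problem, minimizing sum_i w_i (x - e_i)^2 + lam (x - r)^2 over the tiles i that cover the position, where e_i are tile predictions and r is the upsampled low-resolution reference. Setting the derivative to zero gives x = (sum_i w_i e_i + lam r) / (sum_i w_i + lam). The exact objective, weights, and variable names in the paper may differ; `fuse_tiles` is a hypothetical helper.

```python
import numpy as np

def fuse_tiles(tile_preds, tile_slices, tile_weights, reference, lam):
    """Per-position minimiser of  sum_i w_i (x - e_i)^2 + lam (x - r)^2.

    tile_preds   : list of arrays, one prediction per tile
    tile_slices  : list of slices locating each tile in the full signal
    tile_weights : list of scalars (or arrays matching each tile)
    reference    : global reference r with the full output shape
    lam          : scalar or array; larger values pull harder toward r
    Every position must be covered by a tile or have lam > 0.
    """
    num = lam * reference.astype(float)
    den = np.zeros(reference.shape) + lam
    for pred, sl, w in zip(tile_preds, tile_slices, tile_weights):
        num[sl] += w * pred
        den[sl] += w
    return num / den

# Two overlapping 1-D tiles over a length-6 signal (toy data).
preds = [np.ones(4), np.full(4, 3.0)]
slices = [slice(0, 4), slice(2, 6)]
fused = fuse_tiles(preds, slices, [1.0, 1.0], np.zeros(6), lam=1.0)
```

With `lam = 0` this reduces to ordinary weighted tile averaging. Passing an array for `lam` gives position-dependent regularization, loosely in the spirit of the spatial regularization variable the abstract mentions.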

Title: Face anonymization preserving facial expressions and photometric realism

Authors: Luigi Celona, Simone Bianco, Raimondo Schettini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17567
Pdf URL: https://arxiv.org/pdf/2603.17567
Copy Paste: [[2603.17567]] Face anonymization preserving facial expressions and photometric realism(https://arxiv.org/abs/2603.17567)
Keywords: generative
Abstract: The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject's identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency -- specifically attributes such as illumination and skin tone -- that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.

Title: FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models

Authors: Simon Klüttermann, Tim Katzke, Phuong Huong Nguyen, Emmanuel Müller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17570
Pdf URL: https://arxiv.org/pdf/2603.17570
Copy Paste: [[2603.17570]] FoMo X: Modular Explainability Signals for Outlier Detection Foundation Models(https://arxiv.org/abs/2603.17570)
Keywords: generative
Abstract: Tabular foundation models, specifically Prior-Data Fitted Networks (PFNs), have revolutionized outlier detection (OD) by enabling unsupervised zero-shot adaptation to new datasets without training. However, despite their predictive power, these models typically function as opaque black boxes, outputting scalar outlier scores that lack the operational context required for safety-critical decision-making. Existing post-hoc explanation methods are often computationally prohibitive for real-time deployment or fail to capture the epistemic uncertainty inherent in zero-shot inference. In this work, we introduce FoMo-X, a modular framework that equips OD foundation models with intrinsic, lightweight diagnostic capabilities. We leverage the insight that the frozen embeddings of a pretrained PFN backbone already encode rich, context-conditioned relational information. FoMo-X attaches auxiliary diagnostic heads to these embeddings, trained offline using the same generative simulator prior as the backbone. This allows us to distill computationally expensive properties, such as Monte Carlo dropout based epistemic uncertainty, into a deterministic, single-pass inference. We instantiate FoMo-X with two novel heads: a Severity Head that discretizes deviations into interpretable risk tiers, and an Uncertainty Head that provides calibrated confidence measures. Extensive evaluation on synthetic and real-world benchmarks (ADBench) demonstrates that FoMo-X recovers ground-truth diagnostic signals with high fidelity and negligible inference overhead. By bridging the gap between foundation model performance and operational explainability, FoMo-X offers a scalable path toward trustworthy, zero-shot outlier detection.

Title: Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

Authors: Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17583
Pdf URL: https://arxiv.org/pdf/2603.17583
Copy Paste: [[2603.17583]] Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing(https://arxiv.org/abs/2603.17583)
Keywords: generation, generative
Abstract: Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility, three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
摘要：从自然语言编辑 3D 室内场景在概念上很简单，但在技术上具有挑战性。现有的开放词汇系统通常会重新生成场景的大部分，或者依赖于破坏空间结构的图像空间编辑，从而导致意外的全局变化或物理上不一致的布局。这些限制源于将编辑主要视为一项生成任务。我们持不同的观点。用户指令定义了所需的世界状态，而编辑应该是使该状态成立同时保留其他所有内容的最小操作序列。这种观点催生了 Edit-As-Act，这是一个在 3D 空间中执行开放词汇场景编辑作为目标回归规划的框架。给定源场景和自由格式指令，Edit-As-Act 可以在 EditLang 中预测符号目标谓词和计划，EditLang 是一种受 PDDL 启发的动作语言，我们使用明确的前提条件和效果来编码支持、接触、碰撞和其他几何关系。语言驱动的规划器提出行动，验证器强制目标导向性、单调性和物理可行性，产生可解释的和物理连贯的转换。通过将推理与低级生成分离，“编辑即行为”实现了指令保真度、语义一致性和物理合理性——现有范式无法同时满足这三个标准。在 E2A-Bench（我们跨 9 个室内环境的 63 个编辑任务的基准）上，Edit-As-Act 在所有编辑类型和场景类别中都显着优于先前的方法。
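Note: The goal-regressive planning mentioned in the abstract above can be illustrated with classic STRIPS-style regression. The following is a minimal sketch under stated assumptions, not the authors' EditLang or planner: the `Action` class, the predicate strings, and the toy lamp scene are all hypothetical, and no validator for monotonicity or physical feasibility is modeled. It shows only the core idea of regressing a goal through an action's effects to obtain the subgoal that must hold beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action; names are illustrative, not EditLang syntax."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()


def regress(goal, action):
    """Return the subgoal that must hold before `action` so `goal` holds after it.

    Returns None when the action is irrelevant (achieves no goal predicate)
    or inconsistent (deletes a goal predicate).
    """
    if not (action.add_effects & goal):
        return None
    if action.del_effects & goal:
        return None
    return (goal - action.add_effects) | action.preconditions


def plan_backward(goal, state, actions, max_depth=10):
    """Depth-limited backward search from the goal to the current state."""
    if goal <= state:
        return []
    if max_depth == 0:
        return None
    for action in actions:
        subgoal = regress(goal, action)
        if subgoal is None:
            continue
        prefix = plan_backward(subgoal, state, actions, max_depth - 1)
        if prefix is not None:
            return prefix + [action.name]
    return None


# Hypothetical toy scene: move a lamp from the floor onto a table.
state = frozenset({"on(lamp, floor)", "clear(table)"})
goal = frozenset({"on(lamp, table)"})
actions = [
    Action(
        name="place(lamp, table)",
        preconditions=frozenset({"holding(lamp)", "clear(table)"}),
        add_effects=frozenset({"on(lamp, table)"}),
        del_effects=frozenset({"holding(lamp)", "clear(table)"}),
    ),
    Action(
        name="pick(lamp, floor)",
        preconditions=frozenset({"on(lamp, floor)"}),
        add_effects=frozenset({"holding(lamp)"}),
        del_effects=frozenset({"on(lamp, floor)"}),
    ),
]

plan = plan_backward(goal, state, actions)
print(plan)
```

Because the search starts from the goal, only actions that contribute a goal predicate are ever considered, which is one way to read the abstract's emphasis on a minimal action sequence that preserves everything else.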

Title: ReLaGS: Relational Language Gaussian Splatting

Authors: Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, Didier Stricker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17605
Pdf URL: https://arxiv.org/pdf/2603.17605
Copy Paste: [[2603.17605]] ReLaGS: Relational Language Gaussian Splatting(https://arxiv.org/abs/2603.17605)
Keywords: generation
Abstract: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: this https URL
摘要：在分割、检索和关系理解等任务中实现统一的 3D 感知和推理仍然具有挑战性，因为现有方法要么以对象为中心，要么依赖于昂贵的对象间推理训练。我们提出了一种新颖的框架，无需特定于场景的训练即可构建分层语言蒸馏的高斯场景及其 3D 语义场景图。高斯修剪机制细化了场景几何形状，而强大的多视图语言对齐策略将嘈杂的 2D 特征聚合成精确的 3D 对象嵌入。在此层次结构之上，我们使用视觉语言派生注释和基于图神经网络的关系推理构建开放词汇 3D 场景图。我们的方法通过联合建模分层语义和对象间/对象内关系，实现高效且可扩展的开放词汇 3D 推理，并在开放词汇分割、场景图生成和关系引导检索等任务中进行验证。项目页面：此 https URL

Title: Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies

Authors: Sinan Ibrahim, Grégoire Ouerdane, Hadi Salloum, Henni Ouerdane, Stefan Streif, Pavel Osinenko
Subjects: cs.LG, cs.AI, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2603.17631
Pdf URL: https://arxiv.org/pdf/2603.17631
Copy Paste: [[2603.17631]] Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies(https://arxiv.org/abs/2603.17631)
Keywords: generation
Abstract: The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions, under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
摘要：强化学习 (RL) 算法的客观比较非常复杂，因为不同 RL 方法的结果和性能基准对环境设计、奖励结构以及算法学习和环境动态中固有的随机性非常敏感。为了管理这种复杂性，我们引入了严格的基准测试框架，将逆最优性扩展到离散时间、控制仿射、带有噪声的非线性系统。我们的框架提供了必要和充分的条件，在这些条件下，规定的价值函数和策略对于构建的系统来说是最优的，从而能够通过同伦变化和随机参数系统地生成基准族。我们通过自动构建不同的环境来验证它，展示我们的框架跨算法进行受控和全面评估的能力。通过根据真实最优值评估标准方法，我们的工作为精确和严格的 RL 基准测试提供了可重复的基础。

Title: DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis

Authors: Aleksander Ogonowski, Konrad Klimaszewski, Przemysław Rokita
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2603.17637
Pdf URL: https://arxiv.org/pdf/2603.17637
Copy Paste: [[2603.17637]] DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis(https://arxiv.org/abs/2603.17637)
Keywords: generative
Abstract: We present DSS-GAN, the first generative adversarial network to employ Mamba as a hierarchical generator backbone for noise-to-image synthesis. The central contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific subvectors, each jointly projected with a class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan. Unlike conventional class conditioning that injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map, applied consistently across all generative scales. DSS-GAN achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple tested datasets. Analysis of the latent space reveals that directional subvectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.
摘要：我们提出了 DSS-GAN，这是首个采用 Mamba 作为分层生成器主干进行噪声到图像合成的生成对抗网络。核心贡献是定向潜在路由 (DLR)，这是一种新颖的调节机制，可将潜在向量分解为特定方向的子向量，每个子向量与类嵌入联合投影，以产生相应 Mamba 扫描的特征级仿射调制。与注入全局信号的传统类条件作用不同，DLR 沿着特征图的不同空间轴耦合类身份和潜在结构，并在所有生成尺度上一致应用。与 StyleGAN2-ADA 相比，DSS-GAN 在多个测试数据集上取得了更好的 FID、KID 和精确率-召回率分数。对潜在空间的分析表明，方向子向量表现出可测量的专业化：沿各个分量的扰动会在合成图像中产生结构化的、方向相关的变化。

Title: Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Authors: Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17651
Pdf URL: https://arxiv.org/pdf/2603.17651
Copy Paste: [[2603.17651]] Anchoring and Rescaling Attention for Semantically Coherent Inbetweening(https://arxiv.org/abs/2603.17651)
Keywords: generative
Abstract: Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
摘要：生成中间帧 (GI) 旨在合成第一个和最后一个关键帧之间的真实中间帧，而不仅仅是插值。随着序列变得越来越稀疏和运动越来越大，以前的 GI 模型一直在努力应对不一致的帧、不稳定的节奏和语义错位。由于 GI 涉及固定端点和许多看似合理的路径，因此此任务需要从关键帧和文本获得额外的指导来指定预期路径。因此，我们通过关键帧锚定的注意力偏差从关键帧和文本到每个中间帧提供语义和时间指导。我们还通过 Rescaled Temporal RoPE 更好地加强帧一致性，这使得 self-attention 能够更忠实地关注关键帧。 TGI-Bench 是第一个专门为文本条件 GI 评估而设计的基准，可以通过挑战目标评估来分析 GI 模型。无需额外的训练，我们的方法就可以在不同的挑战中实现最先进的帧一致性、语义保真度和长序列的稳定性。

Title: Few-Step Diffusion Sampling Through Instance-Aware Discretizations

Authors: Liangyu Yuan, Ruoyu Wang, Tong Zhao, Dingwen Fu, Mingkun Lei, Beier Zhu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17671
Pdf URL: https://arxiv.org/pdf/2603.17671
Copy Paste: [[2603.17671]] Few-Step Diffusion Sampling Through Instance-Aware Discretizations(https://arxiv.org/abs/2603.17671)
Keywords: generation, generative
Abstract: Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality, with a tuning cost that is marginal compared to training and with negligible inference overhead.
摘要：扩散和流动匹配模型通过模拟常微分方程或随机微分方程 (ODE/SDE) 定义的路径，从易于处理的先验分布开始，生成高保真数据。概率流 ODE 公式可以使用高级数值求解器来加速采样。正交但对于求解器设计至关重要的是离散化策略。虽然早期的方法采用手工启发式方法，而最近的方法采用基于优化的技术，但大多数现有策略在所有样本上强制执行全局共享的时间步计划。这种统一的处理无法考虑生成过程中特定于实例的复杂性，可能会限制性能。受合成数据受控实验的启发，该实验揭示了特定于实例的动态下全局调度的次优性，我们提出了一个实例感知的离散化框架。我们的方法学习根据输入相关的先验来调整时间步分配，将基于梯度的离散化搜索扩展到条件生成设置。跨不同设置（包括合成数据、像素空间扩散、潜在空间图像和视频流匹配模型）的经验结果表明，与训练和可忽略的推理开销相比，我们的方法以边际调整成本持续提高生成质量。

Title: Flow Matching Policy with Entropy Regularization

Authors: Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17685
Pdf URL: https://arxiv.org/pdf/2603.17685
Copy Paste: [[2603.17685]] Flow Matching Policy with Entropy Regularization(https://arxiv.org/abs/2603.17685)
Keywords: generative
Abstract: Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. Stochastic Differential Equation (SDE)-based diffusion policies often rely on indirect entropy control due to the intractability of the exact entropy, while also suffering from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by 7x compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
摘要：基于扩散的策略由于能够表示复杂的非高斯分布，因此在强化学习 (RL) 中广受欢迎。由于精确熵的难以处理，基于随机微分方程（SDE）的扩散策略通常依赖于间接熵控制，同时还受到通过迭代去噪链的计算上令人望而却步的策略梯度的影响。为了克服这些问题，我们提出了带有熵正则化的流匹配策略（FMER），这是一种基于常微分方程（ODE）的在线强化学习框架。 FMER 通过流匹配对策略进行参数化，并在最佳传输的推动下沿直线概率路径对动作进行采样。 FMER 利用模型的生成特性从候选集中构建优势加权目标速度场，将策略更新引导至高价值区域。通过导出易于处理的熵目标，FMER 能够实现有原则的最大熵优化，以增强探索。对稀疏多目标 FrankaKitchen 基准的实验表明，FMER 优于最先进的方法，同时在标准 MuJoco 基准上保持竞争力。此外，与重扩散基线 (QVPO) 相比，FMER 将训练时间缩短了 7 倍，与高效变体相比，训练时间缩短了 10-15%。
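Note: The "straight probability path" in the abstract above refers to the linear interpolation commonly used in flow matching. The following is a minimal sketch of that generic construction, not of FMER itself: the advantage weighting and entropy objective are not modeled, the function names are hypothetical, and an exact "oracle" velocity field stands in for a trained network. Along the path x_t = (1 - t) x_0 + t x_1 the target velocity is the constant x_1 - x_0, so forward Euler integration from t = 0 to t = 1 lands on the target.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 to target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t and equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(2)]  # sample from the prior
action = [0.5, -0.25]                               # hypothetical target action


def oracle(x, t):
    """Stand-in for a trained network: the exact conditional velocity field."""
    return target_velocity(noise, action)


sampled = euler_sample(noise, oracle)
midpoint = interpolate(noise, action, 0.5)
print(sampled, midpoint)
```

With a learned velocity field the path is only approximately straight, so the number of Euler steps becomes a trade-off between sampling cost and accuracy.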

Title: Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Authors: Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17693
Pdf URL: https://arxiv.org/pdf/2603.17693
Copy Paste: [[2603.17693]] Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos(https://arxiv.org/abs/2603.17693)
Keywords: generation
Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.
摘要：从图像理解到视频理解的转变需要视觉语言模型（VLM）从识别静态模式转变为对时间动态（例如运动轨迹、速度变化和状态转换）进行推理。然而，当前的后训练方法由于两个关键限制而存在不足：（1）现有数据集通常缺乏时间中心性，可以从孤立的关键帧中推断出答案，而不需要整体时间集成；（2）专有模型生成的训练数据包含基本时间感知的系统错误，例如混淆运动方向或误判速度。我们介绍 SynRL，这是一个训练后框架，用于教授模型时间原语，即时间理解的基本构建块，包括方向、速度和状态跟踪。我们的主要见解是，这些从以编程方式生成的合成视频中学习的抽象原语可以有效地转移到现实世界的场景中。我们将时间理解分解为短期感知基元（速度、方向）和长期认知基元，通过基于代码的视频生成构建具有真实帧级注释的 7.7K CoT 和 7K RL 样本。尽管使用简单的几何形状进行训练，SynRL 在涵盖时间基础、复杂推理和一般视频理解的 15 个基准测试中取得了实质性改进。值得注意的是，我们的 7.7K 合成 CoT 样本的性能优于具有 165K 真实世界样本的 Video-R1。我们将此归因于基本的时间技能，例如逐帧跟踪变化和比较速度，这些技能可以有效地从抽象的合成模式转移到复杂的现实世界场景。这为视频后训练建立了一个新的范例：通过精心设计的合成数据进行视频时间学习提供了更具成本效益的扩展路径。

Title: DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation

Authors: Yuhe Tian, Kun Zhang, Haoran Ma, Rui Yan, Yingtai Li, Rongsheng Wang, Shaohua Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17718
Pdf URL: https://arxiv.org/pdf/2603.17718
Copy Paste: [[2603.17718]] DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation(https://arxiv.org/abs/2603.17718)
Keywords: generation
Abstract: While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at this https URL.
摘要：虽然大型语言模型 (LLM) 具有先进的 CT 报告生成功能，但现有方法通常对 3D 体积进行整体编码，无法区分信息线索和冗余解剖背景。受放射认知减法的启发，我们提出了差异视觉提示（DiffVP），它根据明确的、高级语义扫描到参考差异而不是仅仅根据绝对视觉特征来生成报告。 DiffVP 采用分层差异提取器将互补的全局和局部语义差异捕获到共享潜在空间中，并使用差异提示生成器将这些信号转换为可学习的视觉前缀标记以进行 LLM 调节。这些差异提示充当结构化条件信号，隐式抑制不变的解剖结构，同时放大诊断相关的视觉证据，从而促进准确的报告生成，而无需明确的病变定位。在两个大规模基准测试中，DiffVP 始终优于之前的方法，将平均 BLEU-1-4 分别提高了 +10.98 和 +4.36，并进一步提高了 RadGenome-ChestCT 的临床疗效（F1 得分 0.421）。所有代码都将在此 https URL 发布。

Title: TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos

Authors: Yan Zeng, Haoran Jiang, Kaixin Yao, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17735
Pdf URL: https://arxiv.org/pdf/2603.17735
Copy Paste: [[2603.17735]] TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos(https://arxiv.org/abs/2603.17735)
Keywords: generation
Abstract: Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.
摘要：自动为无纹理 3D 模型生成逼真且一致的外观是数字内容创建中的一项关键挑战。大规模视频生成模型的进步提供了一种自然的方法：直接合成360度转盘视频（TTV），它不仅可以用作高质量的动态预览，还可以作为驱动纹理合成和神经渲染的中间表示。然而，现有的通用视频扩散模型很难在整个视图范围内保持严格的几何一致性和外观稳定性，从而使其输出不适合高质量的 3D 重建。为此，我们引入了 TAPESTRY，一个用于生成以显式 3D 几何为条件的高保真 TTV 的框架。我们将 3D 外观生成任务重新定义为几何条件视频扩散问题：给定 3D 网格，我们首先渲染和编码多模态几何特征，以像素级精度约束视频生成过程，从而能够创建高质量且一致的 TTV。在此基础上，我们还设计了一种从 TTV 输入进行下游重建任务的方法，具有具有 3D 感知修复功能的多级管道。通过旋转模型并执行上下文感知的二次生成，该管道有效地完成了自遮挡区域，以实现完整的表面覆盖。 TAPESTRY 生成的视频不仅是高质量的动态预览，而且还可以作为可靠的 3D 感知中间表示，可以无缝反投影到 UV 纹理或用于监督 3DGS 等神经渲染方法。这使得能够从无纹理的网格自动创建可用于生产的完整 3D 资源。实验结果表明，我们的方法在视频一致性和最终重建质量方面均优于现有方法。

Title: Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems

Authors: Qi Liu, Laure Zanna, Joan Bruna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17750
Pdf URL: https://arxiv.org/pdf/2603.17750
Copy Paste: [[2603.17750]] Towards Infinitely Long Neural Simulations: Self-Refining Neural Surrogate Models for Dynamical Systems(https://arxiv.org/abs/2603.17750)
Keywords: generation
Abstract: Recent advances in autoregressive neural surrogate models have enabled orders-of-magnitude speedups in simulating dynamical systems. However, autoregressive models are generally prone to distribution drift: compounding errors in autoregressive rollouts that severely degrade generation quality over long time horizons. Existing work attempts to address this issue by implicitly leveraging the inherent trade-off between short-time accuracy and long-time consistency through hyperparameter tuning. In this work, we introduce a unifying mathematical framework that makes this tradeoff explicit, formalizing and generalizing hyperparameter-based strategies in existing approaches. Within this framework, we propose a robust, hyperparameter-free model implemented as a conditional diffusion model that balances short-time fidelity with long-time consistency by construction. Our model, Self-refining Neural Surrogate model (SNS), can be implemented as a standalone model that refines its own autoregressive outputs or as a complementary model to existing neural surrogates to ensure long-time consistency. We also demonstrate the numerical feasibility of SNS through high-fidelity simulations of complex dynamical systems over arbitrarily long time horizons.

Title: ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Authors: Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.17812
Pdf URL: https://arxiv.org/pdf/2603.17812
Copy Paste: [[2603.17812]] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation(https://arxiv.org/abs/2603.17812)
Keywords: super-resolution, generation
Abstract: Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
摘要：最近的视频扩散模型通过循环帧处理实现高质量生成，其中每个帧的生成都取决于先前的帧。然而，这种循环机制意味着在像素域中训练此类模型会产生过高的内存成本，因为激活会在整个视频序列中累积。对于长视频或高分辨率视频来说，这一基本限制还使得通过像素损失来微调这些模型在计算上变得困难。本文介绍了 ChopGrad，一种用于视频解码的截断反向传播方案，将梯度计算限制在局部帧窗口，同时保持全局一致性。我们提供了这种近似的理论分析，并表明它可以通过逐帧损失进行有效的微调。 ChopGrad 将训练内存从随视频帧数量线性缩放（完全反向传播）减少到恒定内存，并且在一系列具有像素损失的条件视频生成任务中与现有最先进的视频扩散模型相比，包括视频超分辨率、视频修复、神经渲染场景的视频增强和受控驾驶视频生成。
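Note: The idea of limiting gradient computation to a local frame window can be illustrated on a toy recurrence. The following is a minimal sketch of generic truncated backpropagation, not of ChopGrad or a video decoder: the scalar recurrence h_t = a * h_{t-1} + x_t, the parameter `a`, and the function names are all assumptions chosen so the gradient can be written in closed form. The full gradient of the final state sums contributions from every earlier frame, while the truncated version keeps only the most recent `window` terms.

```python
def forward(a, inputs):
    """Run the scalar recurrence h_t = a * h_{t-1} + x_t, starting from h_0 = 0."""
    states = [0.0]
    for x in inputs:
        states.append(a * states[-1] + x)
    return states


def grad_wrt_a(a, states, window):
    """Gradient of the final state with respect to `a`, using at most `window` steps.

    Unrolling gives d h_T / d a = sum_j a**j * h_{T-1-j}; truncation keeps
    only the first `window` terms, i.e. the most recent frames.
    """
    last = len(states) - 1
    grad, coeff = 0.0, 1.0
    for j in range(min(window, last)):
        grad += coeff * states[last - 1 - j]
        coeff *= a
    return grad


a = 0.5
frames = [1.0] * 16          # hypothetical per-frame inputs
states = forward(a, frames)
full = grad_wrt_a(a, states, window=len(frames))
truncated = grad_wrt_a(a, states, window=4)
print(full, truncated)
```

In this toy case the discarded terms decay geometrically with distance from the final frame, so a short window already recovers most of the gradient. Whether a similar decay holds for a real video decoder is what the paper's theoretical analysis would need to establish; the sketch does not demonstrate the memory savings themselves.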

Title: Symmetry-Reduced Physics-Informed Learning of Tensegrity Dynamics

Authors: Jing Qin, Muhao Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17824
Pdf URL: https://arxiv.org/pdf/2603.17824
Copy Paste: [[2603.17824]] Symmetry-Reduced Physics-Informed Learning of Tensegrity Dynamics(https://arxiv.org/abs/2603.17824)
Keywords: generation
Abstract: Tensegrity structures possess intrinsic geometric symmetries that govern their dynamic behavior. However, most existing physics-informed neural network (PINN) approaches for tensegrity dynamics do not explicitly exploit these symmetries, leading to high computational complexity and unstable optimization. In this work, we propose a symmetry-reduced physics-informed neural network (SymPINN) framework that embeds group-theory-based symmetry directly into both the solution expression and the neural network architecture to predict tensegrity dynamics. By decomposing nodes into symmetry orbits and representing free nodal coordinates using a symmetry basis, the proposed method constructs a reduced coordinate representation that preserves geometric symmetry of the structure. The full coordinates are then recovered via symmetry transformations of the reduced solution learned by the network, ensuring that the predicted configurations automatically satisfy the symmetry constraints. In this framework, equivariance is enforced through orbit-based coordinate generation, symmetry-consistent message passing, and physics residual constraints. In addition, SymPINN improves training effectiveness by encoding initial conditions as hard constraints, incorporating Fourier feature encoding to enhance the representation of dynamic motions, and employing a two-stage optimization strategy. Extensive numerical experiments on symmetric T-bars and lander structures demonstrate significantly improved prediction accuracy and computational efficiency compared to standard physics-informed models, indicating the great potential of symmetry-aware learning for structure-preserving modeling of tensegrity dynamics.
摘要：张拉整体结构具有控制其动态行为的内在几何对称性。然而，大多数现有的用于张拉整体动力学的物理信息神经网络（PINN）方法没有明确利用这些对称性，导致计算复杂性高和优化不稳定。在这项工作中，我们提出了一种对称性降低的物理信息神经网络（SymPINN）框架，它将基于群论的对称性直接嵌入到解表达式和神经网络架构中，以预测张拉整体动力学。通过将节点分解为对称轨道并使用对称基表示自由节点坐标，所提出的方法构造了保留结构几何对称性的简化坐标表示。然后通过网络学习的简化解的对称变换来恢复完整的坐标，确保预测的配置自动满足对称约束。在此框架中，通过基于轨道的坐标生成、对称一致的消息传递和物理残差约束来强制执行等变性。此外，SymPINN 通过将初始条件编码为硬约束、结合傅里叶特征编码来增强动态运动的表示以及采用两阶段优化策略来提高训练效果。对对称 T 形杆和着陆器结构的大量数值实验表明，与标准物理模型相比，预测精度和计算效率显着提高，表明对称感知学习在张拉整体动力学结构保持建模方面具有巨大潜力。
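Note: The recovery of full coordinates from a reduced, orbit-based representation can be illustrated with a cyclic symmetry group. The following is a minimal sketch under stated assumptions, not the SymPINN architecture: the structure is assumed to have a rotational symmetry of a given order about the z-axis, one representative node is stored per orbit, and the function names and node coordinates are hypothetical. Expanding each representative by the group's rotations yields a configuration that satisfies the symmetry by construction.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def expand_orbit(representative, order):
    """Recover all nodes in one orbit of the cyclic group C_order from one node."""
    step = 2.0 * math.pi / order
    return [rotate_z(representative, k * step) for k in range(order)]


def is_symmetric(nodes, order, tol=1e-9):
    """Check that rotating by 2*pi/order maps the node set onto itself."""
    step = 2.0 * math.pi / order
    for p in nodes:
        q = rotate_z(p, step)
        if not any(math.dist(q, r) < tol for r in nodes):
            return False
    return True


# Hypothetical reduced coordinates: one node per orbit of a 3-fold symmetric structure.
reduced = [(1.0, 0.0, 0.0), (0.6, 0.2, 1.5)]
full = [node for rep in reduced for node in expand_orbit(rep, order=3)]
print(len(full), is_symmetric(full, order=3))
```

A network that predicts only the reduced coordinates has fewer outputs to learn, and any prediction it makes expands to a symmetric configuration, which is the sense in which the constraint is satisfied automatically.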

Title: Steering Video Diffusion Transformers with Massive Activations

Authors: Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17825
Pdf URL: https://arxiv.org/pdf/2603.17825
Copy Paste: [[2603.17825]] Steering Video Diffusion Transformers with Massive Activations(https://arxiv.org/abs/2603.17825)
Keywords: generation
Abstract: Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
摘要：尽管视频扩散 Transformer 取得了快速进展，但如何以最小的开销利用其内部模型信号来提高视频生成质量仍然尚未得到充分探索。在这项工作中，我们研究了大规模激活（MA）的作用，这是视频扩散 Transformer 中罕见的高幅度隐藏状态尖峰。我们观察到，MA 在所有视觉标记中一致出现，具有清晰的幅度层次结构：第一帧标记表现出最大的 MA 幅度，潜在帧边界标记（潜在空间中每个时间块的头部和尾部）显示出比第一帧升高但略低的 MA 幅度，每个潜在帧内的内部标记保持升高，但幅度相对适中。这种结构化模式表明，该模型隐式优先考虑与潜在空间中的时间分块对齐的标记位置。基于这一观察，我们提出了结构化激活转向（STAS），这是一种免训练的类似自引导的方法，可将第一帧和边界标记的 MA 值引导至缩放的全局最大参考幅度。 STAS 在不同的文本到视频模型中实现了视频质量和时间一致性方面的持续改进，同时引入的计算开销可以忽略不计。

Title: TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Authors: Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17828
Pdf URL: https://arxiv.org/pdf/2603.17828
Copy Paste: [[2603.17828]] TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models(https://arxiv.org/abs/2603.17828)
Keywords: generative
Abstract: Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.
摘要：尽管文本到图像的扩散模型表现出非凡的生成能力，但概念擦除技术对于其安全部署以防止有害内容的创建至关重要。这促进了擦除防御的发展与旨在绕过它们的对抗性探针之间的动态相互作用，并且这种共同进化逐渐增强了擦除方法的功效。然而，这种对抗性共同进化已经集中在一种狭隘的、以文本为中心的范式上，该范式将擦除等同于切断文本到图像的映射，而忽略了与不需要的概念相关的底层视觉知识仍然存在。为了证实这一说法，我们从视觉角度进行研究，利用 DDIM 反演来探究是否仍然可以找到被擦除概念的生成路径。然而，识别这样的视觉生成路径具有挑战性，因为标准文本引导的 DDIM 反转受到擦除模型中以文本为中心的防御的积极抵制。为了解决这个问题，我们引入了 TINA，一种新颖的无文本反转攻击，它通过在空文本条件下操作来强制执行这种仅视觉探测，从而避免现有的以文本为中心的防御。此外，TINA 集成了一个优化程序，以克服标准反演在没有通常文本指导的情况下运行时出现的累积近似误差。我们的实验表明，TINA 可以从经过最先进的遗忘处理的模型中重新生成被删除的概念。 TINA 的成功证明，当前的方法只是模糊了概念，凸显了对直接作用于内部视觉知识的范式的迫切需要。

Title: Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Authors: Chen Liyi, Wang Pengfei, Zhang Guowen, Ma Zhiyuan, Zhang Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17841
Pdf URL: https://arxiv.org/pdf/2603.17841
Copy Paste: [[2603.17841]] Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass(https://arxiv.org/abs/2603.17841)
Keywords: generative
Abstract: Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

Title: Revisiting foundation models for cell instance segmentation

Authors: Anwai Archit, Constantin Pape
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17845
Pdf URL: https://arxiv.org/pdf/2603.17845
Copy Paste: [[2603.17845]] Revisiting foundation models for cell instance segmentation(https://arxiv.org/abs/2603.17845)
Keywords: generation
Abstract: Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced; virtually all of them extend the Segment Anything Model (SAM), adapting it to microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for μSAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at this https URL.

Title: Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation

Authors: William Thorossian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.17855
Pdf URL: https://arxiv.org/pdf/2603.17855
Copy Paste: [[2603.17855]] Physics-Aware Machine Learning for Seismic and Volcanic Signal Interpretation(https://arxiv.org/abs/2603.17855)
Keywords: generative
Abstract: Modern seismic and volcanic monitoring is increasingly shaped by continuous, multi-sensor observations and by the need to extract actionable information from nonstationary, noisy wavefields. In this context, machine learning has moved from a research curiosity to a practical ingredient of processing chains for detection, phase picking, classification, denoising, and anomaly tracking. However, improved accuracy on a fixed dataset is not sufficient for operational use. Models must remain reliable under domain shift (new stations, changing noise, evolving volcanic activity), provide uncertainty that supports decision-making, and connect their outputs to physically meaningful constraints. This paper surveys and organizes recent ML approaches for seismic and volcanic signal analysis, highlighting where classical signal processing provides indispensable inductive bias, how self-supervision and generative modeling can reduce dependence on labels, and which evaluation protocols best reflect transfer across regions. We conclude with open challenges for robust, interpretable, and maintainable AI-assisted monitoring.

Title: Procedural Generation of Algorithm Discovery Tasks in Machine Learning

Authors: Alexander D. Goldie, Zilin Wang, Adrian Hayler, Deepak Nathani, Edan Toledo, Ken Thampiratwong, Aleksandra Kalisz, Michael Beukman, Alistair Letcher, Shashank Reddy, Clarisse Wibault, Theo Wolf, Charles O'Neill, Uljad Berdica, Nicholas Roberts, Saeed Rahmani, Hannah Erlebach, Roberta Raileanu, Shimon Whiteson, Jakob N. Foerster
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17863
Pdf URL: https://arxiv.org/pdf/2603.17863
Copy Paste: [[2603.17863]] Procedural Generation of Algorithm Discovery Tasks in Machine Learning(https://arxiv.org/abs/2603.17863)
Keywords: generation
Abstract: Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. These suites suffer from many issues, such as poor evaluation methodologies, data contamination, and saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at this https URL.
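Example (illustrative only): the abstract says tasks are specified by a small number of configuration parameters. The sketch below shows how a handful of such parameters can span a task grid. The axis names and values (`domain`, `component`, `dataset`, `budget`) are hypothetical and are not taken from DiscoGen.

```python
from itertools import product

# Hypothetical configuration axes; none of these names come from DiscoGen.
CONFIG_SPACE = {
    "domain": ["reinforcement_learning", "image_classification"],
    "component": ["optimiser", "loss_function"],
    "dataset": ["small", "medium", "large"],
    "budget": [1, 10, 100],
}


def enumerate_tasks(space):
    """Yield one task configuration (a dict) per point in the grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


tasks = list(enumerate_tasks(CONFIG_SPACE))
print(len(tasks))  # 2 * 2 * 3 * 3 = 36 configurations
print(tasks[0])
```

Each added axis multiplies the number of tasks, which is how a compact configuration can describe a very large task space.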

Title: Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

Authors: Podakanti Satyajith Chary, Nagarajan Ganapathy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.17879
Pdf URL: https://arxiv.org/pdf/2603.17879
Copy Paste: [[2603.17879]] Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification(https://arxiv.org/abs/2603.17879)
Keywords: generation
Abstract: This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
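Example (illustrative only): the temporal post-processing described above (median-filter smoothing, then gap merging before event-level output) could look roughly like the sketch below. The window size, the gap threshold, the edge handling, and the JSON keys are assumptions. The abstract does not specify them.

```python
import json
from statistics import median


def median_smooth(labels, window=3):
    """Sliding median filter over 0/1 frame labels (window shrinks at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        # A tie (median of 0.5 in an even-sized edge window) counts as 0.
        smoothed.append(1 if median(labels[lo:hi]) > 0.5 else 0)
    return smoothed


def frames_to_events(labels, max_gap=2):
    """Convert 0/1 frame labels to [start, end] events, merging gaps <= max_gap frames."""
    events, start = [], None
    for i, value in enumerate(labels):
        if value and start is None:
            start = i
        elif not value and start is not None:
            events.append([start, i - 1])
            start = None
    if start is not None:
        events.append([start, len(labels) - 1])

    merged = []
    for event in events:
        if merged and event[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = event[1]  # close the short gap
        else:
            merged.append(event)
    return merged


raw = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
smoothed = median_smooth(raw)
events = frames_to_events(smoothed)
# Hypothetical event schema, for illustration only.
print(json.dumps([{"start": s, "end": e} for s, e in events]))
```

In this toy sequence the one-frame dropout at index 3 is filled and the isolated positive at index 11 is removed, leaving a single event.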

Title: Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Authors: Yingjie Chen, Shilun Lin, Cai Xing, Qixin Yan, Wenjing Wang, Dingming Liu, Hao Liu, Chen Li, Jing Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17889
Pdf URL: https://arxiv.org/pdf/2603.17889
Copy Paste: [[2603.17889]] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation(https://arxiv.org/abs/2603.17889)
Keywords: generation
Abstract: Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage, Identity-as-Presence (this https URL).

Title: A Creative Agent is Worth a 64-Token Template

Authors: Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17895
Pdf URL: https://arxiv.org/pdf/2603.17895
Copy Paste: [[2603.17895]] A Creative Agent is Worth a 64-Token Template(https://arxiv.org/abs/2603.17895)
Keywords: generation
Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes "creativity" costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents' intrinsic understanding of "creativity" through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.

Title: SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Authors: Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17920
Pdf URL: https://arxiv.org/pdf/2603.17920
Copy Paste: [[2603.17920]] SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale(https://arxiv.org/abs/2603.17920)
Keywords: generation
Abstract: Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at this https URL.
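Example (illustrative only): the reprojection step of a 2D-3D-2D pipeline can be sketched with a pinhole camera model and a z-buffer. This is a generic sketch, not the SegFly implementation. It assumes the labelled points are already expressed in the camera frame, and the function name and intrinsics are hypothetical.

```python
def reproject_labels(points, labels, fx, fy, cx, cy, width, height):
    """Project labelled 3D points (camera coordinates) into a 2D label image.

    A z-buffer keeps the nearest point per pixel.
    Pixels that receive no point stay at -1 (unlabelled).
    """
    label_img = [[-1] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), label in zip(points, labels):
        if z <= 0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            label_img[v][u] = label
    return label_img


points = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
labels = [3, 7, 5, 9, 9]
img = reproject_labels(points, labels, fx=2.0, fy=2.0, cx=2.0, cy=2.0, width=5, height=5)
for row in img:
    print(row)
```

Here the second point is hidden behind the first at the same pixel, and the last two points are discarded because they lie behind the camera or outside the image. A real pipeline would also apply per-view extrinsics and densify the sparse projections.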

Title: TransText: Transparency Aware Image-to-Video Typography Animation

Authors: Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17944
Pdf URL: https://arxiv.org/pdf/2603.17944
Copy Paste: [[2603.17944]] TransText: Transparency Aware Image-to-Video Typography Animation(https://arxiv.org/abs/2603.17944)
Keywords: generative
Abstract: We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.

Title: LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Authors: Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17965
Pdf URL: https://arxiv.org/pdf/2603.17965
Copy Paste: [[2603.17965]] LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition(https://arxiv.org/abs/2603.17965)
Keywords: generation
Abstract: Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).

Title: Versatile Editing of Video Content, Actions, and Dynamics without Training

Authors: Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17989
Pdf URL: https://arxiv.org/pdf/2603.17989
Copy Paste: [[2603.17989]] Versatile Editing of Video Content, Actions, and Dynamics without Training(https://arxiv.org/abs/2603.17989)
Keywords: generation
Abstract: Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting content that should affect the behavior of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources of these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.

Title: LoST: Level of Semantics Tokenization for 3D Shapes

Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.17995
Pdf URL: https://arxiv.org/pdf/2603.17995
Copy Paste: [[2603.17995]] LoST: Level of Semantics Tokenization for 3D Shapes(https://arxiv.org/abs/2603.17995)
Keywords: generation, generative
Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

Title: The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Authors: Yigit Ekin, Yossi Gandelsman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.17998
Pdf URL: https://arxiv.org/pdf/2603.17998
Copy Paste: [[2603.17998]] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering(https://arxiv.org/abs/2603.17998)
Keywords: generation, generative
Abstract: We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no edit) and over-steering (changing other attributes). Adding scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives and outperforms other training-free methods.
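Example (illustrative only): the steering idea above can be sketched with toy vectors. The abstract does not say how the steering vector is computed from the contrastive pairs. Averaging the per-pair differences is an assumption here, the two-dimensional "embeddings" are made up, and the magnitude interval is picked by hand rather than by the paper's elastic range search.

```python
def mean_difference(pairs):
    """Assumed steering vector: mean of (positive - negative) over prompt pairs."""
    dim = len(pairs[0][0])
    vec = [0.0] * dim
    for pos, neg in pairs:
        for i in range(dim):
            vec[i] += (pos[i] - neg[i]) / len(pairs)
    return vec


def steer(embedding, vec, alpha):
    """Add the steering vector, scaled by alpha, to a prompt embedding."""
    return [e + alpha * v for e, v in zip(embedding, vec)]


# Toy stand-ins for text-encoder outputs of contrastive prompt pairs.
pairs = [
    ([1.0, 2.0], [0.0, 2.0]),
    ([3.0, 1.0], [2.0, 0.0]),
]
vec = mean_difference(pairs)

prompt = [0.0, 0.0]
# Hand-picked magnitudes; the paper selects the interval automatically.
sweep = [steer(prompt, vec, alpha) for alpha in (0.0, 0.5, 1.0)]
print(vec)
print(sweep)
```

Sweeping `alpha` over an interval produces a sequence of embeddings that move along one direction, which is what yields a graded edit when each embedding is passed to the generator.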

Title: EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

Authors: Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.18001
Pdf URL: https://arxiv.org/pdf/2603.18001
Copy Paste: [[2603.18001]] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding(https://arxiv.org/abs/2603.18001)
Keywords: generation
Abstract: In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
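Example (illustrative only): a cycle-consistency reward for layout to image to grounding can be sketched as the agreement between the input boxes and the boxes recovered by grounding the generated image. The abstract does not define the reward. Using mean box IoU with boxes matched by position is an assumption, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cycle_reward(layout, regrounded):
    """Assumed reward: mean IoU between input boxes and re-grounded boxes.

    Boxes are matched by position in the list. No image-level supervision
    is needed, only the layout that started the cycle.
    """
    if not layout:
        return 0.0
    return sum(iou(a, b) for a, b in zip(layout, regrounded)) / len(layout)


layout = [(0, 0, 2, 2), (0, 0, 4, 4)]
regrounded = [(0, 0, 2, 2), (0, 0, 2, 4)]  # second box recovered at half width
reward = cycle_reward(layout, regrounded)
print(reward)
```

A scalar of this kind could then be fed to a policy-gradient method such as GRPO, which the abstract names as the optimization strategy.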