2026-03-27

Title: UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

Authors: Yicheng Xu, Jiangning Zhang, Zhucun Xue, Teng Hu, Ran Yi, Xiaobin Hu, Yong Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24690
Pdf URL: https://arxiv.org/pdf/2603.24690
Copy Paste: [[2603.24690]] UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy(https://arxiv.org/abs/2603.24690)
Keywords: generation
Abstract: In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at this https URL.

Title: Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Authors: Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2603.24709
Pdf URL: https://arxiv.org/pdf/2603.24709
Copy Paste: [[2603.24709]] Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards(https://arxiv.org/abs/2603.24709)
Keywords: generation
Abstract: Multi-step tool orchestration, where LLMs must invoke multiple dependent APIs in the correct order while propagating intermediate outputs, remains challenging. State-of-the-art models frequently fail on full sequence execution, with parameter value errors accounting for a significant portion of failures. Training models to handle such workflows faces two obstacles: existing environments focus on simple per-turn function calls with simulated data, and binary rewards provide no signal for partial correctness. We present a framework addressing both challenges. First, we construct a reinforcement learning environment backed by a large-scale cache of real API responses, enabling a data synthesis pipeline that samples valid multi-step orchestration traces with controllable complexity and significantly higher generation efficiency than unconstrained methods. Second, we propose a graduated reward design that decomposes correctness into atomic validity (individual function call correctness at increasing granularity) and orchestration (correct tool sequencing with dependency respect). On ComplexFuncBench, our approach demonstrates substantial improvements in turn accuracy. Ablation studies confirm both reward components are essential: using either alone significantly degrades performance.
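Illustrative sketch (not from the paper): the abstract splits correctness into atomic validity, scored at increasing granularity, and orchestration, scored on dependency-respecting order. One way such a graduated reward could look is shown below. The three granularity levels (function name, argument names, argument values), the equal weighting, and the helper names `atomic_validity`, `orchestration`, and `graduated_reward` are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Any

Call = dict[str, Any]  # hypothetical call format: {"name": str, "args": dict}


def atomic_validity(pred: Call, gold: Call) -> float:
    """Score one call at increasing granularity: name, argument names, argument values."""
    if pred["name"] != gold["name"]:
        return 0.0
    score = 1.0 / 3.0  # level 1: correct function name
    gold_args, pred_args = gold["args"], pred["args"]
    if set(pred_args) == set(gold_args):
        score += 1.0 / 3.0  # level 2: correct argument names
        if gold_args:
            matched = sum(pred_args[k] == v for k, v in gold_args.items())
            score += (1.0 / 3.0) * matched / len(gold_args)  # level 3: values
        else:
            score += 1.0 / 3.0
    return score


def orchestration(pred_names: list[str], deps: dict[str, set[str]]) -> float:
    """Fraction of predicted calls whose prerequisite tools were all called earlier."""
    if not pred_names:
        return 0.0
    seen: set[str] = set()
    ok = 0
    for name in pred_names:
        if deps.get(name, set()) <= seen:
            ok += 1
        seen.add(name)
    return ok / len(pred_names)


def graduated_reward(pred: list[Call], gold: list[Call],
                     deps: dict[str, set[str]], w_atomic: float = 0.5) -> float:
    """Weighted sum of the two components (the weight is an assumed value)."""
    n = max(len(pred), len(gold))
    atomic = sum(atomic_validity(p, g) for p, g in zip(pred, gold)) / n if n else 0.0
    orch = orchestration([c["name"] for c in pred], deps)
    return w_atomic * atomic + (1.0 - w_atomic) * orch
```

Unlike a binary reward, a trace with the right tools in the right order but one wrong parameter value still scores above a trace that calls the tools out of order.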

Title: Lookalike3D: Seeing Double in 3D

Authors: Chandan Yeshwanth, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24713
Pdf URL: https://arxiv.org/pdf/2603.24713
Copy Paste: [[2603.24713]] Lookalike3D: Seeing Double in 3D(https://arxiv.org/abs/2603.24713)
Keywords: generation
Abstract: 3D object understanding and generation methods produce impressive results, yet they often overlook a pervasive source of information in real-world scenes: repeated objects. We introduce the task of lookalike object detection in indoor scenes, which leverages repeated and complementary cues from identical and near-identical object pairs. Given an input scene, the task is to classify pairs of objects as identical, similar or different using multiview images as input. To address this, we present Lookalike3D, a multiview image transformer that effectively distinguishes such object pairs by harnessing strong semantic priors from large image foundation models. To support this task, we collected the 3DTwins dataset, containing 76k manually annotated identical, similar and different pairs of objects based on ScanNet++, and show an improvement of 104% IoU over baselines. We demonstrate how our method improves downstream tasks such as enabling joint 3D object reconstruction and part co-segmentation, turning repeated and lookalike objects into a powerful cue for consistent, high-quality 3D perception. Our code, dataset and models will be made publicly available.

Title: A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception

Authors: Yuqi Hu, Vasha DuTell, Ahna R. Girshick, Jennifer E. Corbett
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24730
Pdf URL: https://arxiv.org/pdf/2603.24730
Copy Paste: [[2603.24730]] A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception(https://arxiv.org/abs/2603.24730)
Keywords: generative
Abstract: The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.
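Illustrative sketch (not from the paper): the abstract describes interpolating between two concepts in the CLIP embedding space to obtain a continuous spectrum. It does not say which interpolation scheme is used, so the spherical interpolation (slerp) below is an assumption, and the two-dimensional vectors stand in for real CLIP embeddings. The function names are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Spherical interpolation between two embeddings; t=0 gives a, t=1 gives b.

    Assumes the two embeddings are not antipodal (angle strictly below 180 degrees).
    """
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    if omega < 1e-8:  # practically identical directions
        return a
    s = math.sin(omega)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def spectrum(a: list[float], b: list[float], steps: int) -> list[list[float]]:
    """Evenly spaced embeddings from concept a to concept b (steps >= 2)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [slerp(a, b, i / (steps - 1)) for i in range(steps)]


# Toy stand-ins for the "duck" and "rabbit" text embeddings.
duck, rabbit = [1.0, 0.0], [0.0, 1.0]
ambiguous_series = spectrum(duck, rabbit, steps=5)
```

Each interpolated embedding would then condition an image generator, and the point along the series where a human or a classifier switches labels marks that observer's semantic boundary.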

Title: Contrastive Learning Boosts Deterministic and Generative Models for Weather Data

Authors: Nathan Bailey
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.24744
Pdf URL: https://arxiv.org/pdf/2603.24744
Copy Paste: [[2603.24744]] Contrastive Learning Boosts Deterministic and Generative Models for Weather Data(https://arxiv.org/abs/2603.24744)
Keywords: generative
Abstract: Weather data, comprising multiple variables, poses significant challenges due to its high dimensionality and multimodal nature. Creating low-dimensional embeddings requires compressing this data into a compact, shared latent space. This compression is required to improve the efficiency and performance of downstream tasks, such as forecasting or extreme-weather detection. Self-supervised learning, particularly contrastive learning, offers a way to generate low-dimensional, robust embeddings from unlabelled data, enabling downstream tasks when labelled data is scarce. Despite initial exploration of contrastive learning in weather data, particularly with the ERA5 dataset, the current literature does not extensively examine its benefits relative to alternative compression methods, notably autoencoders. Moreover, current work on contrastive learning does not investigate how these models can incorporate sparse data, which is more common in real-world data collection. It is critical to explore and understand how contrastive learning contributes to creating more robust embeddings for sparse weather data, thereby improving performance on downstream tasks. Our work extensively explores contrastive learning on the ERA5 dataset, aligning sparse samples with complete ones via a contrastive loss term to create SPARse-data augmented conTRAstive spatiotemporal embeddings (SPARTA). We introduce a temporally aware batch sampling strategy and a cycle-consistency loss to improve the structure of the latent space. Furthermore, we propose a novel graph neural network fusion technique to inject domain-specific physical knowledge. Ultimately, our results demonstrate that contrastive learning is a feasible and advantageous compression method for sparse geoscience data, thereby enhancing performance in downstream tasks.
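Illustrative sketch (not from the paper): the abstract aligns sparse samples with complete ones through a contrastive loss term. A common choice for such a term is InfoNCE, shown here in plain Python. Whether SPARTA uses exactly this form, the temperature value, and the function names are assumptions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)


def info_nce(sparse: list[list[float]], complete: list[list[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss where sparse[i] and complete[i] form the positive pair.

    All other complete[j] in the batch act as negatives for sparse[i].
    """
    n = len(sparse)
    total = 0.0
    for i in range(n):
        logits = [cosine(sparse[i], complete[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / n
```

The loss is small when each sparse embedding sits closest to the embedding of its own complete sample and grows when it sits closer to another sample in the batch.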

Title: Synthetic Cardiac MRI Image Generation using Deep Generative Models

Authors: Ishan Kumarasinghe, Dasuni Kawya, Madhura Edirisooriya, Isuri Devindi, Isuru Nawinne, Vajira Thambawita
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24764
Pdf URL: https://arxiv.org/pdf/2603.24764
Copy Paste: [[2603.24764]] Synthetic Cardiac MRI Image Generation using Deep Generative Models(https://arxiv.org/abs/2603.24764)
Keywords: generation, generative
Abstract: Synthetic cardiac MRI (CMRI) generation has emerged as a promising strategy to overcome the scarcity of annotated medical imaging data. Recent advances in GANs, VAEs, diffusion probabilistic models, and flow-matching techniques aim to generate anatomically accurate images while addressing challenges such as limited labeled datasets, vendor variability, and risks of privacy leakage through model memorization. Mask-conditioned generation improves structural fidelity by guiding synthesis with segmentation maps, while diffusion and flow-matching models offer strong boundary preservation and efficient deterministic transformations. Cross-domain generalization is further supported through vendor-style conditioning and preprocessing steps like intensity normalization. To ensure privacy, studies increasingly incorporate membership inference attacks, nearest-neighbor analyses, and differential privacy mechanisms. Utility evaluations commonly measure downstream segmentation performance, with evidence showing that anatomically constrained synthetic data can enhance accuracy and robustness across multi-vendor settings. This review aims to compare existing CMRI generation approaches through the lenses of fidelity, utility, and privacy, highlighting current limitations and the need for integrated, evaluation-driven frameworks for reliable clinical workflows.

Title: AVControl: Efficient Framework for Training Audio-Visual Controls

Authors: Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2603.24793
Pdf URL: https://arxiv.org/pdf/2603.24793
Copy Paste: [[2603.24793]] AVControl: Efficient Framework for Training Audio-Visual Controls(https://arxiv.org/abs/2603.24793)
Keywords: generation
Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.

Title: Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

Authors: Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24800
Pdf URL: https://arxiv.org/pdf/2603.24800
Copy Paste: [[2603.24800]] Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration(https://arxiv.org/abs/2603.24800)
Keywords: generation, generative
Abstract: In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.
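Illustrative sketch (not from the paper): the abstract frames calibration as black-box reward optimization over roughly 100 scaling parameters, solved with an evolutionary algorithm. The simple (1+lambda) evolution strategy below shows the general idea on a toy reward. The specific algorithm, the population size, the mutation scale, and `toy_reward` are assumptions; in practice the reward would come from scoring images generated with the candidate scales.

```python
import math
import random
from typing import Callable


def evolve_scales(reward_fn: Callable[[list[float]], float], n_params: int = 100,
                  generations: int = 50, population: int = 16,
                  sigma: float = 0.01, seed: int = 0) -> tuple[list[float], float]:
    """(1+lambda) evolution strategy: mutate the parent, keep the best child if it improves."""
    rng = random.Random(seed)
    parent = [1.0] * n_params  # start from the uncalibrated model (all scales = 1)
    parent_reward = reward_fn(parent)
    for _ in range(generations):
        children = [[s + rng.gauss(0.0, sigma) for s in parent]
                    for _ in range(population)]
        rewards = [reward_fn(c) for c in children]
        best = max(range(population), key=rewards.__getitem__)
        if rewards[best] > parent_reward:  # elitist: never accept a worse candidate
            parent, parent_reward = children[best], rewards[best]
    return parent, parent_reward


# Toy black-box reward: closeness to a hidden vector of "ideal" scales.
_hidden = [1.0 + 0.1 * math.sin(i) for i in range(100)]


def toy_reward(scales: list[float]) -> float:
    return -sum((s - t) ** 2 for s, t in zip(scales, _hidden))


calibrated, reward = evolve_scales(toy_reward)
```

The optimizer only needs reward values, not gradients, which is why this style of search suits a non-differentiable quality metric.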

Title: Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting

Authors: Alabi Mehzabin Anisha, Guangjing Wang, Sriram Chellappan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24821
Pdf URL: https://arxiv.org/pdf/2603.24821
Copy Paste: [[2603.24821]] Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting(https://arxiv.org/abs/2603.24821)
Keywords: generative
Abstract: State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (i.e., across both density map-based models and point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a 7X increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from 0.55 to 1.69. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies. The source code is available at this https URL

Title: DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation

Authors: Junyi Ouyang, Wenbin Teng, Gonglin Chen, Yajie Zhao, Haiwei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24835
Pdf URL: https://arxiv.org/pdf/2603.24835
Copy Paste: [[2603.24835]] DCARL: A Divide-and-Conquer Framework for Autoregressive Long-Trajectory Video Generation(https://arxiv.org/abs/2603.24835)
Keywords: generation
Abstract: Long-trajectory video generation is a crucial yet challenging task for world modeling primarily due to the limited scalability of existing video diffusion models (VDMs). Autoregressive models, while offering infinite rollout, suffer from visual drift and poor controllability. To address these issues, we propose DCARL, a novel divide-and-conquer, autoregressive framework that effectively combines the structural stability of the divide-and-conquer scheme with the high-fidelity generation of VDMs. Our approach first employs a dedicated Keyframe Generator trained without temporal compression to establish long-range, globally consistent structural anchors. Subsequently, an Interpolation Generator synthesizes the dense frames in an autoregressive manner with overlapping segments, utilizing the keyframes for global context and a single clean preceding frame for local coherence. Trained on a large-scale internet long trajectory video dataset, our method achieves superior performance in both visual quality (lower FID and FVD) and camera adherence (lower ATE and ARE) compared to state-of-the-art autoregressive and divide-and-conquer baselines, demonstrating stable and high-fidelity generation for long trajectory videos up to 32 seconds in length.
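Illustrative sketch (not from the paper): the abstract describes two stages, sparse keyframes as global anchors followed by autoregressive interpolation over overlapping segments, each conditioned on a single preceding frame. The planner below only lays out which frames each stage would cover. The keyframe stride, the segment length, the one-frame overlap, and the function name `plan_generation` are assumptions chosen for illustration.

```python
def plan_generation(total_frames: int, keyframe_stride: int, segment_len: int):
    """Return keyframe indices and the overlapping segments that fill in between them."""
    if total_frames < 2 or segment_len < 2 or keyframe_stride < 1:
        raise ValueError("need total_frames >= 2, segment_len >= 2, keyframe_stride >= 1")
    # Stage 1: sparse keyframes spanning the whole trajectory (always include the last frame).
    keyframes = list(range(0, total_frames, keyframe_stride))
    if keyframes[-1] != total_frames - 1:
        keyframes.append(total_frames - 1)
    # Stage 2: dense segments, each starting on the final frame of the previous one.
    segments = []
    start = 0
    while start < total_frames - 1:
        end = min(start + segment_len - 1, total_frames - 1)
        segments.append({
            "frames": (start, end),
            # the single preceding frame used for local coherence (none for the first segment)
            "context_frame": start if segments else None,
            # bracketing keyframes used as global context
            "keyframes": (max(k for k in keyframes if k <= start),
                          min(k for k in keyframes if k >= end)),
        })
        start = end
    return keyframes, segments


keyframes, segments = plan_generation(total_frames=17, keyframe_stride=8, segment_len=5)
```

Because every segment is pinned to keyframes generated up front, errors in one segment cannot accumulate indefinitely the way they can in a purely autoregressive rollout.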

Title: Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Authors: Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2603.24844
Pdf URL: https://arxiv.org/pdf/2603.24844
Copy Paste: [[2603.24844]] Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models(https://arxiv.org/abs/2603.24844)
Keywords: generative
Abstract: Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at this https URL.

Title: OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Authors: Xiaoyu Tang, Jun Dong, Jintao Cheng, Rui Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24876
Pdf URL: https://arxiv.org/pdf/2603.24876
Copy Paste: [[2603.24876]] OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding(https://arxiv.org/abs/2603.24876)
Keywords: generative
Abstract: Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves SOTA performance on both OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.

Title: Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML

Authors: Yassien Shaalan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.24916
Pdf URL: https://arxiv.org/pdf/2603.24916
Copy Paste: [[2603.24916]] Once-for-All Channel Mixers (HYPERTINYPW): Generative Compression for TinyML(https://arxiv.org/abs/2603.24916)
Keywords: generation, generative
Abstract: Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) mixers often dominate memory even after INT8 quantization across vision, audio, and wearable sensing. We present HYPER-TINYPW, a compression-as-generation approach that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and adds only a one-off synthesis cost; steady-state latency and energy match INT8 separable CNN baselines. Enforcing a shared latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We contribute (i) TinyML-faithful packed-byte accounting covering generator, heads/factorization, codes, kept PW1, and backbone; (ii) a unified evaluation with validation-tuned t* and bootstrap confidence intervals; and (iii) a deployability analysis covering integer-only inference and boot versus lazy synthesis. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPER-TINYPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a roughly 1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, indicating a wider role for compression-as-generation in resource-constrained ML systems. Beyond ECG, HYPER-TINYPW transfers to TinyML audio: on Speech Commands it reaches 96.2% test accuracy (98.2% best validation), supporting broader applicability to embedded sensing workloads where repeated linear mixers dominate memory.
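Worked check (not from the paper): the size figures quoted in the abstract are mutually consistent, which a few lines of arithmetic confirm. The only assumption is that 1 MB is taken as 1000 kB.

```python
compressed_kb = 225.0      # "about 225 kB"
compression_ratio = 6.31   # "6.31x smaller"

# Fraction of bytes removed by a 6.31x reduction: 1 - 1/6.31
fewer_bytes_pct = (1.0 - 1.0 / compression_ratio) * 100.0

# Baseline size implied by the ratio, in MB (assuming 1 MB = 1000 kB)
implied_baseline_mb = compressed_kb * compression_ratio / 1000.0

print(f"{fewer_bytes_pct:.2f}% fewer bytes, baseline of about {implied_baseline_mb:.2f} MB")
```

The first value rounds to the reported 84.15%, and the implied baseline of about 1.42 MB agrees with the "roughly 1.4 MB" CNN.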

Title: GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

Authors: Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth
Subjects: cs.LG, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.24925
Pdf URL: https://arxiv.org/pdf/2603.24925
Copy Paste: [[2603.24925]] GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation(https://arxiv.org/abs/2603.24925)
Keywords: generation
Abstract: Semantic search in retrieval-augmented generation (RAG) systems is often insufficient for complex information needs, particularly when relevant evidence is scattered across multiple sources. Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries. However, these methods do not fully leverage the organizational structure of the data and instead rely on iterative exploration, which can lead to inefficient retrieval. Another class of approaches employs knowledge graphs to model non-semantic relationships through graph edges. Although effective in capturing richer proximities, such methods incur significant maintenance costs and are often incompatible with the vector stores used in most production systems. To address these limitations, we propose GraphER, a graph-based enrichment and reranking method that captures multiple forms of proximity beyond semantic similarity. GraphER independently enriches data objects during offline indexing and performs graph-based reranking over candidate objects at query time. This design does not require a knowledge graph, allowing GraphER to integrate seamlessly with standard vector stores. In addition, GraphER is retriever-agnostic and introduces negligible latency overhead. Experiments on multiple retrieval benchmarks demonstrate the effectiveness of the proposed approach.

Title: TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization

Authors: Xuepeng Jing, Wenhuan Lu, Hao Meng, Zhizhi Yu, Jianguo Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24936
Pdf URL: https://arxiv.org/pdf/2603.24936
Copy Paste: [[2603.24936]] TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization(https://arxiv.org/abs/2603.24936)
Keywords: generation, generative
Abstract: Human trajectory forecasting is important for intelligent multimedia systems operating in visually complex environments, such as autonomous driving and crowd surveillance. Although Conditional Flow Matching (CFM) has shown strong ability in modeling trajectory distributions from spatio-temporal observations, existing approaches still focus primarily on supervised fitting, which may leave social norms and scene constraints insufficiently reflected in generated trajectories. To address this issue, we propose TIGFlow-GRPO, a two-stage generative framework that aligns flow-based trajectory generation with behavioral rules. In the first stage, we build a CFM-based predictor with a Trajectory-Interaction-Graph (TIG) module to model fine-grained visual-spatial interactions and strengthen context encoding. This stage captures both agent-agent and agent-scene relations more effectively, providing more informative conditional features for subsequent alignment. In the second stage, we perform Flow-GRPO post-training, where deterministic flow rollout is reformulated as stochastic ODE-to-SDE sampling to enable trajectory exploration, and a composite reward combines view-aware social compliance with map-aware physical feasibility. By evaluating trajectories explored through SDE rollout, GRPO progressively steers multimodal predictions toward behaviorally plausible futures. Experiments on the ETH/UCY and SDD datasets show that TIGFlow-GRPO improves forecasting accuracy and long-horizon stability while generating trajectories that are more socially compliant and physically feasible. These results suggest that the proposed framework provides an effective way to connect flow-based trajectory modeling with behavior-aware alignment in dynamic multimedia environments.

Title: Infinite Gaze Generation for Videos with Autoregressive Diffusion

Authors: Jenna Kang, Colin Groth, Tong Wu, Finley Torrens, Patsorn Sangkloy, Gordon Wetzstein, Qi Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24938
Pdf URL: https://arxiv.org/pdf/2603.24938
Copy Paste: [[2603.24938]] Infinite Gaze Generation for Videos with Autoregressive Diffusion(https://arxiv.org/abs/2603.24938)
Keywords: generation, generative
Abstract: Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

Title: BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

Authors: Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24942
Pdf URL: https://arxiv.org/pdf/2603.24942
Copy Paste: [[2603.24942]] BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation(https://arxiv.org/abs/2603.24942)
Keywords: generation
Abstract: Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.

Title: Self-Corrected Image Generation with Explainable Latent Rewards

Authors: Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24965
Pdf URL: https://arxiv.org/pdf/2603.24965
Copy Paste: [[2603.24965]] Self-Corrected Image Generation with Explainable Latent Rewards(https://arxiv.org/abs/2603.24965)
Keywords: generation, generative
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at this https URL.

Title: PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration

Authors: Yilin Ni, Wenjie Li, Zhengxue Wang, Juncheng Li, Guangwei Gao, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.24969
Pdf URL: https://arxiv.org/pdf/2603.24969
Copy Paste: [[2603.24969]] PASDiff: Physics-Aware Semantic Guidance for Joint Real-world Low-Light Face Enhancement and Restoration(https://arxiv.org/abs/2603.24969)
Keywords: restoration
Abstract: Face images captured in real-world low light suffer from multiple degradations, such as low illumination, blur, noise, and low visibility. Existing cascaded solutions often suffer from severe error accumulation, while generic joint models lack explicit facial priors and struggle to resolve clear face structures. In this paper, we propose PASDiff, a training-free Physics-Aware Semantic Diffusion method. To achieve a plausible illumination and color distribution, we leverage inverse intensity weighting and Retinex theory to introduce photometric constraints, thereby reliably recovering visibility and natural chromaticity. To faithfully reconstruct facial details, our Style-Agnostic Structural Injection (SASI) extracts structures from an off-the-shelf facial prior while filtering out its intrinsic photometric biases, seamlessly harmonizing identity features with physical constraints. Furthermore, we construct WildDark-Face, a real-world benchmark of 700 low-light facial images with complex degradations. Extensive experiments demonstrate that PASDiff significantly outperforms existing methods, achieving a superior balance among natural illumination, color recovery, and identity consistency.

Title: GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

Authors: Zhangyu Jin, Maksim Siniukov, Deuksin Kwon, Ashutosh Chaubey, Mohammad Soleymani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25020
Pdf URL: https://arxiv.org/pdf/2603.25020
Copy Paste: [[2603.25020]] GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization(https://arxiv.org/abs/2603.25020)
Keywords: generation
Abstract: Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the "Regression-to-the-Mean" problem in listener motions, collapsing into static faces, and lack the parameter space for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework that achieves highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture enabling stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations across the Seamless Interaction and DualTalk datasets demonstrate superior performance compared to existing baselines on long-term kinematic variance, visual expressivity and semantic controllability.
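
The idea of isolating reward normalization per parameter group can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that each sampled rollout receives one scalar reward per FLAME parameter group and that "isolating" means z-scoring each group's rewards across the rollouts before summing. The group names and reward values are made up for illustration.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Z-score a list of rewards across the sampled rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(rewards_by_group):
    """Normalize each reward group on its own, then sum per rollout."""
    normalized = [normalize(values) for values in rewards_by_group.values()]
    return [sum(column) for column in zip(*normalized)]


def coupled_advantages(rewards_by_group):
    """Baseline for comparison: sum raw rewards per rollout, then normalize once."""
    totals = [sum(column) for column in zip(*rewards_by_group.values())]
    return normalize(totals)


# Four sampled rollouts; the two reward groups live on very different scales.
rewards = {
    "expression": [0.1, 0.2, 0.3, 0.4],      # hypothetical group name and values
    "head_pose": [10.0, 30.0, 20.0, 40.0],   # hypothetical group name and values
}
decoupled = decoupled_advantages(rewards)
coupled = coupled_advantages(rewards)
```

With the coupled baseline the large-scale `head_pose` reward decides the ranking, so rollout 1 beats rollout 2. With decoupled normalization both groups carry equal weight, and the two rollouts (each better on one group) come out tied.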

Title: CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering

Authors: Xu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25026
Pdf URL: https://arxiv.org/pdf/2603.25026
Copy Paste: [[2603.25026]] CARE: Training-Free Controllable Restoration for Medical Images via Dual-Latent Steering(https://arxiv.org/abs/2603.25026)
Keywords: restoration, generative
Abstract: Medical image restoration is essential for improving the usability of noisy, incomplete, and artifact-corrupted clinical scans, yet existing methods often rely on task-specific retraining and offer limited control over the trade-off between faithful reconstruction and prior-driven enhancement. This lack of controllability is especially problematic in clinical settings, where overly aggressive restoration may introduce hallucinated details or alter diagnostically important structures. In this work, we propose CARE, a training-free controllable restoration framework for real-world medical images that explicitly balances structure preservation and prior-guided refinement during inference. CARE uses a dual-latent restoration strategy, in which one branch enforces data fidelity and anatomical consistency while the other leverages a generative prior to recover missing or degraded information. A risk-aware adaptive controller dynamically adjusts the contribution of each branch based on restoration uncertainty and local structural reliability, enabling conservative or enhancement-focused restoration modes without additional model training. We evaluate CARE on noisy and incomplete medical imaging scenarios and show that it achieves strong restoration quality while better preserving clinically relevant structures and reducing the risk of implausible reconstructions. The proposed approach offers a practical step toward safer, more controllable, and more deployment-ready medical image restoration.

Title: GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

Authors: Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, Iro Armeni
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25053
Pdf URL: https://arxiv.org/pdf/2603.25053
Copy Paste: [[2603.25053]] GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator(https://arxiv.org/abs/2603.25053)
Keywords: generation
Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications.

Title: SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning

Authors: Xinyu Wang, Fei Dou, Jinbo Bi, Minghu Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.25062
Pdf URL: https://arxiv.org/pdf/2603.25062
Copy Paste: [[2603.25062]] SIGMA: Structure-Invariant Generative Molecular Alignment for Chemical Language Models via Autoregressive Contrastive Learning(https://arxiv.org/abs/2603.25062)
Keywords: generation, generative
Abstract: Linearized string representations serve as the foundation of scalable autoregressive molecular generation; however, they introduce a fundamental modality mismatch where a single molecular graph maps to multiple distinct sequences. This ambiguity leads to trajectory divergence, where the latent representations of structurally equivalent partial graphs drift apart due to differences in linearization history. To resolve this without abandoning the efficient string formulation, we propose Structure-Invariant Generative Molecular Alignment (SIGMA). Rather than altering the linear representation, SIGMA enables the model to strictly recognize geometric symmetries via a token-level contrastive objective, which explicitly aligns the latent states of prefixes that share identical suffixes. Furthermore, we introduce Isomorphic Beam Search (IsoBeam) to eliminate isomorphic redundancy during inference by dynamically pruning equivalent paths. Empirical evaluations on standard benchmarks demonstrate that SIGMA bridges the gap between sequence scalability and graph fidelity, yielding superior sample efficiency and structural diversity in multi-parameter optimization compared to strong baselines.

Title: Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Authors: Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25074
Pdf URL: https://arxiv.org/pdf/2603.25074
Copy Paste: [[2603.25074]] Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers(https://arxiv.org/abs/2603.25074)
Keywords: generation
Abstract: Concept erasure serves as a vital safety mechanism for removing unwanted concepts from text-to-image (T2I) models. While extensively studied in U-Net and dual-stream architectures (e.g., Flux), this task remains under-explored in the recent emerging paradigm of single-stream diffusion transformers (e.g., Z-Image). In this new paradigm, text and image tokens are processed as a single unified sequence via shared parameters. Consequently, directly applying prior erasure methods typically leads to generation collapse. To bridge this gap, we introduce Z-Erase, the first concept erasure method tailored for single-stream T2I models. To guarantee stable image generation, Z-Erase first proposes a Stream Disentangled Concept Erasure Framework that decouples updates and enables existing methods on single-stream models. Subsequently, within this framework, we introduce Lagrangian-Guided Adaptive Erasure Modulation, a constrained algorithm that further balances the sensitive erasure-preservation trade-off. Moreover, we provide a rigorous convergence analysis proving that Z-Erase can converge to a Pareto stationary point. Experiments demonstrate that Z-Erase successfully overcomes the generation collapse issue, achieving state-of-the-art performance across a wide range of tasks.

Title: MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Authors: Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25108
Pdf URL: https://arxiv.org/pdf/2603.25108
Copy Paste: [[2603.25108]] MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning(https://arxiv.org/abs/2603.25108)
Keywords: generation, generative
Abstract: Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable RL for MRMs with limited multimodal data. MSRL replaces the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data, and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement-learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench), without requiring additional multimodal preference annotations. Our code is available at: this https URL.

Title: MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Authors: Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano, Rintaro Yanagi, Hirokatsu Kataoka, Akio Nakamura
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25109
Pdf URL: https://arxiv.org/pdf/2603.25109
Copy Paste: [[2603.25109]] MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness(https://arxiv.org/abs/2603.25109)
Keywords: generative
Abstract: Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.
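
The notion of generating an interference texture from a closed-form expression and blending it into a training image can be sketched as follows. The abstract does not give the paper's formula, so the superposition of two cosine gratings used here, the blend weight `alpha`, and the function names are all illustrative assumptions.

```python
import math


def moire_pattern(height, width, freq_a, freq_b, angle_a, angle_b):
    """Superpose two cosine gratings; nearby frequencies and angles give beat fringes."""
    pattern = []
    for y in range(height):
        row = []
        for x in range(width):
            u, v = x / width, y / height
            phase_a = 2 * math.pi * freq_a * (u * math.cos(angle_a) + v * math.sin(angle_a))
            phase_b = 2 * math.pi * freq_b * (u * math.cos(angle_b) + v * math.sin(angle_b))
            # Each cosine lies in [-1, 1], so the sum is rescaled into [0, 1].
            row.append(0.5 + 0.25 * (math.cos(phase_a) + math.cos(phase_b)))
        pattern.append(row)
    return pattern


def mix(image, pattern, alpha):
    """Pixel-wise convex blend of an image with a pattern of the same shape."""
    return [[(1 - alpha) * p + alpha * m for p, m in zip(img_row, pat_row)]
            for img_row, pat_row in zip(image, pattern)]


image = [[0.5] * 32 for _ in range(32)]  # flat grey stand-in for a training image
pattern = moire_pattern(32, 32, freq_a=10.0, freq_b=11.0, angle_a=0.0, angle_b=0.1)
augmented = mix(image, pattern, alpha=0.3)
```

Because the pattern is a pure function of a few scalars, it can be recomputed for every batch and thrown away afterwards, which is the storage-free property the abstract emphasizes.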

Title: SEVerA: Verified Synthesis of Self-Evolving Agents

Authors: Debangshu Banerjee, Changming Xu, Gagandeep Singh
Subjects: cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.25111
Pdf URL: https://arxiv.org/pdf/2603.25111
Copy Paste: [[2603.25111]] SEVerA: Verified Synthesis of Self-Evolving Agents(https://arxiv.org/abs/2603.25111)
Keywords: generation, generative
Abstract: Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($\tau^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.
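
The "rejection sampler with a verified fallback" wrapper can be illustrated with a minimal sketch. In the paper the contract is written in first-order logic and the fallback is formally verified; here, as a loudly simplified assumption, the contract is an ordinary Python predicate and the fallback (`sorted`) is merely trusted to satisfy it. The names `guarded_call`, `flaky_sorter`, and `always_wrong` are hypothetical.

```python
import random


def guarded_call(model, contract, fallback, x, max_tries=4):
    """Return a model output that satisfies the contract, else the fallback's output."""
    for _ in range(max_tries):
        candidate = model(x)
        if contract(x, candidate):
            return candidate
    # The fallback is assumed to satisfy the contract for every input.
    return fallback(x)


def is_sorted_permutation(xs, ys):
    """Contract: ys is the ascending rearrangement of xs."""
    return sorted(xs) == list(ys)


_rng = random.Random(0)


def flaky_sorter(xs):
    """Stand-in for a generative model: sometimes swaps the first and last items."""
    ys = sorted(xs)
    if len(ys) > 1 and _rng.random() < 0.5:
        ys[0], ys[-1] = ys[-1], ys[0]
    return ys


def always_wrong(xs):
    """Worst case: a model that never meets the contract on distinct inputs."""
    return sorted(xs, reverse=True)


result_flaky = guarded_call(flaky_sorter, is_sorted_permutation, sorted, [3, 1, 2])
result_wrong = guarded_call(always_wrong, is_sorted_permutation, sorted, [3, 1, 2])
```

Whatever the wrapped model does, the caller only ever sees outputs that pass the contract, so tuning the model's parameters can change how often the fallback is used but not whether the constraint holds.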

Title: AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

Authors: Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25118
Pdf URL: https://arxiv.org/pdf/2603.25118
Copy Paste: [[2603.25118]] AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization(https://arxiv.org/abs/2603.25118)
Keywords: generation
Abstract: Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models (MLLMs) to achieve three practical document generation tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issue observed during fine-tuning, AnyDoc further incorporates a height-aware reinforcement learning (HARL) post-training procedure. By defining a reward function based on the difference between predicted and target document heights, overflow is penalized and gradually mitigated during HARL, thereby enhancing overall performance. Qualitative and quantitative experiments demonstrate that AnyDoc outperforms both general-purpose MLLMs and task-specific baselines across all three tasks.
摘要：文档生成在人工智能驱动的内容创建领域受到越来越多的关注。在这项工作中，我们通过引入 AnyDoc 来突破其界限，AnyDoc 是一个能够处理各种文档类别的多个生成任务的框架，所有这些任务都以统一的 HTML/CSS 格式表示。为了克服现有人工文档数据集的覆盖范围和规模有限的问题，AnyDoc 首先建立了一个可扩展的数据合成管道，以自动生成 HTML/CSS 形式的文档。该管道生成 DocHTML，这是一个包含 265,206 个文档样本的大型数据集，同时涵盖 111 个类别和 32 种不同的样式。此外，所有文档都配备了全面的元数据，包括设计意图、HTML/CSS 源代码、视觉资产和渲染的屏幕截图。 AnyDoc 以精选数据集为基础，对多模态大语言模型 (MLLM) 进行微调，以实现三个实用的文档生成任务：意图文档、文档去渲染和元素到文档。为了解决微调期间观察到的内容溢出问题，AnyDoc 进一步整合了高度感知强化学习 (HARL) 训练后程序。通过根据预测文档高度和目标文档高度之间的差异定义奖励函数，在 HARL 期间对溢出进行惩罚并逐渐减轻，从而提高整体性能。定性和定量实验表明，AnyDoc 在所有三项任务中均优于通用 MLLM 和特定于任务的基线。
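
A minimal sketch of what a height-based reward of the kind described for HARL could look like. The abstract only states that the reward depends on the difference between predicted and target document heights; the linear decay, the normalization by target height, the `scale` parameter, and the function name `height_reward` are all assumptions made for illustration, not AnyDoc's actual formulation.

```python
def height_reward(pred_height: float, target_height: float, scale: float = 1.0) -> float:
    """Hypothetical reward: 1.0 when heights match, decaying linearly
    with the relative height gap and clipped at 0.0 (assumed form)."""
    if target_height <= 0:
        raise ValueError("target_height must be positive")
    # Relative gap between the rendered (predicted) and target heights.
    gap = abs(pred_height - target_height) / target_height
    # Larger gaps (e.g. content overflow) earn a smaller reward.
    return max(0.0, 1.0 - scale * gap)
```

Under these assumptions a document rendered 50% taller than its target gets half the reward of an exact match, so overflow is penalized in proportion to its size.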

Title: EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Authors: Taegyoon Yoon, Yegyu Han, Seojin Ji, Jaewoo Park, Sojeong Kim, Taein Kwon, Hyung-Sin Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25135
Pdf URL: https://arxiv.org/pdf/2603.25135
Copy Paste: [[2603.25135]] EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions(https://arxiv.org/abs/2603.25135)
Keywords: restoration, generation
Abstract: Smart glass is emerging as a useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios - industrial maintenance, sports, and emergency rescue - designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions. In contrast, tracking-based approaches do show performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at this https URL.
摘要：智能玻璃正在成为一种有用的设备，因为它可以在双手忙碌、专注于任务的情况下提供大量见解。为了了解佩戴者的环境，以自我为中心的视图中的 6D 物体姿态估计变得至关重要。然而，现有的 6D 物体姿态估计基准无法捕捉现实世界中以自我为中心的应用程序的挑战，这些应用程序通常受到严重的运动模糊、动态照明和视觉障碍的影响。这种差异在受控实验室数据和混乱的现实应用之间造成了巨大差距。为了弥补这一差距，我们引入了 EgoXtreme，这是一个完全从自我中心角度捕获的新的大规模 6D 姿态估计数据集。 EgoXtreme 具有三个具有挑战性的场景 - 工业维护、运动和紧急救援 - 旨在通过极端照明、严重运动模糊和烟雾引入严重的感知模糊性。对 EgoXtreme 上最先进的可泛化姿态估计器的评估表明，它们的泛化在极端条件下无法成立，尤其是在弱光下。我们进一步证明，简单地应用图像恢复（例如去模糊）对于极端条件并没有带来积极的改善。虽然基于跟踪的方法已经出现了性能增益，但这意味着在快速运动场景中使用时间信息是有意义的。我们的结论是，EgoXtreme 是开发和评估下一代姿势估计模型的重要资源，该模型对于现实世界的自我中心视觉来说足够强大。数据集和代码可在此 https URL 获取

Title: FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation

Authors: Hongxu Ma, Guang Li, Shijie Wang, Dongzhan Zhou, Baoli Sun, Takahiro Ogawa, Miki Haseyama, Zhihui Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25144
Pdf URL: https://arxiv.org/pdf/2603.25144
Copy Paste: [[2603.25144]] FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation(https://arxiv.org/abs/2603.25144)
Keywords: generation
Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set, reducing storage and training cost, and has shown strong results on general benchmarks. Decoupled DD further improves efficiency by splitting the pipeline into pretraining, sample distillation, and soft-label generation. However, existing decoupled methods largely rely on coarse class-label supervision and optimize samples within each class in a nearly identical manner. On fine-grained datasets, this often yields distilled samples that (i) retain large intra-class variation with subtle inter-class differences and (ii) become overly similar within the same class, limiting localized discriminative cues and hurting recognition. To solve the above-mentioned problems, we propose FD$^{2}$, a dedicated framework for Fine-grained Dataset Distillation. FD$^{2}$ localizes discriminative regions and constructs fine-grained representations for distillation. During pretraining, counterfactual attention learning aggregates discriminative representations to update class prototypes. During distillation, a fine-grained characteristic constraint aligns each sample with its class prototype while repelling others, and a similarity constraint diversifies attention across same-class samples. Experiments on multiple fine-grained and general datasets show that FD$^{2}$ integrates seamlessly with decoupled DD and improves performance in most settings, indicating strong transferability.
摘要：数据集蒸馏（DD）将大型训练集压缩为小型合成集，降低了存储和训练成本，并在一般基准测试中显示出强劲的结果。解耦 DD 通过将流程分为预训练、样本蒸馏和软标签生成，进一步提高了效率。然而，现有的解耦方法很大程度上依赖于粗略的类标签监督，并以几乎相同的方式优化每个类内的样本。在细粒度数据集上，这通常会产生蒸馏样本，这些样本（i）保留较大的类内变异和细微的类间差异，并且（ii）在同一类内变得过于相似，限制局部判别线索并损害识别。为了解决上述问题，我们提出了FD$^{2}$，一个细粒度数据集蒸馏的专用框架。 FD$^{2}$ 定位判别区域并构建用于蒸馏的细粒度表示。在预训练期间，反事实注意力学习聚合判别性表征来更新类原型。在蒸馏过程中，细粒度的特征约束将每个样本与其类原型对齐，同时排斥其他样本，并且相似性约束使同一类样本之间的注意力分散。在多个细粒度和通用数据集上的实验表明，FD$^{2}$ 与解耦 DD 无缝集成，并提高了大多数设置下的性能，表明具有很强的可迁移性。

Title: Learning to Rank Caption Chains for Video-Text Alignment

Authors: Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2603.25145
Pdf URL: https://arxiv.org/pdf/2603.25145
Copy Paste: [[2603.25145]] Learning to Rank Caption Chains for Video-Text Alignment(https://arxiv.org/abs/2603.25145)
Keywords: generation
Abstract: Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show ranking optimization outperforms binary DPO for long-form content generation and assessment, and importantly, we find that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as purely a language-reweighting process.
摘要：直接偏好优化 (DPO) 是一种有效的技术，可训练语言模型以生成偏好而非偏好的响应。然而，这种二元“赢家通吃”的方法对于视觉语言模型来说并不是最优的，因为视觉语言模型的响应质量高度依赖于视觉内容。特别是，即使响应不如替代方案优选，它仍然可能忠实于视觉输入。标准的 Bradley-Terry DPO 公式缺乏这种细微差别，增加了获胜响应的权重，而没有充分考虑“失败”响应是否仍然保持高视觉保真度。在这项工作中，我们研究排名优化作为替代方案，更准确地定位响应对视觉输入的忠实度。我们专注于使用详细视频字幕的视频文本对齐，提出了一种通过重复字幕降级来大规模生成具有挑战性的、完全有序的字幕链的方法。我们的结果表明，对于长格式内容生成和评估，排名优化优于二进制 DPO，而且重要的是，我们发现这些方法需要对视觉编码器进行微调才能有效，这挑战了 DPO 作为纯粹的语言重新加权过程的观点。
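
To make the contrast between binary preference optimization and ranking over a totally ordered chain concrete, here is a small sketch. The pairwise loss is the standard Bradley-Terry negative log-likelihood. The listwise loss is a Plackett-Luce negative log-likelihood, which is an assumption: the abstract does not name the ranking objective the authors use. The scalar scores stand in for whatever per-caption reward the model assigns, and both function names are hypothetical.

```python
import math

def bradley_terry_loss(score_win: float, score_lose: float) -> float:
    """Pairwise loss: -log sigmoid(score_win - score_lose)."""
    return math.log1p(math.exp(-(score_win - score_lose)))

def plackett_luce_loss(scores_best_to_worst: list[float]) -> float:
    """Listwise loss over a chain ordered from best to worst caption
    (assumed objective, not necessarily the paper's)."""
    loss = 0.0
    for i in range(len(scores_best_to_worst) - 1):
        tail = scores_best_to_worst[i:]
        # Numerically stable log-sum-exp over the remaining candidates.
        m = max(tail)
        lse = m + math.log(sum(math.exp(s - m) for s in tail))
        # Negative log-probability that item i is picked first among the tail.
        loss += lse - scores_best_to_worst[i]
    return loss
```

For a chain of length two the listwise loss reduces to the pairwise one. Longer chains add a term for every position, so a mildly degraded caption is still pushed above a heavily degraded one.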

Title: Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Authors: Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25155
Pdf URL: https://arxiv.org/pdf/2603.25155
Copy Paste: [[2603.25155]] Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models(https://arxiv.org/abs/2603.25155)
Keywords: restoration
Abstract: Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.
摘要：多模态大语言模型对于临床视觉问答任务很有希望，但扩展到 3D 成像却受到高计算成本的阻碍。先前的方法通常依赖于 2D 切片或固定长度的令牌压缩，破坏体积连续性并模糊微妙的发现。我们提出了 Photon，一个用可变长度的标记序列来表示 3D 医学体积的框架。 Photon 引入了指令条件令牌调度和代理梯度传播，以在训练和推理过程中自适应地减少令牌，从而降低计算成本，同时减轻冗余令牌造成的注意力稀释。它结合了自定义反向传播规则和梯度恢复，以在离散令牌下降的情况下实现可微优化。为了稳定令牌压缩并确保视觉证据的可靠使用，Photon 进一步应用了正则化目标，以减轻仅语言偏差并提高可靠性。对各种医学视觉问答任务的实验表明，Photon 实现了最先进的准确性，同时减少了资源使用并加速了训练和推理。

Title: Vision Hopfield Memory Networks

Authors: Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2603.25157
Pdf URL: https://arxiv.org/pdf/2603.25157
Copy Paste: [[2603.25157]] Vision Hopfield Memory Networks(https://arxiv.org/abs/2603.25157)
Keywords: generation
Abstract: Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
摘要：最近的视觉和多模式基础骨干，例如 Transformer 系列和 Mamba 等状态空间模型，取得了显着的进展，实现了跨图像、文本等的统一建模。尽管它们在实证上取得了成功，但这些架构仍然远离人脑的计算原理，通常需要大量的训练数据，同时提供有限的可解释性。在这项工作中，我们提出了 Vision Hopfield 记忆网络（V-HMN），这是一种受大脑启发的基础主干，它将分层记忆机制与迭代细化更新相结合。具体来说，V-HMN 结合了在图像块级别提供关联记忆动态的局部 Hopfield 模块、充当上下文调制情景记忆的全局 Hopfield 模块，以及用于迭代纠错的受预测编码启发的细化规则。通过分层组织这些基于内存的模块，V-HMN 在统一框架中捕获局部和全局动态。内存检索揭示了输入和存储模式之间的关系，使决策更易于解释，而存储模式的重用则提高了数据效率。因此，这种受大脑启发的设计增强了可解释性和数据效率，超越了现有的基于自我关注或状态空间的方法。我们对公共计算机视觉基准进行了广泛的实验，V-HMN 与广泛采用的骨干架构相比取得了有竞争力的结果，同时提供了更好的可解释性、更高的数据效率和更强的生物学合理性。这些发现凸显了 V-HMN 作为下一代视觉基础模型的潜力，同时还为文本和音频等领域的多模态主干提供了通用的蓝图，从而将大脑启发计算与大规模机器学习联系起来。
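
For readers unfamiliar with Hopfield-style associative memory, the sketch below shows a softmax-weighted retrieval step of the kind used in modern Hopfield networks. This is a generic illustration under that assumption. The abstract does not specify V-HMN's update rule, and the function name, the `beta` parameter, and the toy list-of-lists representation are hypothetical.

```python
import math

def hopfield_retrieve(patterns, query, beta=1.0, steps=1):
    """Iteratively pull `query` toward stored `patterns`.

    Returns the refined state and the final attention weights over
    stored patterns (generic modern-Hopfield update, assumed here)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    state = list(query)
    for _ in range(steps):
        # Similarity of the current state to every stored pattern.
        sims = [beta * sum(p * s for p, s in zip(pat, state)) for pat in patterns]
        # Softmax over similarities (shifted by the max for stability).
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New state is the weighted average of stored patterns.
        state = [
            sum(weights[k] * patterns[k][d] for k in range(len(patterns)))
            for d in range(len(state))
        ]
    return state, weights
```

The returned weights indicate which stored patterns a given input was matched to, which is one way retrieval can expose the relationship between inputs and memory.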

Title: SportSkills: Physical Skill Learning from Sports Instructional Videos

Authors: Kumar Ashutosh, Chi Hsuan Wu, Kristen Grauman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25163
Pdf URL: https://arxiv.org/pdf/2603.25163
Copy Paste: [[2603.25163]] SportSkills: Physical Skill Learning from Sports Instructional Videos(https://arxiv.org/abs/2603.25163)
Keywords: generation
Abstract: Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x over the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.
摘要：当前的大规模视频数据集侧重于一般人类活动，但缺乏对解决身体技能学习所需的细粒度活动的深度覆盖。我们推出了 SportSkills，这是第一个大规模运动数据集，旨在通过野外视频进行身体技能学习。 SportSkills 拥有超过 36 万个教学视频，其中包含超过 63 万个视觉演示，并配有教学旁白，解释 55 种不同运动项目动作背后的专业知识。通过一系列实验，我们证明了 SportSkills 解锁了理解身体动作之间细微差异的能力。使用在传统以活动为中心的数据集上训练的相同模型，我们的表示实现了高达 4 倍的增益。至关重要的是，在 SportSkills 的基础上，我们引入了第一个大规模任务制定，包括错误条件教学视频检索、桥接表征学习和可操作的反馈生成（例如，“这是我执行的一项技能；我应该观看哪个视频剪辑来改进它？”）。专业教练的正式评估表明，我们的检索方法显着提高了视频模型针对用户查询个性化视觉指令的能力。

Title: Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Authors: Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2603.25178
Pdf URL: https://arxiv.org/pdf/2603.25178
Copy Paste: [[2603.25178]] Bilingual Text-to-Motion Generation: A New Benchmark and Baselines(https://arxiv.org/abs/2603.25178)
Keywords: generation
Abstract: Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual diffusion models and translation baselines on BiHumanML3D, underscoring the critical necessity and reliability of our dataset and the effectiveness of our alignment strategy for cross-lingual motion synthesis. The dataset and code are released at this https URL.
摘要：文本到动作的生成对于跨语言应用具有巨大的潜力，但由于缺乏双语数据集和现有语言模型的跨语言语义理解较差而受到阻碍。为了解决这些差距，我们引入了 BiHumanML3D，这是第一个双语文本到运动基准，通过法学硕士辅助注释和严格的手动校正构建。此外，我们提出了一个简单而有效的基线，即双语运动扩散（BiMD），其特征是跨语言对齐（CLA）。 CLA 明确地调整了跨语言的语义表示，创建了一个强大的条件空间，可以从双语输入生成高质量的运动，包括零样本代码切换场景。大量实验表明，带有 CLA 的 BiMD 的 FID 为 0.045 vs. 0.169，R@3 为 82.8\% vs. 80.8\%，显着优于 BiHumanML3D 上的单语言扩散模型和翻译基线，强调了我们数据集的关键必要性和可靠性以及我们跨语言运动合成对齐策略的有效性。数据集和代码发布于\href{这个https URL}{这个https URL}

Title: VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, Sandy Engelhardt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25181
Pdf URL: https://arxiv.org/pdf/2603.25181
Copy Paste: [[2603.25181]] VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers(https://arxiv.org/abs/2603.25181)
Keywords: generation, generative
Abstract: Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at this https URL.
摘要：扩散模型已成为高保真医学图像合成的主要方法。然而，大多数现有的 3D 医学图像生成方法依赖于潜在扩散框架内的卷积 U-Net 主干。虽然有效，但这些架构强加了强烈的局部性偏差和有限的感受野，这可能会限制可扩展性、全局上下文集成和灵活的调节。在这项工作中，我们介绍了 VolDiT，这是第一个纯粹基于变压器的 3D 扩散变压器，用于体积医学图像合成。我们的方法通过体积补丁嵌入和直接在 3D 令牌上运行的全局自注意力，将扩散变换器扩展到原生 3D 数据。为了实现结构化控制，我们提出了一种时间步门控控制适配器，它将分段掩码映射到可学习的控制令牌中，以在去噪期间调制变压器层。这种令牌级调节机制允许精确的空间引导，同时保留变压器架构的建模优势。我们在高分辨率 3D 医学图像合成任务上评估我们的模型，并将其与基于 U-Net 的最先进的 3D 潜在扩散模型进行比较。结果表明，整体一致性得到改善，生成保真度更高，可控性也得到增强。我们的研究结果表明，完全基于变压器的扩散模型为体积医学图像合成提供了灵活的基础。使用公共数据训练的代码和模型可在此 https URL 中获取。

Title: Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Authors: Adam Jakobsen, Sushant Gautam, Hugo Lewi Hammer, Susanne Olofsdotter, Miriam S Johanson, Pål Halvorsen, Vajira Thambawita
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.25186
Pdf URL: https://arxiv.org/pdf/2603.25186
Copy Paste: [[2603.25186]] Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation(https://arxiv.org/abs/2603.25186)
Keywords: generation
Abstract: AI systems in healthcare research have shown potential to increase patient throughput and assist clinicians, yet progress is constrained by limited access to real patient data. To address this issue, we present a zero-shot, knowledge-guided framework for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data. The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. Evaluation was performed on six anxiety-related disorders: specific phobia, social anxiety disorder, agoraphobia, generalized anxiety disorder, separation anxiety disorder, and panic disorder. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. Privacy analyses indicate that the real data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score. Overall, grounding an LLM in clinical knowledge enables high-quality, privacy-preserving synthetic psychiatric data when real datasets are unavailable or cannot be shared.
摘要：医疗保健研究中的人工智能系统已显示出提高患者吞吐量和帮助临床医生的潜力，但进展受到对真实患者数据的访问有限的限制。为了解决这个问题，我们提出了一个用于精神科表格数据的零样本知识引导框架，其中使用精神疾病诊断和统计手册（DSM-5）和国际疾病分类（ICD-10）通过检索增强生成来引导大型语言模型（LLM）。我们使用不同的知识库组合进行了实验，以生成保护隐私的合成数据。由此产生的模型与用于合成表格数据生成的两种最先进的深度学习模型（即 CTGAN 和 TVAE）进行了基准测试，这两种模型都依赖于真实数据，因此存在潜在的隐私风险。对六种与焦虑相关的疾病进行了评估：特定恐惧症、社交焦虑症、广场恐惧症、广泛性焦虑症、分离焦虑症和恐慌症。 CTGAN 通常可以实现最佳的边际和多变量结构，而知识增强的 LLM 在配对结构上具有竞争力，并且在分离焦虑和社交焦虑方面实现了最低的配对误差。消融研究表明，与无检索法学硕士相比，临床检索可靠地提高了单变量和成对保真度。隐私分析表明，与 CTGAN 相比，真正的无数据 LLM 产生适度的重叠和较低的平均链接风险，而 TVAE 尽管 k-map 得分较低，但仍表现出广泛的重复。总体而言，当真实数据集不可用或无法共享时，以临床知识为基础的法学硕士可以提供高质量、保护隐私的合成精神病学数据。

Title: AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References

Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Yuxiao Yang, Weizhan Zhang, Caixia Yan, Bing Deng, Jieping Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25188
Pdf URL: https://arxiv.org/pdf/2603.25188
Copy Paste: [[2603.25188]] AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References(https://arxiv.org/abs/2603.25188)
Keywords: generation
Abstract: Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by inadequately accommodating diverse real-world input formats. Relying on a single source also constitutes an ill-posed scenario, causing an inherently ambiguous setting that makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preservation video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We conduct training on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.
摘要：保留身份的视频生成为创意表达提供了强大的工具，允许用户自定义包含他们喜爱的角色的视频。然而，流行的方法通常是针对单个身份参考进行设计和优化的。这一基本假设无法充分适应现实世界的不同输入格式，从而限制了创造性的灵活性。依赖单一来源也构成了一种不适定的场景，导致固有的模糊设置，使得模型很难在新的上下文中忠实地再现身份。为了解决这些问题，我们提出了 AnyID，这是一种超保真身份保存视频生成框架，具有两个核心贡献。首先，我们引入了一种可扩展的全参考架构，该架构可以有效地将异构身份输入（例如，面部、肖像和视频）统一为一个有凝聚力的表示。其次，我们提出了一种主要引用生成范例，它将一个引用指定为规范锚，并使用一种新颖的差异提示来实现精确的属性级可控性。我们在大规模、精心策划的数据集上进行训练，以确保鲁棒性和高保真度，然后使用强化学习执行最终的微调阶段。此过程利用根据人类评估构建的偏好数据集，其中注释者根据两个关键标准对视频进行成对比较：身份保真度和提示可控性。广泛的评估验证了 AnyID 在不同的任务设置中实现了超高的身份保真度以及卓越的属性级可控性。

Title: CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis

Authors: Marvin Seyfarth, Sarah Kaye Müller, Arman Ghanaat, Isabelle Ayx, Fabian Fastenrath, Philipp Wild, Alexander Hertel, Theano Papavassiliu, Salman Ul Hassan Dar, Sandy Engelhardt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25194
Pdf URL: https://arxiv.org/pdf/2603.25194
Copy Paste: [[2603.25194]] CardioDiT: Latent Diffusion Transformers for 4D Cardiac MRI Synthesis(https://arxiv.org/abs/2603.25194)
Keywords: generative
Abstract: Latent diffusion models (LDMs) have recently achieved strong performance in 3D medical image synthesis. However, modalities like cine cardiac MRI (CMR), representing a temporally synchronized 3D volume across the cardiac cycle, add an additional dimension that most generative approaches do not model directly. Instead, they factorize space and time or enforce temporal consistency through auxiliary mechanisms such as anatomical masks. Such strategies introduce structural biases that may limit global context integration and lead to subtle spatiotemporal discontinuities or physiologically inconsistent cardiac dynamics. We investigate whether a unified 4D generative model can learn continuous cardiac dynamics without architectural factorization. We propose CardioDiT, a fully 4D latent diffusion framework for short-axis cine CMR synthesis based on diffusion transformers. A spatiotemporal VQ-VAE encodes 2D+t slices into compact latents, which a diffusion transformer then models jointly as complete 3D+t volumes, coupling space and time throughout the generative process. We evaluate CardioDiT on public CMR datasets and a larger private cohort, comparing it to baselines with progressively stronger spatiotemporal coupling. Results show improved inter-slice consistency, temporally coherent motion, and realistic cardiac function distributions, suggesting that explicit 4D modeling with a diffusion transformer provides a principled foundation for spatiotemporal cardiac image synthesis. Code and models trained on public data are available at this https URL.
摘要：潜在扩散模型 (LDM) 最近在 3D 医学图像合成中取得了强大的性能。然而，像电影心脏 MRI (CMR) 这样的模式代表了整个心动周期的时间同步 3D 体积，增加了大多数生成方法无法直接建模的额外维度。相反，它们分解空间和时间或通过解剖掩模等辅助机制强制时间一致性。这种策略引入了结构偏差，可能会限制全局上下文整合并导致微妙的时空不连续性或生理上不一致的心脏动力学。我们研究统一的 4D 生成模型是否可以在无需架构分解的情况下学习连续的心脏动力学。我们提出了 CardioDiT，这是一种基于扩散变压器的短轴电影 CMR 合成的全 4D 潜在扩散框架。时空 VQ-VAE 将 2D+t 切片编码为紧凑的潜在变量，然后扩散转换器将其联合建模为完整的 3D+t 体积，在整个生成过程中耦合空间和时间。我们在公共 CMR 数据集和更大的私人队列上评估 CardioDiT，将其与时空耦合逐渐增强的基线进行比较。结果显示，切片间一致性、时间相干运动和真实的心脏功能分布得到改善，这表明使用扩散变换器的显式 4D 建模为时空心脏图像合成提供了原则基础。此 https URL 提供了基于公共数据训练的代码和模型。

Title: Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Authors: Jiahao Tian, Chenxi Song, Wei Cheng, Chi Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25209
Pdf URL: https://arxiv.org/pdf/2603.25209
Copy Paste: [[2603.25209]] Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction(https://arxiv.org/abs/2603.25209)
Keywords: generation
Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at this https URL.
摘要：使用预先训练的视频扩散模型（通常在短片上进行训练）生成长视频提出了重大挑战。直接应用这些模型进行长视频推理通常会导致视觉质量显着下降。本文指出，这个问题主要源于两个分布外（O.O.D）问题：帧级相对位置 O.O.D 和上下文长度 O.O.D。为了应对这些挑战，我们提出了 FreeLOC，一种新颖的免训练、层自适应框架，它引入了两种核心技术：用于帧级相对位置 O.O.D 的基于视频的相对位置重新编码（VRPR），一种对时间相对位置进行分层重新编码以与模型的预训练分布保持一致的多粒度策略，以及用于上下文长度 O.O.D 的分层稀疏注意力（TSA），它保留了局部细节和远程细节通过跨不同时间尺度构建注意力密度来建立依赖关系。至关重要的是，我们引入了一种层自适应探测机制，该机制可识别每个变压器层对这些 O.O.D 问题的敏感性，从而允许选择性且高效地应用我们的方法。大量的实验表明，我们的方法明显优于现有的免训练方法，在时间一致性和视觉质量方面均取得了最先进的结果。代码可从此 https URL 获取。

Title: Efficient Preemptive Robustification with Image Sharpening

Authors: Jiaming Liang, Chi-Man Pun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25244
Pdf URL: https://arxiv.org/pdf/2603.25244
Copy Paste: [[2603.25244]] Efficient Preemptive Robustification with Image Sharpening(https://arxiv.org/abs/2603.25244)
Keywords: generation
Abstract: Despite their great success, deep neural networks rely on high-dimensional, non-robust representations, making them vulnerable to imperceptible perturbations, even in transfer scenarios. To address this, both training-time defenses (e.g., adversarial training and robust architecture design) and post-attack defenses (e.g., input purification and adversarial detection) have been extensively studied. Recently, a limited body of work has preliminarily explored a pre-attack defense paradigm, termed preemptive robustification, which introduces subtle modifications to benign samples prior to attack to proactively resist adversarial perturbations. Unfortunately, their practical applicability remains questionable due to several limitations, including (1) reliance on well-trained classifiers as surrogates to provide robustness priors, (2) substantial computational overhead arising from iterative optimization or trained generators for robustification, and (3) limited interpretability of the optimization- or generation-based robustification processes. Inspired by recent studies revealing a positive correlation between texture intensity and the robustness of benign samples, we show that image sharpening alone can efficiently robustify images. To the best of our knowledge, this is the first surrogate-free, optimization-free, generator-free, and human-interpretable robustification approach. Extensive experiments demonstrate that sharpening yields remarkable robustness gains with low computational cost, especially in transfer scenarios.
摘要：尽管深度神经网络取得了巨大成功，但它依赖于高维、非鲁棒的表示，这使得它们很容易受到难以察觉的扰动，即使在传输场景中也是如此。为了解决这个问题，训练时防御（例如，对抗性训练和稳健的架构设计）和攻击后防御（例如，输入净化和对抗性检测）都得到了广泛的研究。最近，有限的工作初步探索了一种攻击前防御范式，称为先发制人的鲁棒性，它在攻击前对良性样本进行微妙的修改，以主动抵抗对抗性扰动。不幸的是，由于存在一些局限性，它们的实际适用性仍然存在疑问，包括（1）依赖训练有素的分类器作为替代来提供鲁棒性先验，（2）迭代优化或经过训练的鲁棒性生成器产生大量计算开销，以及（3）基于优化或基于生成的鲁棒性过程的可解释性有限。最近的研究揭示了纹理强度与良性样本的鲁棒性之间存在正相关性，受到启发，我们表明仅图像锐化就可以有效地鲁棒图像。据我们所知，这是第一个无代理、无优化、无生成器且人类可解释的鲁棒性方法。大量实验表明，锐化可以以较低的计算成本带来显着的鲁棒性增益，尤其是在传输场景中。
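
As an illustration of how lightweight a sharpening-based preprocessing step can be, here is a small unsharp-masking sketch on a grayscale image stored as nested lists with values in [0, 1]. The abstract says only that image sharpening is applied. The 3x3 box blur, the `amount` parameter, and the function names are assumptions, not the authors' exact operator.

```python
def box_blur(img):
    """3x3 mean filter; border pixels average over the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out

def unsharp_mask(img, amount=1.0):
    """Sharpen by adding back the detail removed by blurring (assumed operator)."""
    blur = box_blur(img)
    return [
        # Add the high-frequency residual, then clip to the valid range.
        [min(1.0, max(0.0, p + amount * (p - b))) for p, b in zip(row, brow)]
        for row, brow in zip(img, blur)
    ]
```

The operation needs no classifier, no iterative optimization, and no trained generator: flat regions are left unchanged while contrast across edges is amplified.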

Title: Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Authors: Qingfeng Li, Haoxian Zhang, Xu He, Songlin Tang, Zhixue Fang, Xiaoqiang Liu, Pengfei Wan, Guoqi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25249
Pdf URL: https://arxiv.org/pdf/2603.25249
Copy Paste: [[2603.25249]] Semantic-Aware Prefix Learning for Token-Efficient Image Generation(https://arxiv.org/abs/2603.25249)
Keywords: generation, generative
Abstract: Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive-Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
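The tail token dropping strategy described in the SMAP abstract can be illustrated with a small sketch. The abstract does not specify how budgets are drawn or how dropped positions are represented, so the uniform budget sampling in `sample_budget` and the `mask_token` placeholder below are assumptions, and both function names are hypothetical:

```python
import random


def drop_tail(tokens, keep, mask_token=None):
    """Keep the first `keep` latent tokens and replace the tail with a placeholder."""
    if not 0 <= keep <= len(tokens):
        raise ValueError("keep must be between 0 and len(tokens)")
    return list(tokens[:keep]) + [mask_token] * (len(tokens) - keep)


def sample_budget(n_tokens, min_keep, rng):
    """Draw a token budget uniformly from [min_keep, n_tokens].

    Assumption: a uniform draw is used purely for illustration; the paper's
    actual budget schedule is not stated in the abstract.
    """
    return rng.randint(min_keep, n_tokens)
```

Because only a prefix survives each draw, early tokens (and any semantic condition fed alongside them) must carry the information needed for reconstruction, which is the pressure the abstract describes.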

Title: Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework

Authors: Hongru Han, Tingrui Guo, Liming Zhang, Yan Su, Qiwen Xu, Zhuohua Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25296
Pdf URL: https://arxiv.org/pdf/2603.25296
Copy Paste: [[2603.25296]] Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework(https://arxiv.org/abs/2603.25296)
Keywords: restoration
Abstract: Low-light image enhancement (LLIE) has traditionally been formulated as a deterministic mapping. However, this paradigm often struggles to account for the ill-posed nature of the task, where unknown ambient conditions and sensor parameters create a multimodal solution space. Consequently, state-of-the-art methods frequently encounter luminance discrepancies between predictions and labels, often necessitating "gt-mean" post-processing to align output luminance for evaluation. To address this fundamental limitation, we propose a transition toward Controllable Low-light Enhancement (CLE), explicitly reformulating the task as a well-posed conditional problem. To this end, we introduce CLE-RWKV, a holistic framework supported by Light100, a new benchmark featuring continuous real-world illumination transitions. To resolve the conflict between luminance control and chromatic fidelity, a noise-decoupled supervision strategy in the HVI color space is employed, effectively separating illumination modulation from texture restoration. Architecturally, to adapt efficient State Space Models (SSMs) for dense prediction, we leverage a Space-to-Depth (S2D) strategy. By folding spatial neighborhoods into channel dimensions, this design allows the model to recover local inductive biases and effectively bridge the "scanning gap" inherent in flattened visual sequences without sacrificing linear complexity. Experiments across seven benchmarks demonstrate that our approach achieves competitive performance and robust controllability, providing a real-world multi-illumination alternative that significantly reduces the reliance on gt-mean post-processing.
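The Space-to-Depth (S2D) step mentioned in the abstract, folding spatial neighborhoods into channel dimensions, can be sketched on a nested-list tensor of shape H x W x C. This is the generic S2D rearrangement under the assumption of a block size `r` that divides both H and W; how CLE-RWKV wires it into the state space model is not shown here:

```python
def space_to_depth(x, r=2):
    """Fold each r x r spatial block into the channel axis.

    Input:  nested list of shape H x W x C.
    Output: nested list of shape (H / r) x (W / r) x (C * r * r).
    """
    h, w = len(x), len(x[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for by in range(0, h, r):
        row = []
        for bx in range(0, w, r):
            cell = []
            # Concatenate the channels of every pixel in the block, row-major.
            for dy in range(r):
                for dx in range(r):
                    cell.extend(x[by + dy][bx + dx])
            row.append(cell)
        out.append(row)
    return out
```

With `r = 2`, a flattened scan over the output visits a quarter as many positions, and each position already contains its 2 x 2 neighborhood, which is one way to read the abstract's remark about recovering local inductive bias in flattened sequences.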

Title: MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Authors: Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25319
Pdf URL: https://arxiv.org/pdf/2603.25319
Copy Paste: [[2603.25319]] MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data(https://arxiv.org/abs/2603.25319)
Keywords: generation, generative
Abstract: Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions (Customization, Illustration, Spatial reasoning, and Temporal dynamics) to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

Title: CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration

Authors: Keming Ye, Zhou Zhao, Fan Wu, Shengyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25463
Pdf URL: https://arxiv.org/pdf/2603.25463
Copy Paste: [[2603.25463]] CIAR: Interval-based Collaborative Decoding for Image Generation Acceleration(https://arxiv.org/abs/2603.25463)
Keywords: generation
Abstract: Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment, causing disruptive latency. We address this via a cloud-device collaboration framework, CIAR, which utilizes on-device self-verification to handle two key properties of visual synthesis: the vast token vocabulary required for high-fidelity images, and inherent spatial redundancy, which leads to extreme predictability in homogeneous regions while object boundaries exhibit high uncertainty. Uniform verification wastes resources on such redundant tokens. Our solution centers on an on-device token uncertainty quantifier, which adopts continuous probability intervals to accelerate processing and make it feasible for large visual vocabularies instead of conventional discrete solution sets. Additionally, we incorporate an Interval-enhanced decoding module to further speed up decoding while maintaining visual fidelity and semantic consistency via a distribution alignment training strategy. Extensive experiments demonstrate that CIAR achieves a 2.18x speed-up and reduces cloud requests by 70%, while preserving image quality compared to existing methods.

Title: RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

Authors: Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, Jinghong Lan, Shiyu Liu, Yuqi Peng, Gang Yu, Shifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25502
Pdf URL: https://arxiv.org/pdf/2603.25502
Copy Paste: [[2603.25502]] RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models(https://arxiv.org/abs/2603.25502)
Keywords: restoration
Abstract: Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.

Title: Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training

Authors: Xiangyang Luo, Qingyu Li, Yuming Li, Guanbo Huang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Shao-Lun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25527
Pdf URL: https://arxiv.org/pdf/2603.25527
Copy Paste: [[2603.25527]] Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training(https://arxiv.org/abs/2603.25527)
Keywords: generation
Abstract: Recent advances in video generation models have achieved impressive results. However, these models rely heavily on high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We find that visual quality and motion intensity inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of timestep selection in the training process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model's learning process. Specifically, motion-rich data is sampled preferentially at higher timesteps, while data with high visual quality is more likely to be sampled at lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.
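The timestep-dependent sampling described in the TQD abstract can be sketched as follows. The abstract states only the direction of the skew (motion-rich data toward higher timesteps, high-visual-quality data toward lower ones); the Beta distributions, the category labels, and the function name `sample_timestep` are assumptions chosen for illustration, as is the convention that a higher timestep index means more noise:

```python
import random


def sample_timestep(kind, num_timesteps, rng):
    """Draw a training timestep whose distribution depends on the data type.

    Assumption: Beta(2, 1) and Beta(1, 2) are stand-ins for whatever skewed
    distributions the paper actually uses.
    """
    if kind == "motion_rich":
        u = rng.betavariate(2.0, 1.0)  # density increases toward 1: higher timesteps
    elif kind == "high_visual":
        u = rng.betavariate(1.0, 2.0)  # density increases toward 0: lower timesteps
    else:
        u = rng.random()               # balanced data: uniform sampling
    # Map the unit interval to integer timesteps in [0, num_timesteps - 1].
    return min(int(u * num_timesteps), num_timesteps - 1)
```

Under this sketch each clip still sees every timestep with nonzero probability; only the relative frequency changes, so the two imbalanced subsets contribute gradients mostly where the abstract says they resemble those of golden data.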

Title: BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Authors: Ning Ding, Keisuke Fujii, Toru Tamaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25533
Pdf URL: https://arxiv.org/pdf/2603.25533
Copy Paste: [[2603.25533]] BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning(https://arxiv.org/abs/2603.25533)
Keywords: generation
Abstract: Understanding tactical dynamics in badminton requires analyzing entire matches rather than isolated clips. However, existing badminton datasets mainly focus on short clips or task-specific annotations and rarely provide full-match data with dense multimodal annotations. This limitation makes it difficult to generate accurate shot captions and perform match-level analysis. To address this limitation, we introduce the first Badminton Full Match Dense (BFMD) dataset, with 19 broadcast matches (including both singles and doubles) covering over 20 hours of play, comprising 1,687 rallies and 16,751 hit events, each annotated with a shot caption. The dataset provides hierarchical annotations including match segments, rally events, and dense rally-level multimodal annotations such as shot types, shuttle trajectories, player pose keypoints, and shot captions. We develop a VideoMAE-based multimodal captioning framework with a Semantic Feedback mechanism that leverages shot semantics to guide caption generation and improve semantic consistency. Experimental results demonstrate that multimodal modeling and semantic feedback improve shot caption quality over RGB-only baselines. We further showcase the potential of BFMD by analyzing the temporal evolution of tactical patterns across full matches.

Title: An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae

Authors: Neha K. Nair, Aaron D'Souza
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2603.25561
Pdf URL: https://arxiv.org/pdf/2603.25561
Copy Paste: [[2603.25561]] An Integrative Genome-Scale Metabolic Modeling and Machine Learning Framework for Predicting and Optimizing Biofuel-Relevant Biomass Production in Saccharomyces cerevisiae(https://arxiv.org/abs/2603.25561)
Keywords: generative
Abstract: Saccharomyces cerevisiae is a cornerstone organism in industrial biotechnology, valued for its genetic tractability and robust fermentative capacity. Accurately predicting biomass flux across diverse environmental and genetic perturbations remains a significant challenge for rational strain design. We present a computational framework combining the Yeast9 genome-scale metabolic model with machine learning and optimization to predict, interpret, and enhance biomass flux. Flux balance analysis generated 2,000 flux profiles by varying glucose, oxygen, and ammonium uptake rates. Random Forest and XGBoost regressors achieved R² of 0.99989 and 0.9990, respectively. A variational autoencoder revealed four distinct metabolic clusters, and SHAP analysis identified glycolysis, the TCA cycle, and lipid biosynthesis as key biomass determinants. In silico overexpression achieved a biomass flux of 0.979 gDW/hr, while Bayesian optimization of nutrient constraints produced a 12-fold increase (0.0858 to 1.041 gDW/hr). A generative adversarial network proposed stoichiometrically feasible novel flux configurations. This framework demonstrates how genome-scale simulation, interpretable ML, and generative modeling can advance yeast metabolic engineering.

Title: GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing

Authors: Xuran Hu, Zhitong Xiong, Zhongcheng Hong, Yifang Ban, Xiaoxiang Zhu, Wufan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25565
Pdf URL: https://arxiv.org/pdf/2603.25565
Copy Paste: [[2603.25565]] GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing(https://arxiv.org/abs/2603.25565)
Keywords: generation
Abstract: Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.

Title: Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25706
Pdf URL: https://arxiv.org/pdf/2603.25706
Copy Paste: [[2603.25706]] Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training(https://arxiv.org/abs/2603.25706)
Keywords: generation
Abstract: Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

Title: Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25711
Pdf URL: https://arxiv.org/pdf/2603.25711
Copy Paste: [[2603.25711]] Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs(https://arxiv.org/abs/2603.25711)
Keywords: generation
Abstract: Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
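The core idea in the VISAGE abstract, penalizing spatially uniform attention when ranking candidate tokens, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the normalized Shannon entropy, the use of the mean over heads as the "consensus", the linear penalty with weight `lam`, and the names `spatial_entropy` and `rerank`:

```python
import math


def spatial_entropy(attn):
    """Normalized Shannon entropy of one attention map over image positions.

    Returns 0.0 for a fully localized map and 1.0 for a uniform one.
    """
    if len(attn) < 2:
        raise ValueError("need at least two spatial positions")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    h = 0.0
    for a in attn:
        p = a / total
        if p > 0.0:
            h -= p * math.log(p)
    return h / math.log(len(attn))


def rerank(candidates, lam=1.0):
    """Order candidate tokens by log-probability minus an entropy penalty.

    Each candidate is (token, log_prob, heads), where `heads` is a list of
    per-head attention maps. Averaging over heads is an assumed stand-in for
    the paper's localization consensus.
    """
    scored = []
    for token, log_prob, heads in candidates:
        mean_entropy = sum(spatial_entropy(h) for h in heads) / len(heads)
        scored.append((log_prob - lam * mean_entropy, token))
    scored.sort(reverse=True)
    return [token for _, token in scored]
```

With `lam = 0` the ranking reduces to the language-only ordering that the abstract identifies as the source of hallucination; increasing `lam` shifts preference toward tokens whose attention concentrates on a specific image region.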

Title: Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25716
Pdf URL: https://arxiv.org/pdf/2603.25716
Copy Paste: [[2603.25716]] Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models(https://arxiv.org/abs/2603.25716)
Keywords: generation
Abstract: Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

Title: PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25730
Pdf URL: https://arxiv.org/pdf/2603.25730
Copy Paste: [[2603.25730]] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference(https://arxiv.org/abs/2603.25730)
Keywords: generation
Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis.
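The three-partition history management in the PackForcing abstract can be sketched at the level of frame indices. This is a bookkeeping illustration only: the partition sizes, the per-frame relevance `scores`, and the function names are hypothetical, the mid-token compression network is omitted, and reassigning contiguous integer positions is a crude stand-in for the paper's continuous Temporal RoPE Adjustment:

```python
def partition_history(n_frames, n_sink, n_recent):
    """Split frame indices into sink (earliest), mid, and recent (latest)."""
    idx = list(range(n_frames))
    cut = max(n_sink, n_frames - n_recent)  # start of the recent window
    return idx[:n_sink], idx[n_sink:cut], idx[cut:]


def select_mid(mid, scores, k):
    """Keep the k highest-scoring mid frames, returned in temporal order."""
    top = sorted(mid, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def build_context(n_frames, scores, n_sink=2, n_recent=4, k=3):
    """Assemble a bounded context and gap-free positions for it."""
    sink, mid, recent = partition_history(n_frames, n_sink, n_recent)
    kept = sink + select_mid(mid, scores, k) + recent
    # Assumed simplification: contiguous positions hide the gaps left by
    # dropped mid frames, so the model never sees a position jump.
    positions = list(range(len(kept)))
    return kept, positions
```

The context length is capped at `n_sink + k + n_recent` regardless of how long the history grows, which is the property that keeps the KV cache bounded in the abstract's description.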

Title: BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Authors: Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25732
Pdf URL: https://arxiv.org/pdf/2603.25732
Copy Paste: [[2603.25732]] BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation(https://arxiv.org/abs/2603.25732)
Keywords: generation, generative
Abstract: Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.
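Summary: Recent advances in image generation models have extended their applications from aesthetic imagery to practical visual content creation. Existing benchmarks, however, focus mainly on natural image synthesis and do not systematically evaluate models under the structured, multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark covers five representative document types (slides, charts, webpages, posters, and scientific figures) and evaluates four key capability dimensions (text rendering, layout control, attribute binding, and knowledge-based reasoning), forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We benchmark 26 popular image generation systems at scale, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval will serve as a standardized benchmark for real-world commercial visual content generation.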
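
The benchmark's size figures follow from simple arithmetic, sketched below. The task grid (5 document types x 4 capability dimensions) comes from the abstract; the per-task and per-prompt figures are averages only, since the abstract does not say that prompts or questions are split evenly.

```python
from itertools import product

DOCUMENT_TYPES = ["slides", "charts", "webpages", "posters", "scientific figures"]
CAPABILITIES = [
    "text rendering",
    "layout control",
    "attribute binding",
    "knowledge-based reasoning",
]

# Every (document type, capability) pair is one evaluation task.
tasks = list(product(DOCUMENT_TYPES, CAPABILITIES))

NUM_PROMPTS = 400
NUM_QUESTIONS = 8000

# Averages only; an even split across tasks is an assumption.
prompts_per_task = NUM_PROMPTS / len(tasks)
questions_per_prompt = NUM_QUESTIONS / NUM_PROMPTS

print(len(tasks), prompts_per_task, questions_per_prompt)
```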

Title: Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Authors: Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25734
Pdf URL: https://arxiv.org/pdf/2603.25734
Copy Paste: [[2603.25734]] Unleashing Guidance Without Classifiers for Human-Object Interaction Animation(https://arxiv.org/abs/2603.25734)
Keywords: generation
Abstract: Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
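Summary: Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign each its own noise level with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware and can be strengthened by augmenting training with a broad spectrum of synthetic object geometries, which encourages contact semantics to be invariant to geometric diversity. Extensive experiments show that pace-induced guidance mirrors the benefits of contact priors more effectively than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.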
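
The idea of asynchronous denoising schedules can be illustrated with a toy schedule in which each component follows the same linear ramp but starts at a different step. This is an assumption-laden sketch rather than LIGHT's actual schedule: the function `async_noise_levels`, the linear ramp, the component names, and the choice of which component leads are all hypothetical.

```python
def async_noise_levels(step, total_steps, offsets):
    """Per-component noise level in [0, 1] at a shared denoising step.

    offsets maps a component name to the number of steps it lags behind.
    A component with a smaller offset is cleaner at any shared step.
    """
    # Length of each component's ramp, so the slowest one finishes on time.
    span = max(1, total_steps - max(offsets.values()))
    levels = {}
    for name, offset in offsets.items():
        progress = (step - offset) / span
        progress = min(1.0, max(0.0, progress))
        levels[name] = 1.0 - progress
    return levels


# Hypothetical split: the object component leads, the human component lags.
offsets = {"object": 0, "human": 2}
for step in (0, 4, 10):
    print(step, async_noise_levels(step, 10, offsets))
```

At step 4 the leading component is already cleaner than the lagging one, which is the condition under which a cleaner component can guide a noisier one.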

Title: How good was my shot? Quantifying Player Skill Level in Table Tennis

Authors: Akihiro Kubota, Tomoya Hasegawa, Ryo Kawahara, Ko Nishino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25736
Pdf URL: https://arxiv.org/pdf/2603.25736
Copy Paste: [[2603.25736]] How good was my shot? Quantifying Player Skill Level in Table Tennis(https://arxiv.org/abs/2603.25736)
Keywords: generative
Abstract: Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports -- specifically table tennis -- where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context -- including player positioning and opponent behaviors -- the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.
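Summary: Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill is challenging, however, because it is latent in the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports, specifically table tennis, where skill shows not only in complex movements but also in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed these models in a common latent space that encodes individual characteristics, including those related to skill level. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context, including player positioning and opponent behavior, the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we show that both relative and absolute skill prediction can be achieved. These results demonstrate that the learned player space effectively quantifies skill level, providing a foundation for automated skill assessment in complex, interactive behaviors.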
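
A relative ranking objective over player embeddings could look like the pairwise logistic loss below. The abstract does not specify the ranking network, so the linear scorer, the function names, and the toy embeddings are assumptions made purely for illustration.

```python
import math


def skill_score(embedding, weights):
    """Hypothetical linear scorer mapping a player embedding to a scalar."""
    return sum(e * w for e, w in zip(embedding, weights))


def pairwise_ranking_loss(stronger, weaker, weights):
    """Logistic loss that is small when `stronger` outscores `weaker`."""
    diff = skill_score(stronger, weights) - skill_score(weaker, weights)
    return math.log(1.0 + math.exp(-diff))


weights = [1.0, 0.5]
player_a = [2.0, 1.0]  # toy embedding of the higher-ranked player
player_b = [0.5, 0.0]  # toy embedding of the lower-ranked player

print(pairwise_ranking_loss(player_a, player_b, weights))
print(pairwise_ranking_loss(player_b, player_a, weights))
```

Training on many such ordered pairs would push the scorer to order players consistently. How the authors get from relative to absolute skill predictions is not specified in the abstract.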

Title: Vega: Learning to Drive with Natural Language Instructions

Authors: Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2603.25741
Pdf URL: https://arxiv.org/pdf/2603.25741
Copy Paste: [[2603.25741]] Vega: Learning to Drive with Natural Language Instructions(https://arxiv.org/abs/2603.25741)
Keywords: generation
Abstract: Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
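Summary: Vision-language-action models have reshaped autonomous driving by bringing language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning. We use the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We apply joint attention to enable interaction between the modalities and use separate projection layers for different modalities to gain more capabilities. Extensive experiments show that our method not only achieves superior planning performance but also exhibits strong instruction-following ability, paving the way for more intelligent and personalized driving systems.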

Title: RefAlign: Representation Alignment for Reference-to-Video Generation

Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25743
Pdf URL: https://arxiv.org/pdf/2603.25743
Copy Paste: [[2603.25743]] RefAlign: Representation Alignment for Reference-to-Video Generation(https://arxiv.org/abs/2603.25743)
Keywords: generation
Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy--paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
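Summary: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process with both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and feed them jointly into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle with copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurs no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark show that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.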
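
The pull/push structure of a reference alignment loss can be sketched with cosine similarity. The abstract gives no formula, so the form below (a mean `1 - cos` pull term plus a hinge push term with a hypothetical `margin`) is an assumption, not RefAlign's actual loss; plain Python lists stand in for feature vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def reference_alignment_loss(ref_feats, vfm_feats, margin=0.0):
    """ref_feats[i] and vfm_feats[i] are assumed to describe the same subject i."""
    n = len(ref_feats)
    # Pull: same-subject reference and VFM features should be similar.
    pull = sum(1.0 - cosine(ref_feats[i], vfm_feats[i]) for i in range(n)) / n
    # Push: penalize similarity above `margin` between different subjects.
    push_terms = [
        max(0.0, cosine(ref_feats[i], vfm_feats[j]) - margin)
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0
    return pull + push


subjects = [[1.0, 0.0], [0.0, 1.0]]
print(reference_alignment_loss(subjects, subjects))                  # aligned
print(reference_alignment_loss(subjects, [[0.0, 1.0], [1.0, 0.0]]))  # swapped
```

The aligned case gives the minimum loss, while swapping the subjects raises both terms. A loss of this kind only shapes training, consistent with the abstract's statement that the strategy adds no inference-time overhead.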

Title: ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2603.25746
Pdf URL: https://arxiv.org/pdf/2603.25746
Copy Paste: [[2603.25746]] ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling(https://arxiv.org/abs/2603.25746)
Keywords: generation
Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. And a RoPE discontinuity indicator is employed to explicitly distinguish the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our
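Summary: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream lets users steer an ongoing narrative through streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the inter-shot consistency and error accumulation challenges inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. It begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments show that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. The training and inference code, as well as the models, are available on our website.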
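
The dual-cache mechanism can be pictured as two bounded buffers with different lifetimes. The class below is a hypothetical sketch: the name `DualCache`, the capacities, and the rule for promoting frames to the global cache are assumptions, and the string tags merely stand in for the RoPE discontinuity indicator that separates the two caches.

```python
from collections import deque


class DualCache:
    """Toy global/local frame cache for multi-shot generation."""

    def __init__(self, global_capacity, local_capacity):
        self.global_frames = deque(maxlen=global_capacity)  # inter-shot context
        self.local_frames = deque(maxlen=local_capacity)    # intra-shot context

    def add_generated_frame(self, frame):
        """Store a frame generated within the current shot."""
        self.local_frames.append(frame)

    def end_shot(self, num_conditional=1):
        """Promote the last frames of the finished shot, then reset the local cache."""
        if num_conditional > 0:
            for frame in list(self.local_frames)[-num_conditional:]:
                self.global_frames.append(frame)
        self.local_frames.clear()

    def context(self):
        """Return frames tagged by source cache (stand-in for the indicator)."""
        tagged = [("global", f) for f in self.global_frames]
        tagged += [("local", f) for f in self.local_frames]
        return tagged


cache = DualCache(global_capacity=4, local_capacity=8)
cache.add_generated_frame("shot1_frame1")
cache.add_generated_frame("shot1_frame2")
cache.end_shot()
cache.add_generated_frame("shot2_frame1")
print(cache.context())
```

Both deques use `maxlen`, so the context stays bounded as more shots are generated.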