2025-03-19

Title: Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection

Authors: Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13500
Pdf URL: https://arxiv.org/pdf/2503.13500
Copy Paste: [[2503.13500]] Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection(https://arxiv.org/abs/2503.13500)
Keywords: generation
Abstract: Visual instructions for long-horizon tasks are crucial as they intuitively clarify complex concepts and enhance retention across extended steps. Directly generating a series of images using text-to-image models without considering the context of previous steps results in inconsistent images, increasing cognitive load. Additionally, the generated images often miss objects or render object attributes such as color, shape, and state inaccurately. To address these challenges, we propose LIGER, the first training-free framework for Long-horizon Instruction GEneration with logic and attribute self-Reflection. LIGER first generates a draft image for each step with the historical prompt and visual memory of previous steps. This step-by-step generation approach maintains consistency between images in long-horizon tasks. Moreover, LIGER utilizes various image editing tools to rectify errors including wrong attributes, logic errors, object redundancy, and identity inconsistency in the draft images. Through this self-reflection mechanism, LIGER improves the logic and object attribute correctness of the images. To verify whether the generated images assist human understanding, we manually curated a new benchmark consisting of various long-horizon tasks. Human-annotated ground truth expressions reflect the human-defined criteria for how an image should appear to be illustrative. Experiments demonstrate that the visual instructions generated by LIGER are more comprehensive than those of baseline methods.

Title: Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

Authors: Jin Kim, Byunghwee Lee, Taekho You, Jinhyuk Yun
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13531
Pdf URL: https://arxiv.org/pdf/2503.13531
Copy Paste: [[2503.13531]] Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution(https://arxiv.org/abs/2503.13531)
Keywords: generative
Abstract: The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

Title: Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

Authors: Dingkang Liang, Dingyuan Zhang, Xin Zhou, Sifan Tu, Tianrui Feng, Xiaofan Li, Yumeng Zhang, Mingyang Du, Xiao Tan, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13587
Pdf URL: https://arxiv.org/pdf/2503.13587
Copy Paste: [[2503.13587]] Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception(https://arxiv.org/abs/2503.13587)
Keywords: generation
Abstract: We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at this https URL.

Title: FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

Authors: Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13684
Pdf URL: https://arxiv.org/pdf/2503.13684
Copy Paste: [[2503.13684]] FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models(https://arxiv.org/abs/2503.13684)
Keywords: generation
Abstract: Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demo available on the anonymous website: this https URL

Title: SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

Authors: Zhenlong Yuan, Zhidong Yang, Yujun Cai, Kuangxin Wu, Mufan Liu, Dapeng Zhang, Hao Jiang, Zhaoxin Li, Zhaoqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13721
Pdf URL: https://arxiv.org/pdf/2503.13721
Copy Paste: [[2503.13721]] SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint(https://arxiv.org/abs/2503.13721)
Keywords: restoration
Abstract: Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance for patch deformation, followed by a multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid the potential inaccuracy of random initialization, we combine sparse points from LoFTR with the monocular depth map from DepthAnything V2 to restore a reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate the segmentation image with the monocular depth map to exploit inter-instance occlusion relationships, then treat the result as an occlusion map to implement two distinct edge constraints, thereby facilitating occlusion-aware patch deformation. Extensive results on the ETH3D, Tanks & Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

Title: Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

Authors: Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13724
Pdf URL: https://arxiv.org/pdf/2503.13724
Copy Paste: [[2503.13724]] Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition(https://arxiv.org/abs/2503.13724)
Keywords: generation
Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of 56 while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by 330× versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame/P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs (0.73 J/V) and fast inference (16 V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

Title: TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

Authors: Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.13730
Pdf URL: https://arxiv.org/pdf/2503.13730
Copy Paste: [[2503.13730]] TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark(https://arxiv.org/abs/2503.13730)
Keywords: generation
Abstract: Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.

Title: C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

Authors: Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13740
Pdf URL: https://arxiv.org/pdf/2503.13740
Copy Paste: [[2503.13740]] C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales(https://arxiv.org/abs/2503.13740)
Keywords: super-resolution
Abstract: In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at this http URL.

Title: FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Authors: Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour
Subjects: cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2503.13745
Pdf URL: https://arxiv.org/pdf/2503.13745
Copy Paste: [[2503.13745]] FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution(https://arxiv.org/abs/2503.13745)
Keywords: super-resolution
Abstract: Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: this https URL
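The FedVSR abstract reports its gain in PSNR. For readers unfamiliar with the metric, the sketch below shows the standard definition, PSNR = 10 * log10(MAX^2 / MSE), on flat pixel lists. The function names `psnr` and `mse_ratio_for_gain` are illustrative and do not come from the paper; the paper's actual evaluation protocol (color channel, per-frame averaging) is not stated in the abstract and is not reproduced here.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)
```

Under this definition, `mse_ratio_for_gain(0.85)` is roughly 1.22, so a 0.85 dB improvement corresponds to a mean squared error that is roughly 18% lower.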

Title: Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion

Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13769
Pdf URL: https://arxiv.org/pdf/2503.13769
Copy Paste: [[2503.13769]] Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion(https://arxiv.org/abs/2503.13769)
Keywords: generation, generative
Abstract: How can we effectively unlearn selected concepts from pre-trained generative foundation models without resorting to extensive retraining? This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models, incrementally. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts while preserving the generation of related, non-targeted concepts and alleviating generalization erosion. For this, DUGE targets three losses: a cross-attention loss that steers the focus towards images devoid of the target concept; a prior-preservation loss that safeguards knowledge related to non-target concepts; and a regularization loss that prevents the model from suffering from generalization erosion. Experimental results demonstrate the ability of the proposed approach to exclude certain concepts without compromising the overall integrity and performance of the model. This offers a pragmatic solution for refining generative models, adeptly handling the intricacies of model training and concept management while lowering the risks of copyright infringement, personal or licensed material misuse, and replication of distinctive artistic styles. Importantly, it maintains the non-targeted concepts, thereby safeguarding the model's core capabilities and effectiveness.

Title: LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Authors: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Dimitris N. Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13794
Pdf URL: https://arxiv.org/pdf/2503.13794
Copy Paste: [[2503.13794]] LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation(https://arxiv.org/abs/2503.13794)
Keywords: generation
Abstract: Large foundation models trained on large-scale visual-text data can significantly enhance Open Vocabulary Object Detection (OVD) through data generation. However, this may lead to biased synthetic data and overfitting to specific configurations. The biases of manually curated data generation can be sidestepped by directly leveraging the hidden states of Large Language Models (LLMs), an approach that is surprisingly rarely explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge transfer from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We demonstrate that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that are beneficial to grounding tasks. Experiments show that our adaptation strategy significantly enhances the performance on complex free-form text queries while leaving performance on plain categories unchanged. With our adaptation, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel, at the overhead of 8.7% more GFLOPs. Qwen2-0.5B with a larger vision encoder can further boost the performance by 6.22%. We further validate our design by ablating on varied adapter architectures, sizes of LLMs, and the layers at which adaptation is added.
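The LED abstract names a zero-initialized cross-attention adapter but does not describe its internals. The sketch below illustrates only the general idea behind zero initialization, under assumptions made here: a detector feature attends over LLM hidden states, and the result passes through an output projection `w_out` before being added back as a residual. When `w_out` starts at zero, the adapter initially leaves the detector feature unchanged, so training starts from the pretrained detector's behavior. All names, the identity key/value projections, and the single-query form are hypothetical simplifications, not the paper's architecture.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_adapter(query_feat, llm_states, w_out):
    """Residual cross-attention: one detector feature attends over LLM hidden states.

    Keys and values are the LLM states themselves (identity projections),
    which is a simplification assumed for this sketch.
    """
    dim = len(query_feat)
    # Scaled dot-product score between the query and each LLM state.
    scores = [sum(q * k for q, k in zip(query_feat, state)) / math.sqrt(dim)
              for state in llm_states]
    weights = softmax(scores)
    # Attention-weighted average of the LLM states.
    attended = [sum(w * state[i] for w, state in zip(weights, llm_states))
                for i in range(dim)]
    # The output projection gates how much LLM information is injected.
    update = matvec(w_out, attended)
    return [x + u for x, u in zip(query_feat, update)]


dim = 3
feat = [0.5, -1.0, 2.0]
states = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
zero_w_out = [[0.0] * dim for _ in range(dim)]

# With a zero-initialized output projection the adapter is an identity map.
assert cross_attention_adapter(feat, states, zero_w_out) == feat
```

Once `w_out` moves away from zero during training, the same function starts mixing LLM information into the detector feature.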

Title: FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification

Authors: Jinping Wang, Weiwei Song, Hao Chen, Jinchang Ren, Huimin Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13814
Pdf URL: https://arxiv.org/pdf/2503.13814
Copy Paste: [[2503.13814]] FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification(https://arxiv.org/abs/2503.13814)
Keywords: generation
Abstract: World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at this https URL.

Title: MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Authors: Zhixuan Liu, Haokun Zhu, Rui Chen, Jonathan Francis, Soonmin Hwang, Ji Zhang, Jean Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13816
Pdf URL: https://arxiv.org/pdf/2503.13816
Copy Paste: [[2503.13816]] MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments(https://arxiv.org/abs/2503.13816)
Keywords: generation
Abstract: We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids error accumulation common in sequential or single-room constraint in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Project page is available at: this https URL
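The MOSAIC abstract states that variance during denoising provably drops as overlapping views are added, without saying how this is shown. As loose intuition only, and under an assumption made here rather than taken from the paper (each view contributes an independent, equally noisy estimate of the same quantity), averaging n estimates divides the variance by n. The simulation below checks that textbook fact; `fused_estimate` and `empirical_variance` are illustrative names, and plain averaging stands in for whatever fusion the method actually uses.

```python
import random
import statistics


def fused_estimate(true_value, num_views, noise_std, rng):
    """Average several independent noisy observations of the same value."""
    views = [true_value + rng.gauss(0.0, noise_std) for _ in range(num_views)]
    return sum(views) / num_views


def empirical_variance(num_views, trials=2000, noise_std=1.0, seed=0):
    """Variance of the fused estimate across many repeated trials."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    samples = [fused_estimate(0.0, num_views, noise_std, rng) for _ in range(trials)]
    return statistics.pvariance(samples)


for n in (1, 4, 16):
    print(n, empirical_variance(n))
```

With unit noise, the printed variances should sit near 1, 1/4, and 1/16 respectively, up to sampling error.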

Title: Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection

Authors: Chunlei Li, Yilei Shi, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13828
Pdf URL: https://arxiv.org/pdf/2503.13828
Copy Paste: [[2503.13828]] Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection(https://arxiv.org/abs/2503.13828)
Keywords: generative
Abstract: Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at this https URL.
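The abstract above mentions a scale adaptation mechanism that softly weights distillation losses at different scales, without giving the formula. One common way to realize soft weighting, assumed here purely for illustration, is to pass one score per scale through a softmax and use the resulting weights in a weighted sum of the per-scale losses. The names `soft_scale_weights` and `weighted_distillation_loss`, and the choice of softmax itself, are hypothetical and not confirmed by the source.

```python
import math


def soft_scale_weights(scale_logits):
    """Softmax over per-scale scores; the weights are positive and sum to 1."""
    peak = max(scale_logits)
    exps = [math.exp(z - peak) for z in scale_logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_distillation_loss(per_scale_losses, scale_logits):
    """Combine per-scale losses using the soft weights."""
    weights = soft_scale_weights(scale_logits)
    return sum(w * loss for w, loss in zip(weights, per_scale_losses))


# Equal scores reduce to a plain average of the per-scale losses.
print(weighted_distillation_loss([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

Raising one scale's score shifts the total toward that scale's loss while still giving the other scales non-zero weight, which is the sense of "soft" used in this sketch.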

Title: SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Authors: Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, Junyong Noh
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13836
Pdf URL: https://arxiv.org/pdf/2503.13836
Copy Paste: [[2503.13836]] SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing(https://arxiv.org/abs/2503.13836)
Keywords: generation
Abstract: Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at the project page.
摘要：随着脱氧扩散模型的兴起，文本驱动的运动产生已显着提高。但是，以前的方法经常过分简化骨骼关节，时间框架和文字单词的表示，从而限制了它们在每种模式中充分捕获信息及其相互作用中的信息的能力。此外，当使用预训练的模型进行下游任务（例如编辑）时，它们通常需要额外的努力，包括手动干预，优化或微调。在本文中，我们引入了一种骨骼感知的潜在扩散（沙拉），该模型明确捕获了关节，框架和单词之间复杂的相互关系。此外，通过利用在生成过程中产生的跨注意地图，我们可以使用预训练的沙拉模型启用基于注意力的零击文本驱动运动编辑，不需要除文本提示以外的其他用户输入。我们的方法在不损害发电质量的情况下，在文本运动对齐方面大大优于以前的方法，并通过提供多种编辑功能来表现出实用的多功能性。代码可在项目页面上找到。

Title: Less is More: Improving Motion Diffusion Models with Sparse Keyframes

Authors: Jinseok Bae, Inwoo Hwang, Young Yoon Lee, Ziyu Guo, Joseph Liu, Yizhak Ben-Shabat, Young Min Kim, Mubbasir Kapadia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13859
Pdf URL: https://arxiv.org/pdf/2503.13859
Copy Paste: [[2503.13859]] Less is More: Improving Motion Diffusion Models with Sparse Keyframes(https://arxiv.org/abs/2503.13859)
Keywords: generation, generative
Abstract: Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
摘要：运动扩散模型的最新进展导致了各种运动生成任务（包括文本到动作综合）的显着进步。但是，现有方法表示动作作为密集的框架序列，要求该模型处理冗余或更少信息的框架。密集的动画框架的处理施加了重大的训练复杂性，尤其是在学习大型运动数据集的复杂分布即使使用现代神经体系结构时。这严重限制了下游任务的生成运动模型的性能。受专业动画师的启发，主要专注于稀疏密钥帧，我们提出了一个新颖的扩散框架，该框架围绕稀疏和几何有意义的密钥帧进行了明确设计。我们的方法通过掩盖非按键框并有效插值缺失框架来减少计算。在推断过程中，我们会动态完善钥匙帧掩模，以在以后的扩散步骤中优先考虑信息框架。广泛的实验表明，我们的方法在文本对齐和运动现实主义中始终优于最先进的方法，同时也有效地保持高性能在更少的扩散步骤中。我们通过将框架用作生成性先验并将其调整为不同的下游任务来进一步验证框架的鲁棒性。源代码和预培训模型将在接受后发布。
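Note: the abstract above mentions masking non-keyframes and interpolating the missing frames. As a rough illustration of the idea only, the sketch below fills masked frames by linear interpolation between the nearest keyframes. Linear interpolation and the function name `interpolate_from_keyframes` are assumptions made here; the paper's interpolation scheme may differ. Each frame is a single float (for example, one joint coordinate) to keep the example small.

```python
def interpolate_from_keyframes(frames, keyframe_mask):
    """Fill non-keyframe slots by linear interpolation between keyframes.

    frames: list of floats; entries at non-keyframe positions are ignored.
    keyframe_mask: list of bools, True where the frame is a keyframe.
    """
    key_idx = [i for i, keep in enumerate(keyframe_mask) if keep]
    if not key_idx:
        raise ValueError("at least one keyframe is required")

    out = list(frames)
    for i in range(len(frames)):
        if keyframe_mask[i]:
            continue
        prev = max((k for k in key_idx if k < i), default=None)
        nxt = min((k for k in key_idx if k > i), default=None)
        if prev is None:
            out[i] = frames[nxt]       # before the first keyframe: hold its value
        elif nxt is None:
            out[i] = frames[prev]      # after the last keyframe: hold its value
        else:
            w = (i - prev) / (nxt - prev)
            out[i] = (1 - w) * frames[prev] + w * frames[nxt]
    return out
```

For example, with keyframes at positions 0 and 3 holding the values 0.0 and 3.0, the two frames in between are filled with evenly spaced values.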

Title: RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving

Authors: Yujin Wang, Quanfeng Liu, Zhengxin Jiang, Tianyi Wang, Junfeng Jiao, Hongqing Chu, Bingzhao Gao, Hong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13861
Pdf URL: https://arxiv.org/pdf/2503.13861
Copy Paste: [[2503.13861]] RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving(https://arxiv.org/abs/2503.13861)
Keywords: generation
Abstract: Accurately understanding and deciding high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
摘要：准确地理解和确定高级荟萃行为对于确保可靠且安全的自主驾驶系统至关重要。尽管视觉语言模型（VLM）在各种自动驾驶任务中表现出了巨大的潜力，但它们通常会遭受诸如空间感知和幻觉不足之类的局限性，从而降低了它们在复杂的自动驾驶场景中的有效性。为了应对这些挑战，我们提出了一个检索成绩的决策（RAD）框架，这是一种新颖的体系结构，旨在增强VLMS在自主驾驶场景中可靠地产生元行为的能力。 RAD利用检索功能的生成（RAG）管道通过由嵌入流，检索流量和产生流动的三阶段过程动态提高决策精度。此外，我们在源自Nuscenes数据集的特定策划数据集上微调VLM，以增强其空间感知和鸟类的视图图像理解能力。对基于Nuscenes的数据集进行了广泛的实验评估表明，RAD在关键评估指标（包括匹配的准确性和F1得分）和自定义的总体得分中均优于基线方法，并突出了其在提高自主驱动任务的荟萃行动决策方面的有效性。
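Note: the abstract above names three stages: an embedding flow, a retrieving flow, and a generating flow. The sketch below shows only a generic retrieval step of the kind such a pipeline might use, namely ranking stored scene embeddings by cosine similarity and collecting the meta-actions attached to the closest ones. The data layout, the function names, and the use of cosine similarity are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, database, k=2):
    """Return the meta-actions of the k stored scenes most similar to the query.

    database: list of (embedding, meta_action) pairs (hypothetical layout).
    """
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [action for _, action in ranked[:k]]


def build_prompt(scene_description, retrieved_actions):
    """Assemble a text prompt that passes retrieved examples to a generator."""
    examples = "\n".join(f"- {a}" for a in retrieved_actions)
    return f"Scene: {scene_description}\nSimilar scenes used:\n{examples}\nMeta-action:"
```

In a full pipeline the query vector would come from an embedding model and the prompt would be passed to a VLM; both are left out here.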

Title: MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

Authors: Zhengsheng Guo, Linwei Zheng, Xinyang Chen, Xuefeng Bai, Kehai Chen, Min Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13882
Pdf URL: https://arxiv.org/pdf/2503.13882
Copy Paste: [[2503.13882]] MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments(https://arxiv.org/abs/2503.13882)
Keywords: generation
Abstract: While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture-of-knowledge-paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Unlike previous methods that rely only on manual evaluation, we pioneer the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.
摘要：尽管人类认知固有地从决策过程中从多种和专业知识来源中检索信息，但当前的检索效果生成（RAG）系统通常通过单源知识检索来运行，从而导致认知 - 偏金属的差异。为了弥合这一差距，我们介绍了MOK-RAG，这是一种新型的多源RAG框架，通过大型语言模型（LLM）语料库的功能分配来实现知识路径的混合，从而增强了检索机制，从而使其分为不同的部分，从而从多个专业的知识路径中获得了检索。我们提出的MOK-RAG3D应用于3D模拟环境的生成，通过将3D资产分为不同的部分并根据层次知识树结构将其分配到不同的部分，从而增强了此范式。与以前仅使用手动评估的方法不同，我们率先引入了3D场景的自动化评估方法。在我们的实验中，自动评估和人类评估都表明，Mok-rag3d可以帮助体现的AI代理产生各种场景。

Title: Where do Large Vision-Language Models Look at when Answering Questions?

Authors: Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13891
Pdf URL: https://arxiv.org/pdf/2503.13891
Copy Paste: [[2503.13891]] Where do Large Vision-Language Models Look at when Answering Questions?(https://arxiv.org/abs/2503.13891)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at this https URL.
摘要：大型视觉模型（LVLM）在视觉理解和推理任务中表现出了有希望的表现。但是，他们的视觉理解行为仍然没有被忽视。出现一个基本问题：LVLM在多大程度上取决于视觉输入，哪些图像区域有助于其反应？由于其复杂的视觉体系结构（例如，多个编码器和多分辨率）和可变长度输出，解释LVLM的自由形式生成是不平凡的。在本文中，我们扩展了现有的热图可视化方法（例如IGOS ++），以支持开放式视觉问题回答的LVLM。我们提出了一种选择视觉上相关令牌的方法，以反映生成的答案和输入图像之间的相关性。此外，我们对旨在需要视觉信息回答的基准的最先进的LVLM进行了全面分析。我们的发现为LVLM行为提供了几种见解，包括焦点区域与答案正确性之间的关系，跨体系结构的视觉注意力差异以及LLM量表对视觉理解的影响。该代码和数据可在此HTTPS URL上找到。

Title: ChatBEV: A Visual Language Model that Understands BEV Maps

Authors: Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, Ya Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13938
Pdf URL: https://arxiv.org/pdf/2503.13938
Copy Paste: [[2503.13938]] ChatBEV: A Visual Language Model that Understands BEV Maps(https://arxiv.org/abs/2503.13938)
Keywords: generation
Abstract: Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and a small amount of data, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark that contains over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code and the fine-tuned model will be released.
摘要：交通现场的理解对于智能运输系统和自动驾驶至关重要，从而确保安全有效的车辆操作。尽管VLM的最新进展显示出对整体场景理解的希望，但VLM在流量方案（尤其是使用BEV地图）上的应用仍在探索中。现有方法通常受到任务设计有限和数据量狭窄的影响，从而阻碍了全面的场景理解。为了应对这些挑战，我们介绍了chatbev-qa，一种新颖的BEV VQA基准包含137k的问题，旨在涵盖广泛的场景理解任务，包括全球场景理解，车道互动和车辆车辆的交互。该基准是使用新颖的数据收集管道构建的，该数据收集管道为BEV地图生成可扩展且有用的VQA数据。我们进一步调整了专门的视觉语言模型chatbev，使其能够解释各种问题提示并从BEV地图中提取相关的上下文感知信息。此外，我们提出了一个以语言为导向的流量场景生成管道，在该管道中，CHATBEV促进了地图的理解和文本一致的导航指南，从而大大增强了现实且一致的交通情况的生成。数据集，代码和微调模型将发布。

Title: Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization

Authors: Long Tang, Dengpan Ye, Sirun Chen, Xiuwen Shi, Yunna Lv, Ziyi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13945
Pdf URL: https://arxiv.org/pdf/2503.13945
Copy Paste: [[2503.13945]] Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization(https://arxiv.org/abs/2503.13945)
Keywords: generation
Abstract: The fine-tuning technique for text-to-image diffusion models facilitates image customization but risks privacy breaches and opinion manipulation. Current research focuses on prompt- or image-level adversarial attacks for anti-customization, yet it overlooks the correlation between these two levels and the relationship between internal modules and inputs. This hinders anti-customization performance in practical threat scenarios. We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization, which, for the first time, integrates the adversarial prompt-level attack into the generation process of image-level adversarial examples. In stage 1, we generate prompt-level adversarial vectors to guide the subsequent image-level attack. In stage 2, besides conducting the end-to-end attack on the UNet model, we disrupt its self- and cross-attention modules, aiming to break the correlations between image pixels and align the cross-attention results computed using instance prompts and adversarial prompt vectors within the images. Furthermore, we introduce a local random timestep gradient ensemble strategy, which updates adversarial perturbations by integrating random gradients from multiple segmented timesets. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword mismatch, cross-model, and cross-mechanism anti-customization with DADiff compared to existing methods.
摘要：文本到图像扩散模型的微调技术有助于图像定制，但风险违反隐私和舆论操纵。当前的研究重点是及时或图像水平的对抗性攻击，以进行反向定性化，但它忽略了这两个级别与内部模块与输入之间的关系之间的相关性。这阻碍了在实际威胁情景中的反燃烧性能。我们提出了双反扩散（DADIFF），这是一种针对扩散自定义的两阶段对抗攻击，该攻击首次将对抗性的及时升级攻击集成到图像级别对抗性示例的生成过程中。在第1阶段，我们生成及时的对抗向量，以指导后续图像级攻击。在第2阶段，除了对UNET模型进行端到端攻击外，我们还破坏了其自我和交叉注意模块，旨在打破图像像素之间的相关性并对齐使用实例提示和图像中的对抗性提示向量计算得出的跨注意结果。此外，我们引入了局部随机时间段梯度集合策略，该策略通过整合来自多个分段时间表的随机梯度来更新对抗性扰动。与现有方法相比，各种主流面部数据集的实验结果表明，交叉预报，关键字不匹配，交叉模型和跨机械抗客户化的10％-30％改善。

Title: Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Authors: Sayak Nag, Udita Ghosh, Sarosij Bose, Calvin-Khang Ta, Jiachen Li, Amit K Roy Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13947
Pdf URL: https://arxiv.org/pdf/2503.13947
Copy Paste: [[2503.13947]] Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation(https://arxiv.org/abs/2503.13947)
Keywords: generation
Abstract: Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptable to any existing SGG method, for quantifying predictive uncertainty by constructing well-calibrated prediction sets over the generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
摘要：场景图生成（SGG）旨在通过识别对象及其成对关系来表示视觉场景，从而提供对图像内容的结构化理解。但是，诸如长尾类别分布和预测变异性之类的固有挑战需要SGG的不确定性量化，以实现其实际生存能力。在本文中，我们介绍了一种基于任何现有SGG方法的基于任何现有SGG方法的新颖保形预测（CP）框架，以通过在其生成的场景图上构造良好的预测集来量化其预测性不确定性。这些场景图预测集旨在实现统计上严格的覆盖范围保证。此外，为确保这些预测集包含最实际可解释的场景图，我们设计了一种有效的基于MLLM的后处理策略，用于在这些预测集中选择最视觉和语义上最合理的场景图。我们表明，我们提出的方法可以从图像中产生各种可能的场景图，评估SGG方法的可靠性并提高整体SGG性能。
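Note: the abstract above builds prediction sets with coverage guarantees using conformal prediction. The sketch below shows standard split conformal prediction for a plain classifier, which is the textbook form of the technique and not the paper's scene-graph-specific framework. The function names and the nonconformity score (one minus the probability of the true label) are choices made here for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the score threshold from a held-out calibration set.

    cal_probs: list of per-class probability lists, one per calibration sample.
    cal_labels: list of true class indices.
    alpha: target miscoverage rate (0.1 aims for about 90% coverage).
    """
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every label
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return all class indices whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]
```

A smaller `alpha` raises the threshold and therefore yields larger prediction sets; with very few calibration samples the threshold becomes infinite and every label is kept.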

Title: Light4GS: Lightweight Compact 4D Gaussian Splatting Generation via Context Model

Authors: Mufan Liu, Qi Yang, He Huang, Wenjie Huang, Zhenlong Yuan, Zhu Li, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13948
Pdf URL: https://arxiv.org/pdf/2503.13948
Copy Paste: [[2503.13948]] Light4GS: Lightweight Compact 4D Gaussian Splatting Generation via Context Model(https://arxiv.org/abs/2503.13948)
Keywords: generation
Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a Lightweight 4DGS framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS, a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: (1) a spatio-temporal significance pruning strategy that eliminates over 64% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and (2) a deep context model that integrates intra- and inter-prediction with hyperprior into a coarse-to-fine context structure to enable efficient multiscale latent embedding compression. Our approach achieves over 120x compression and increases rendering FPS by up to 20% compared to the baseline 4DGS, and is also superior to frame-wise state-of-the-art 3DGS compression methods, revealing the effectiveness of Light4GS in terms of both intra- and inter-prediction methods without sacrificing rendering quality.
摘要：3D高斯裂（3DGS）已成为一种有效且高保真的范式，用于新型视图合成。为了适应3DGS的动态内容，可变形的3DG与可学习的潜在嵌入式的时间变形原始素结合在一起，以捕获复杂的运动。尽管性能令人印象深刻，但高维的嵌入和大量原语会导致大量的存储要求。在本文中，我们介绍了一个\ textbf {light} weight \ textbf {4} d \ textbf {gs}框架，称为light4gs，该框架采用了具有深层上下文模型的显着性修剪，以提供轻巧的存储效率动态效率动态3DGS表示。所提出的Light4GS基于4DGS，它是可变形3DG的典型表示。具体而言，我们的框架建立在两个核心组件上：（1）时空的显着性修剪策略，消除了超过64％的可变形原始素，然后是熵受约束的球形谐波压缩，应用于其余的谐波；（2）深层上下文模型，该模型将内部和间预测与超级优势整合到粗到精细的上下文结构中，以实现有效的多尺度潜在潜在嵌入压缩。与基线4DG相比，我们的方法可实现超过120倍的压缩，并将渲染fps提高到20％，并且优于框架最先进的3DGS压缩方法，从而揭示了我们的Light4Gs在不牺牲质量质量的情况下以内在和预测方法的效果。

Title: SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model

Authors: Xinqing Li, Ruiqi Song, Qingyu Xie, Ye Wu, Nanxin Zeng, Yunfeng Ai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13952
Pdf URL: https://arxiv.org/pdf/2503.13952
Copy Paste: [[2503.13952]] SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model(https://arxiv.org/abs/2503.13952)
Keywords: generation, generative
Abstract: With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels for any scene can be collected and used as the conditions for data generation in the world model. The resulting data generation pipeline is novel in that it combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that these generated images significantly improve the performance of downstream perception models. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at this https URL.
摘要：随着自动驾驶技术的快速发展，缺乏数据已成为提高感知模型准确性的主要障碍。研究人员现在正在使用世界模型来探索可控的数据生成，以使数据集多样化。但是，以前的工作仅限于研究特定公共数据集的图像产生质量。关于如何为现实世界应用场景构建数据生成引擎的研究仍然相对较少，以实现大规模的数据生成，以实现具有挑战性的场景。在本文中，提出了基于世界模型的模拟器条件的场景生成引擎。通过构建与现实世界场景，仿真数据和标签一致的仿真系统，这些模拟系统可以作为世界模型中数据生成条件的条件，以收集任何场景。通过将模拟引擎的强大场景模拟功能与世界模型的强大数据生成功能相结合，这是一种新颖的数据生成管道。此外，还提供了具有成比例构造的虚拟和真实数据的基准，用于探索现实场景中世界模型的功能。定量结果表明，这些生成的图像显着改善了下游感知模型的性能。最后，我们探讨了世界自主驾驶场景中世界模型的生成性能。所有数据和代码将在此HTTPS URL上可用。

Title: DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation

Authors: Mu Chen, Liulei Li, Wenguan Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13957
Pdf URL: https://arxiv.org/pdf/2503.13957
Copy Paste: [[2503.13957]] DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation(https://arxiv.org/abs/2503.13957)
Keywords: generation
Abstract: Leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large amounts of GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images by denoising a latent feature embedding, we unify the decoding of three tasks (object classification, bounding box regression, and graph generation) using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding that clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.
摘要：视频场景图生成（VSGG）的顶级领先解决方案通常采用离线管道。尽管展示了有希望的性能，但他们仍然无法处理实时视频流并消耗大量的GPU内存。此外，这些方法在时间推理中缺乏，仅在时间上下文中汇总了框架级预测。作为响应，我们介绍了DifFVSGG，这是一种在线VSGG解决方案，将此任务框起来是迭代场景图更新问题。从潜在扩散模型（LDMS）中汲取灵感，该模型通过将潜在特征嵌入来生成图像，我们使用一个共享功能嵌入的对象分类，边界框回归和图形生成三个任务的解码。然后，给定包含对象对的统一特征的嵌入，我们在LDMS中进行了逐步的denoing，以便提供清洁的嵌入，清楚地表明对象之间的关系。然后，此嵌入是针对特定任务的头部的输入，以进行对象分类，场景图生成等。DIFFVSGG进一步促进了连续的时间推理，在这些预测中，对后续帧的预测利用了过去帧的结果作为LDMS的条件输入，以指导当前帧的反向扩散过程。对三个动作基因组设置的广泛实验证明了diffvsgg的优越性。

Title: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Authors: Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13964
Pdf URL: https://arxiv.org/pdf/2503.13964
Copy Paste: [[2503.13964]] MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding(https://arxiv.org/abs/2503.13964)
Keywords: generation
Abstract: Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, such as MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at this https URL.
摘要：文档问答（DOCQA）是一项非常普遍的任务。使用大语言模型（LLM）或大型视觉语言模型（LVLM）和检索增强生成（RAG）的现有方法通常优先考虑单个模式的信息，从而无法有效整合文本和视觉提示。这些方法在复杂的多模式推理方面遇到了困难，从而限制了它们在实际文档上的性能。我们提出了Mdocagent（一个多模式的多模式多代理框架，用于文档理解），这是一个利用文本和图像的新颖的抹布和多代理框架。我们的系统采用五个专业代理：一般代理，关键代理，文本代理，图像代理和汇总代理。这些代理商进行了多模式环境检索，结合了他们的个人见解，以对文档的内容有更全面的了解。这种协作方法使该系统能够从文本和视觉组件中综合信息，从而提高了回答的准确性。与当前的最新方法相比，在Mmmlongbench等五个基准（例如Mmlongbench）等五个基准测试的初步实验证明了我们的Mdocagent的有效性。这项工作有助于开发更强大，更全面的DOCQA系统，能够处理包含丰富文本和视觉信息的现实世界文档的复杂性。我们的数据和代码可在此HTTPS URL上找到。

Title: DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection

Authors: Jaewoo Song, Daemin Park, Kanghyun Baek, Sangyub Lee, Jooyoung Choi, Eunji Kim, Sungroh Yoon
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.13985
Pdf URL: https://arxiv.org/pdf/2503.13985
Copy Paste: [[2503.13985]] DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection(https://arxiv.org/abs/2503.13985)
Keywords: generation
Abstract: Developing effective visual inspection models remains challenging due to the scarcity of defect data. While image generation models have been used to synthesize defect images, producing highly realistic defects remains difficult. We propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. It leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions incorporating defect, object, and attention terms. It enables precise capture of detailed, localized defect features and their seamless integration into defect-free objects. Additionally, our Low-Fidelity Selection method further enhances the defect sample quality. Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.
摘要：由于缺陷数据缺乏，开发有效的视觉检查模型仍然具有挑战性。尽管图像生成模型已用于合成缺陷图像，但产生高度逼真的缺陷仍然很困难。我们提出了缺陷，这是一种现实缺陷生成的新方法，仅需要几个参考缺陷图像。它利用了一个微调的介入扩散模型，该模型通过包括缺陷，对象和注意力术语的自定义损失功能进行了优化。它可以精确捕获详细的局部缺陷特征及其无缝集成到无缺陷对象中。此外，我们的低保真选择方法进一步提高了缺陷样品质量。实验表明，缺陷会生成高质量的缺陷图像，从而使视觉检查模型能够在MVTEC AD数据集中实现最新性能。

Title: MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling

Authors: Damian Boborzi, Phillip Mueller, Jonas Emrich, Dominik Schmid, Sebastian Mueller, Lars Mikelsons
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14002
Pdf URL: https://arxiv.org/pdf/2503.14002
Copy Paste: [[2503.14002]] MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling(https://arxiv.org/abs/2503.14002)
Keywords: generative
Abstract: Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.
摘要：生成模型最近在3D对象的领域取得了显着进度。但是，由于它们无法提供特定领域特定任务所需的准确性，质量和可控性，因此它们在工程等领域中的实际应用仍然有限。微调大生成模型是使这些模型在这些领域中可用的有前途的观点。创建高质量的域特异性3D数据集对于微调大生成模型至关重要，但是数据过滤和注释过程仍然是一个重要的瓶颈。我们提出了Meshfleet，这是从Objaverse-XL提取的经过过滤和注释的3D车辆数据集，这是最广泛的3D对象的公开收集。我们的方法提出了基于质量分类器的自动数据过滤的管道。该分类器经过手动标记的Objaverse子集的训练，并结合了Dinov2和Siglip嵌入，并通过基于字幕的分析和不确定性估计进行了完善。我们通过针对标题和图像美学得分的技术和SV3D的微调实验的比较分析来证明我们的过滤方法的功效，从而强调了目标数据选择对域特异性3D生成模型的重要性。

Title: Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning

Authors: Pengcheng Zhou, Lantian Zhang, Wei Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14013
Pdf URL: https://arxiv.org/pdf/2503.14013
Copy Paste: [[2503.14013]] Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning(https://arxiv.org/abs/2503.14013)
Keywords: generation
Abstract: Semi-supervised learning is of great significance in medical image segmentation by exploiting unlabeled data. Among its strategies, the co-training framework is prominent. However, previous co-training studies predominantly concentrate on network initialization variances and pseudo-label generation, while overlooking the equilibrium between information interchange and model diversity preservation. In this paper, we propose the Masked Image Consistency and Discrepancy Learning (MICD) framework with three key modules. The Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small sample learning via pseudo-labeling across masked-input branches. The Cross Feature Consistency (CFC) module fortifies information exchange and model robustness by ensuring decoder feature consistency. The Cross Model Discrepancy (CMD) module utilizes EMA teacher networks to oversee outputs and preserve branch diversity. Together, these modules address existing limitations by focusing on fine-grained local information and maintaining diversity in a heterogeneous framework. Experiments on two public medical image datasets, AMOS and Synapse, demonstrate that our approach outperforms state-of-the-art methods.

Title: Intra and Inter Parser-Prompted Transformers for Effective Image Restoration

Authors: Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14037
Pdf URL: https://arxiv.org/pdf/2503.14037
Copy Paste: [[2503.14037]] Intra and Inter Parser-Prompted Transformers for Effective Image Restoration(https://arxiv.org/abs/2503.14037)
Keywords: restoration, generation
Abstract: We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.

Title: AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark

Authors: Xinhao Xiang, Xiao Liu, Zizhong Li, Zhuosheng Liu, Jiawei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14064
Pdf URL: https://arxiv.org/pdf/2503.14064
Copy Paste: [[2503.14064]] AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark(https://arxiv.org/abs/2503.14064)
Keywords: generation
Abstract: The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing metrics lack a unified framework for systematically categorizing methodologies, limiting a holistic understanding of the evaluation landscape. Additionally, fragmented implementations and the absence of standardized interfaces lead to redundant processing overhead. Furthermore, many prior approaches are constrained by dataset-specific dependencies, limiting their applicability across diverse video domains. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a unified framework that provides a structured and extensible evaluation pipeline for comprehensive AI-generated video evaluation. Organized within a novel five-category taxonomy, AIGVE-Tool integrates multiple evaluation methodologies while allowing flexible customization through a modular configuration system. Additionally, we propose AIGVE-Bench, a large-scale benchmark dataset created with five SOTA video generation models based on hand-crafted instructions and prompts. This dataset systematically evaluates various video generation models across nine critical quality dimensions. Extensive experiments demonstrate the effectiveness of AIGVE-Tool in providing standardized and reliable evaluation results, highlighting specific strengths and limitations of current models and facilitating the advancement of next-generation AI-generated video techniques.

Title: Fast Autoregressive Video Generation with Diagonal Decoding

Authors: Yang Ye, Junliang Guo, Haoyu Wu, Tianyu He, Tim Pearce, Tabish Rashid, Katja Hofmann, Jiang Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14070
Pdf URL: https://arxiv.org/pdf/2503.14070
Copy Paste: [[2503.14070]] Fast Autoregressive Video Generation with Diagonal Decoding(https://arxiv.org/abs/2503.14070)
Keywords: generation, generative
Abstract: Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to $10\times$ speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
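The within-frame part of the diagonal ordering described above can be illustrated with a minimal sketch. It assumes that a "diagonal" means all token positions sharing the same row + column index, which is an assumption made here for illustration and not a detail confirmed by the abstract; the function name `diagonal_schedule` is hypothetical, and the cross-frame overlap is not modeled.

```python
def diagonal_schedule(height, width):
    """Group the (row, col) token positions of one frame by anti-diagonal.

    Assumption for illustration: tokens with the same row + col index
    form one decoding step and could be generated in parallel.
    """
    steps = [[] for _ in range(height + width - 1)]
    for r in range(height):
        for c in range(width):
            steps[r + c].append((r, c))
    return steps


# Toy 3 x 4 token grid (hypothetical size).
schedule = diagonal_schedule(3, 4)
sequential_steps = 3 * 4          # one step per token in raster order
diagonal_steps = len(schedule)    # one step per diagonal: height + width - 1
```

Under this assumption a frame needs height + width - 1 decoding steps instead of height x width, which is where a parallel speedup of this kind would come from.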

Title: Growing a Twig to Accelerate Large Vision-Language Models

Authors: Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14075
Pdf URL: https://arxiv.org/pdf/2503.14075
Copy Paste: [[2503.14075]] Growing a Twig to Accelerate Large Vision-Language Models(https://arxiv.org/abs/2503.14075)
Keywords: generation
Abstract: Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of the VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) a considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture built by growing a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods, which are purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Code will be made publicly available.

Title: Theoretical Foundation of Flow-Based Time Series Generation: Provable Approximation, Generalization, and Efficiency

Authors: Jiangxuan Long, Zhao Song, Chiwun Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14076
Pdf URL: https://arxiv.org/pdf/2503.14076
Copy Paste: [[2503.14076]] Theoretical Foundation of Flow-Based Time Series Generation: Provable Approximation, Generalization, and Efficiency(https://arxiv.org/abs/2503.14076)
Keywords: generation, generative
Abstract: Recent studies suggest utilizing generative models instead of traditional auto-regressive algorithms for time series forecasting (TSF) tasks. These non-auto-regressive approaches, which involve different generative methods including GAN, Diffusion, and Flow Matching for time series, have empirically demonstrated high-quality generation capability and accuracy. However, we still lack an appropriate understanding of how they achieve approximation and generalization. This paper presents the first theoretical framework from the perspective of flow-based generative models to address this gap in understanding. In particular, we provide our insights with strict guarantees from three perspectives: $\textbf{Approximation}$, $\textbf{Generalization}$ and $\textbf{Efficiency}$. In detail, our analysis makes the following contributions: $\bullet$ Assuming a general data model, the fitting of flow-based generative models is confirmed to converge to within an arbitrary error under the universal approximation of the Diffusion Transformer (DiT). $\bullet$ By introducing a polynomial-based regularization for flow matching, the generalization error can be bounded owing to the generalization of polynomial approximation. $\bullet$ Treating the sampling for generation as an optimization process, we demonstrate its fast convergence when updating with standard first-order gradient descent on some objective.

Title: Towards properties of adversarial image perturbations

Authors: Egor Kuznetsov, Kirill Aistov, Maxim Koroteev
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14111
Pdf URL: https://arxiv.org/pdf/2503.14111
Copy Paste: [[2503.14111]] Towards properties of adversarial image perturbations(https://arxiv.org/abs/2503.14111)
Keywords: restoration
Abstract: Using a stochastic gradient approach, we study the properties of adversarial perturbations resulting in noticeable growth of the VMAF image quality metric. The structure of the perturbations is investigated depending on the acceptable PSNR values and based on Fourier power spectrum computations for the perturbations. It is demonstrated that a moderate variation of image brightness ($\sim 10$ pixel units) in a restricted region of an image can result in VMAF growth by $\sim 60\%$. Unlike some other methods demonstrating similar VMAF growth, the subjective quality of an image remains almost unchanged. It is also shown that the adversarial perturbations may demonstrate approximately linear dependence of perturbation amplitudes on the image brightness. The perturbations are studied based on direct VMAF optimization in PyTorch. Significant discrepancies between the metric values and subjective judgements are also demonstrated when image restoration from noise is carried out using the same direct VMAF optimization.
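For intuition about the PSNR constraint mentioned above, the following sketch computes the standard PSNR, 10 log10(MAX^2 / MSE), for a brightness shift of 10 pixel units applied to a restricted region. The 16-pixel toy image, the choice of region, and the 8-bit peak value of 255 are assumptions made for illustration; none of them come from the paper, and VMAF itself is not computed here.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Standard PSNR in dB between two equally sized flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical toy image: 16 mid-gray pixels, 8-bit range assumed.
reference = [128.0] * 16

# Brighten only the first 4 pixels (a "restricted region") by 10 units.
perturbed = [p + 10.0 if i < 4 else p for i, p in enumerate(reference)]

value_db = psnr(reference, perturbed)
```

Here the mean squared error is 4 x 100 / 16 = 25, so the PSNR is 10 log10(255^2 / 25), roughly 34 dB. Shrinking the perturbed region raises the PSNR, which is one way a brightness change can stay within an acceptable PSNR budget.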

Title: Condensing Action Segmentation Datasets via Generative Network Inversion

Authors: Guodong Ding, Rongyu Chen, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14112
Pdf URL: https://arxiv.org/pdf/2503.14112
Copy Paste: [[2503.14112]] Condensing Action Segmentation Datasets via Generative Network Inversion(https://arxiv.org/abs/2503.14112)
Keywords: generative
Abstract: This work presents the first condensation approach for procedural video datasets used in temporal action segmentation (TAS). We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, with storage significantly reduced across temporal and channel aspects. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performance. Specifically, on the Breakfast dataset, our approach reduces storage by over 500$\times$ while retaining 83% of the performance compared to training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.

Title: Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Authors: Zining Wang, Tongkun Guan, Pei Fu, Chen Duan, Qianyi Jiang, Zhentao Guo, Shan Guo, Junfeng Luo, Wei Shen, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14140
Pdf URL: https://arxiv.org/pdf/2503.14140
Copy Paste: [[2503.14140]] Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding(https://arxiv.org/abs/2503.14140)
Keywords: generation
Abstract: Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at this https URL.

Title: Concat-ID: Towards Universal Identity-Preserving Video Synthesis

Authors: Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Chongxuan Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14151
Pdf URL: https://arxiv.org/pdf/2503.14151
Copy Paste: [[2503.14151]] Concat-ID: Towards Universal Identity-Preserving Video Synthesis(https://arxiv.org/abs/2503.14151)
Keywords: generation
Abstract: We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension, leveraging solely 3D self-attention mechanisms without the need for additional modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID's superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.

Title: Speculative Decoding for Verilog: Speed and Quality, All in One

Authors: Changran Xu, Yi Liu, Yunhao Zhou, Shan Huang, Ningyi Xu, Qiang Xu
Subjects: cs.LG, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14153
Pdf URL: https://arxiv.org/pdf/2503.14153
Copy Paste: [[2503.14153]] Speculative Decoding for Verilog: Speed and Quality, All in One(https://arxiv.org/abs/2503.14153)
Keywords: generation
Abstract: The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model's ability to capture Verilog's logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
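As background for the abstract above, here is a minimal sketch of generic greedy speculative decoding: a cheap draft model proposes a few tokens and the target model verifies them, keeping the longest agreeing prefix. It is not the paper's method (the syntax-aligned decoding stops for Verilog are not modeled), and both `target_next` and `draft_next` are hypothetical toy functions over integer tokens.

```python
def target_next(tokens):
    # Hypothetical "large" model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Hypothetical "small" model: agrees with the target except after a 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_decode(prompt, num_new, k=3):
    """Greedy speculative decoding with a draft window of k tokens.

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is one batched forward pass of the target.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < num_new:
        # 1) The draft model proposes k tokens autoregressively.
        context, draft = list(tokens), []
        for _ in range(k):
            proposal = draft_next(context)
            draft.append(proposal)
            context.append(proposal)
        # 2) The target verifies; its own choice is always the one kept.
        rounds += 1
        for proposal in draft:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposal:
                break  # first mismatch: discard the rest of the draft
    return tokens[: len(prompt) + num_new], rounds


output, rounds = speculative_decode([0], num_new=8)

# Plain greedy decoding with the target alone, for comparison.
greedy = [0]
for _ in range(8):
    greedy.append(target_next(greedy))
```

In this toy run the output matches plain greedy decoding of the target while needing fewer verification rounds than generated tokens, which is the basic source of speedup in speculative decoding.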

Title: RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation

Authors: Zhang Chen, Shuai Wan, Siyu Ren, Fuzheng Yang, Mengting Yu, Junhui Hou
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14154
Pdf URL: https://arxiv.org/pdf/2503.14154
Copy Paste: [[2503.14154]] RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation(https://arxiv.org/abs/2503.14154)
Keywords: quality assessment
Abstract: One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to build precise correspondences between point clouds, resulting in an ineffective capture of human perceptual features. To overcome these limitations, we propose a novel assessment method called RBFIM, utilizing radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into the feature function, we obtain the bijective sets of point features. This enables the establishment of precise corresponding features between distorted and original point clouds and significantly improves the accuracy of quality assessments. Moreover, this method avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
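The core idea above (fit a continuous feature function on the distorted cloud, then evaluate it at the original cloud's coordinates) can be sketched with a small Gaussian RBF interpolator. The kernel choice, the shape parameter `eps`, the toy point sets, and the scalar feature values are all assumptions for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def gaussian_kernel(p, q, eps=1.0):
    """Gaussian RBF between two 3D points (kernel choice is an assumption)."""
    return math.exp(-eps * sum((a - b) ** 2 for a, b in zip(p, q)))


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_rbf(points, features, eps=1.0):
    """Return a continuous feature function interpolating the given samples."""
    kernel = [[gaussian_kernel(p, q, eps) for q in points] for p in points]
    weights = solve(kernel, features)

    def feature_fn(x):
        return sum(w * gaussian_kernel(x, p, eps) for w, p in zip(weights, points))

    return feature_fn


# Hypothetical toy data: four distorted points with one scalar feature each.
distorted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
distorted_features = [0.2, 0.5, 0.9, 0.4]
original = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.1, 0.0), (0.0, 0.0, 0.9)]

feature_fn = fit_rbf(distorted, distorted_features)
# One resampled feature per original point, so the two sets pair up one-to-one.
resampled = [feature_fn(p) for p in original]
```

Because every original point receives exactly one interpolated feature, the comparison no longer depends on a nearest-neighbor search in either direction.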

Title: Decision Tree Induction Through LLMs via Semantically-Aware Evolution

Authors: Tennison Liu, Nicolas Huynh, Mihaela van der Schaar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14217
Pdf URL: https://arxiv.org/pdf/2503.14217
Copy Paste: [[2503.14217]] Decision Tree Induction Through LLMs via Semantically-Aware Evolution(https://arxiv.org/abs/2503.14217)
Keywords: generative
Abstract: Decision trees are a crucial class of models offering robust predictive performance and inherent interpretability across various domains, including healthcare, finance, and logistics. However, current tree induction methods often face limitations such as suboptimal solutions from greedy methods or prohibitive computational costs and limited applicability of exact optimization approaches. To address these challenges, we propose an evolutionary optimization method for decision tree induction based on genetic programming (GP). Our key innovation is the integration of semantic priors and domain-specific knowledge about the search space into the optimization algorithm. To this end, we introduce $\texttt{LLEGO}$, a framework that incorporates semantic priors into genetic search operators through the use of Large Language Models (LLMs), thereby enhancing search efficiency and targeting regions of the search space that yield decision trees with superior generalization performance. This is operationalized through novel genetic operators that work with structured natural language prompts, effectively utilizing LLMs as conditional generative models and sources of semantic knowledge. Specifically, we introduce $\textit{fitness-guided}$ crossover to exploit high-performing regions, and $\textit{diversity-guided}$ mutation for efficient global exploration of the search space. These operators are controlled by corresponding hyperparameters that enable a more nuanced balance between exploration and exploitation across the search space. Empirically, we demonstrate across various benchmarks that $\texttt{LLEGO}$ evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches.

Title: Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis

Authors: Yizhou Li, Yusuke Monno, Masatoshi Okutomi, Yuuichi Tanaka, Seiichi Kataoka, Teruaki Kosiba
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14219
Pdf URL: https://arxiv.org/pdf/2503.14219
Copy Paste: [[2503.14219]] Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis(https://arxiv.org/abs/2503.14219)
Keywords: generation
Abstract: Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.

Title: Quantization-Free Autoregressive Action Transformer

Authors: Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14259
Pdf URL: https://arxiv.org/pdf/2503.14259
Copy Paste: [[2503.14259]] Quantization-Free Autoregressive Action Transformer(https://arxiv.org/abs/2503.14259)
Keywords: generative
Abstract: Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.

Title: CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution

Authors: Runyi Li, Bin Chen, Jian Zhang, Radu Timofte
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14272
Pdf URL: https://arxiv.org/pdf/2503.14272
Copy Paste: [[2503.14272]] CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution(https://arxiv.org/abs/2503.14272)
Keywords: super-resolution
Abstract: Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.

Title: Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

Authors: Jiang Qin, Senmao Li, Alexandra Gomez-Villa, Shiqi Yang, Yaxing Wang, Kai Wang, Joost van de Weijer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14275
Pdf URL: https://arxiv.org/pdf/2503.14275
Copy Paste: [[2503.14275]] Free-Lunch Color-Texture Disentanglement for Stylized Image Generation(https://arxiv.org/abs/2503.14275)
Keywords: generation
Abstract: Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.

Title: Towards synthetic generation of realistic wooden logs

Authors: Fedor Zolotarev, Borek Reich, Tuomas Eerola, Tomi Kauppi, Pavel Zemcik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14277
Pdf URL: https://arxiv.org/pdf/2503.14277
Copy Paste: [[2503.14277]] Towards synthetic generation of realistic wooden logs(https://arxiv.org/abs/2503.14277)
Keywords: generation
Abstract: In this work, we propose a novel method to synthetically generate realistic 3D representations of wooden logs. Efficient sawmilling heavily relies on accurate measurement of logs and the distribution of knots inside them. Computed Tomography (CT) can be used to obtain accurate information about the knots but is often not feasible in a sawmill environment. A promising alternative is to utilize surface measurements and machine learning techniques to predict the inner structure of the logs. However, obtaining enough training data remains a challenge. We focus mainly on two aspects of log generation: the modeling of knot growth inside the tree, and the realistic synthesis of the surface, including the regions where the knots reach the surface. This results in the first log synthesis approach capable of generating both the internal knot and external surface structures of wood. We demonstrate that the proposed mathematical log model accurately fits real data obtained from CT scans and enables the generation of realistic logs.

Title: Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Authors: Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Samantha Work
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14286
Pdf URL: https://arxiv.org/pdf/2503.14286
Copy Paste: [[2503.14286]] Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs(https://arxiv.org/abs/2503.14286)
Keywords: generation, generative
Abstract: We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains when training a model both for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
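To make "asymmetric, tapered importance sampling" more concrete, here is a minimal sketch of one possible weighting scheme. The abstract does not give the formula, so the specific rule below is an assumption made for illustration: weight 1 for non-negative rewards, and an importance ratio clipped to [0, 1] for negative rewards. The names `tapered_weight` and `surrogate_loss` are hypothetical.

```python
import math

# Assumed form of an asymmetric, tapered importance weight. The abstract does
# not state the exact rule; this is an illustrative guess, not the paper's code.

def tapered_weight(logp_policy, logp_behavior, reward):
    """Return the weight applied to one sampled response.

    Non-negative reward: weight 1 (no importance correction).
    Negative reward: importance ratio pi/mu, clipped from above at 1.
    """
    if reward >= 0:
        return 1.0
    ratio = math.exp(logp_policy - logp_behavior)
    return min(ratio, 1.0)


def surrogate_loss(samples):
    """Average REINFORCE-style loss over (logp_policy, logp_behavior, reward).

    In an autodiff framework the weight would be treated as a constant
    (stop-gradient); here we only compute the scalar value.
    """
    samples = list(samples)
    total = sum(tapered_weight(lp, lb, r) * r * lp for lp, lb, r in samples)
    return -total / len(samples)


if __name__ == "__main__":
    batch = [(-1.0, -2.0, 1.0),   # positive example
             (-3.0, -2.0, -1.0)]  # negative example, policy less likely than behavior
    print(surrogate_loss(batch))
```

Capping the weight on negative examples bounds the size of each update, which is one way such a scheme could remain stable without KL regularization.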

Title: PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

Authors: Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14295
Pdf URL: https://arxiv.org/pdf/2503.14295
Copy Paste: [[2503.14295]] PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation(https://arxiv.org/abs/2503.14295)
Keywords: generation
Abstract: Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

Title: DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Authors: Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, Kaicheng Yu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14324
Pdf URL: https://arxiv.org/pdf/2503.14324
Copy Paste: [[2503.14324]] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies(https://arxiv.org/abs/2503.14324)
Keywords: generation
Abstract: The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
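The idea of separate codebooks for low-level and high-level features can be sketched as two independent nearest-neighbour lookups. This is a toy illustration under stated assumptions: the feature vectors, the codebook contents, and the names `nearest_code` and `dual_tokenize` are hypothetical, and the actual tokenizer architecture is not described in the abstract.

```python
# Toy sketch of quantizing with two separate codebooks. All names and values
# are hypothetical; this is not the DualToken implementation.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))


def dual_tokenize(low_feat, high_feat, pixel_codebook, semantic_codebook):
    """Quantize a low-level and a high-level feature independently and
    return the pair of token ids (pixel_id, semantic_id)."""
    return (nearest_code(low_feat, pixel_codebook),
            nearest_code(high_feat, semantic_codebook))


if __name__ == "__main__":
    pixel_codebook = [[0.0, 0.0], [1.0, 1.0]]                 # made-up entries
    semantic_codebook = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]  # made-up entries
    print(dual_tokenize([0.9, 0.8], [0.1, 0.9],
                        pixel_codebook, semantic_codebook))
```

Because each lookup only consults its own codebook, neither set of entries has to represent both kinds of information at once, which is the conflict the abstract describes.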

Title: LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Authors: Yu Cheng, Fajie Yuan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14325
Pdf URL: https://arxiv.org/pdf/2503.14325
Copy Paste: [[2503.14325]] LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models(https://arxiv.org/abs/2503.14325)
Keywords: generation
Abstract: Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at this https URL.

Title: EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

Authors: Yufei Zhu, Yiming Zhong, Zemin Yang, Peishan Cong, Jingyi Yu, Xinge Zhu, Yuexin Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14329
Pdf URL: https://arxiv.org/pdf/2503.14329
Copy Paste: [[2503.14329]] EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment(https://arxiv.org/abs/2503.14329)
Keywords: generation
Abstract: Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.

Title: Revealing higher-order neural representations with generative artificial intelligence

Authors: Hojjat Azimi Asrari, Megan A. K. Peters
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.14333
Pdf URL: https://arxiv.org/pdf/2503.14333
Copy Paste: [[2503.14333]] Revealing higher-order neural representations with generative artificial intelligence(https://arxiv.org/abs/2503.14333)
Keywords: generative
Abstract: Studies often aim to reveal how neural representations encode aspects of an observer's environment, such as its contents or structure. These are "first-order" representations (FORs), because they're "about" the external world. A less-common target is "higher-order" representations (HORs), which are "about" FORs -- their contents, stability, or uncertainty. HORs of uncertainty appear critically involved in adaptive behaviors including learning under uncertainty, influencing learning rates and internal model updating based on environmental feedback. However, HORs about uncertainty are unlikely to be direct "read-outs" of FOR characteristics, instead reflecting estimation processes which may be lossy, bias-prone, or distortive and which may also incorporate estimates of distributions of uncertainty the observer is likely to experience. While some research has targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents distributions of expected uncertainty remains largely unexplored. Here, we propose a novel reinforcement learning (RL) based generative artificial intelligence (genAI) approach to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, where humans learned to "de-noise" their brain states to achieve target neural patterns, to train denoising diffusion genAI models with RL algorithms to learn noise distributions similar to how humans might learn to do the same. We then explore these models' learned noise-distribution HORs compared to control models trained with traditional backpropagation. Results reveal model-dependent differences in noise distribution representations -- with the RL-based model offering much higher explanatory power for human behavior -- offering an exciting path towards using genAI to explore neural noise-distribution HORs.

Title: PENCIL: Long Thoughts with Short Memory

Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14337
Pdf URL: https://arxiv.org/pdf/2503.14337
Copy Paste: [[2503.14337]] PENCIL: Long Thoughts with Short Memory(https://arxiv.org/abs/2503.14337)
Keywords: generation
Abstract: While recent works (e.g. o1, DeepSeek R1) have demonstrated great promise of using long Chain-of-Thought (CoT) to improve reasoning capabilities of language models, scaling it up during test-time is challenging due to inefficient memory usage -- intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate PENCIL achieves 97% accuracy on the challenging Einstein's puzzle -- a task even large models like GPT-4 struggle with -- using only a small 25M-parameter transformer with 2048 context length. Theoretically, we prove PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.
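A toy sketch can show how a reduction step might keep the context short during generation. The abstract only states that a reduction mechanism cleans up intermediate thoughts. The marker tokens `[CALL]`, `[SEP]`, `[RETURN]` and the rule "context [CALL] thoughts [SEP] answer [RETURN] becomes context answer" are assumptions adopted here for illustration, and `reduce_context` and `generate` are hypothetical names.

```python
# Toy sketch of a context-reduction rule applied during generation.
# Marker tokens and the rule itself are illustrative assumptions.

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"


def reduce_context(tokens):
    """If the context ends with RETURN, erase the thoughts between the
    matching CALL and SEP, keep the answer, and drop all three markers."""
    if not tokens or tokens[-1] != RETURN:
        return list(tokens)
    sep_positions = [i for i, t in enumerate(tokens) if t == SEP]
    if not sep_positions:
        return list(tokens)
    sep = sep_positions[-1]
    call_positions = [i for i in range(sep) if tokens[i] == CALL]
    if not call_positions:
        return list(tokens)
    call = call_positions[-1]
    # Keep everything before CALL, then the answer between SEP and RETURN.
    return list(tokens[:call]) + list(tokens[sep + 1:-1])


def generate(stream):
    """Consume tokens one at a time, reducing eagerly, and track the
    largest context length reached."""
    context, peak = [], 0
    for tok in stream:
        context.append(tok)
        peak = max(peak, len(context))
        context = reduce_context(context)
    return context, peak


if __name__ == "__main__":
    stream = ["solve", "x", CALL, "try", "a", "fail", "try", "b",
              SEP, "x=b", RETURN]
    print(generate(stream))
```

Because reductions are applied as soon as a `[RETURN]` appears, nested calls are already collapsed by the time an outer call returns, so the context holds only the unresolved thoughts rather than the full history.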

Title: VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

Authors: Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14350
Pdf URL: https://arxiv.org/pdf/2503.14350
Copy Paste: [[2503.14350]] VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation(https://arxiv.org/abs/2503.14350)
Keywords: generation
Abstract: Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

Title: RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment

Authors: Chao Wang, Giulio Franzese, Alessandro Finamore, Pietro Michiardi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14358
Pdf URL: https://arxiv.org/pdf/2503.14358
Copy Paste: [[2503.14358]] RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment(https://arxiv.org/abs/2503.14358)
Keywords: generation
Abstract: Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.

Title: Impossible Videos

Authors: Zechen Bai, Hai Ci, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14378
Pdf URL: https://arxiv.org/pdf/2503.14378
Copy Paste: [[2503.14378]] Impossible Videos(https://arxiv.org/abs/2503.14378)
Keywords: generation
Abstract: Synthetic videos are nowadays widely used to complement data scarcity and diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today's video generation models effectively follow prompts to create impossible video content? 2) Are today's video understanding models good enough for understanding impossible videos? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.

Title: Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance

Authors: Lisha Li, Jingwen Hou, Weide Liu, Yuming Fang, Jiebin Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14402
Pdf URL: https://arxiv.org/pdf/2503.14402
Copy Paste: [[2503.14402]] Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance(https://arxiv.org/abs/2503.14402)
Keywords: generation
Abstract: Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.

Title: MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Authors: Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, Jie Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14428
Pdf URL: https://arxiv.org/pdf/2503.14428
Copy Paste: [[2503.14428]] MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation(https://arxiv.org/abs/2503.14428)
Keywords: generation
Abstract: Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) During the Conditioning Stage: We introduce the Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) During the Denoising Stage: We propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: this https URL.

Title: LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Authors: Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy
Subjects: cs.LG, cs.AI, cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2503.14434
Pdf URL: https://arxiv.org/pdf/2503.14434
Copy Paste: [[2503.14434]] LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers(https://arxiv.org/abs/2503.14434)
Keywords: generation
Abstract: Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
摘要：自动化功能工程在改善表格学习任务的预测模型性能中起着至关重要的作用。传统的自动化特征工程方法受到依赖于固定的手动设计搜索空间内预定义转换的限制，通常会忽略域知识。使用大型语言模型（LLM）的最新进展已使域知识集成到功能工程过程中。但是，现有的基于LLM的方法使用直接提示或仅依靠验证分数进行功能选择，无法利用先前功能发现实验中的见解或在功能生成和数据驱动性能之间建立有意义的推理。为了应对这些挑战，我们提出了LLM-FE，这是一个新颖的框架，将进化搜索与LLMS的领域知识和推理能力相结合，以自动发现为表格学习任务的有效功能。 LLM-FE将功能工程作为程序搜索问题制定，LLMS在其中迭代提出了新功能转换程序，并且数据驱动的反馈指导了搜索过程。我们的结果表明，LLM-FE始终优于最先进的基线，从而显着提高了各种分类和回归基准跨表格预测模型的性能。
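
The search loop described in the LLM-FE abstract (propose a feature transformation program, score it on data, feed the result back into the next proposal) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `CANDIDATES` pool and the `propose` function stand in for an LLM, and absolute Pearson correlation with the target stands in for the validation score. It is not the authors' implementation.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Hypothetical pool of feature programs; in the paper's setting an LLM writes these.
CANDIDATES = {
    "a_plus_b": lambda r: r["a"] + r["b"],
    "a_times_b": lambda r: r["a"] * r["b"],
    "a_minus_b": lambda r: r["a"] - r["b"],
    "a_squared": lambda r: r["a"] ** 2,
}


def propose(history, rng):
    """Stand-in for the LLM: return an untried program name, or None when exhausted.

    A real proposer would condition on the scored history; this stub only
    uses it to avoid repeating programs.
    """
    untried = [name for name in CANDIDATES if name not in history]
    return rng.choice(untried) if untried else None


def evolve(rows, target, steps=10, seed=0):
    """Iteratively propose feature programs and keep data-driven scores as feedback."""
    rng = random.Random(seed)
    history = {}  # program name -> score
    for _ in range(steps):
        name = propose(history, rng)
        if name is None:
            break
        feature = [CANDIDATES[name](row) for row in rows]
        history[name] = abs(pearson(feature, target))
    best = max(history, key=history.get)
    return best, history
```

Calling `evolve(rows, target)` with `rows` as a list of dicts holding keys `"a"` and `"b"` returns the best-scoring program name together with the full score history, which is the feedback a real proposer would read on the next iteration.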

Title: Graph-CNNs for RF Imaging: Learning the Electric Field Integral Equations

Authors: Kyriakos Stylianopoulos, Panagiotis Gavriilidis, Gabriele Gradoni, George C. Alexandropoulos
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.14439
Pdf URL: https://arxiv.org/pdf/2503.14439
Copy Paste: [[2503.14439]] Graph-CNNs for RF Imaging: Learning the Electric Field Integral Equations(https://arxiv.org/abs/2503.14439)
Keywords: generation
Abstract: Radio-Frequency (RF) imaging concerns the digital recreation of the surfaces of scene objects based on the scattered field at distributed receivers. To solve this difficult inverse scattering problem, data-driven methods are often employed that extract patterns from similar training examples, while offering minimal latency. In this paper, we first provide an approximate yet fast electromagnetic model, which is based on the electric field integral equations, for data generation, and subsequently propose a Deep Neural Network (DNN) architecture to learn the corresponding inverse model. A graph-attention backbone allows for the system geometry to be passed to the DNN, where residual convolutional layers extract features about the objects, while a UNet head performs the final image reconstruction. Our quantitative and qualitative evaluations on two synthetic data sets of different characteristics showcase the performance gains of the proposed advanced architecture and its relative resilience to signal noise levels and various reception configurations.
摘要：射频成像（RF）成像涉及基于分布式接收器处散射场的场景对象表面的数字娱乐。为了解决这一困难的反向散射问题，通常采用数据驱动的方法来从类似的训练示例中提取模式，同时提供最小的延迟。在本文中，我们首先提供了一个基于电场积分方程的大约但快速的电磁模型，用于数据生成，然后提出了深层神经网络（DNN）体系结构，以学习相应的逆模型。图形主干骨架允许将系统的几何形状传递到DNN，其中残留的卷积层提取了对象的特征，而UNET头则执行最终图像重建。我们对两个不同特征的综合数据集的定量和定性评估展示了您提出的高级体系结构的性能提高及其对信号噪声水平和各种接收配置的相对弹性。

Title: Bolt3D: Generating 3D Scenes in Seconds

Authors: Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14445
Pdf URL: https://arxiv.org/pdf/2503.14445
Copy Paste: [[2503.14445]] Bolt3D: Generating 3D Scenes in Seconds(https://arxiv.org/abs/2503.14445)
Keywords: generation, generative
Abstract: We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300.
摘要：我们为快速馈送3D场景生成提供了潜在扩散模型。给定一个或多个图像，我们的模型Bolt3D在单个GPU上不到七秒钟内直接采样3D场景表示。我们通过利用强大而可扩展的现有2D扩散网络体系结构来产生一致的高保真3D场景表示形式来实现这一目标。为了训练该模型，我们通过将最新的密集3D重建技术应用于现有的多视图像数据集，创建一个大规模的多视频符合数据集的3D几何和外观。与先前需要每次曲子优化3D重建的多视生成模型相比，Bolt3D将推断成本降低了300倍。

Title: SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

Authors: Yucheng Mao, Boyang Wang, Nilesh Kulkarni, Jeong Joon Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14463
Pdf URL: https://arxiv.org/pdf/2503.14463
Copy Paste: [[2503.14463]] SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model(https://arxiv.org/abs/2503.14463)
Keywords: restoration, super-resolution
Abstract: The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
摘要：计算机视觉社区开发了许多技术，用于从单视图退化的照片中恢复真实的场景信息，这是一项重要但非常不适的任务。在这项工作中，我们从不同的角度解决了图像恢复，通过共同降低同一场景的多张照片。我们的核心假设是，捕获共享场景的降级图像包含互补信息，这些信息在组合后会更好地限制恢复问题。为此，我们实施了一个强大的多视图扩散模型，该模型通过从多视图关系中提取丰富的信息共同生成未腐败的视图。我们的实验表明，我们的多视图方法的表现优于现有的单视图像，甚至优于图像脱张和超分辨率任务的基于视频的方法。至关重要的是，我们的模型经过训练可以输出3D一致的图像，使其成为需要强大的多视图集成的应用程序，例如3D重建或姿势估计。

Title: Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14478
Pdf URL: https://arxiv.org/pdf/2503.14478
Copy Paste: [[2503.14478]] Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM(https://arxiv.org/abs/2503.14478)
Keywords: generative
Abstract: Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at this https URL.
摘要：创造力是智力的一个基本方面，涉及在各种环境中生成新颖和适当的解决方案的能力。尽管大型语言模型（LLM）已被广泛评估其创造力，但该领域中多模式大语言模型（MLLM）的评估仍然很大程度上尚未探索。为了解决这一差距，我们介绍了Creation-Mmbench，这是一种专门旨在评估MLLM在现实世界中基于图像的任务中的创意能力的多模式基准。基准包括765个测试用例，涵盖51个细粒度的任务。为了确保严格的评估，我们为每个测试案例定义了特定实例的评估标准，指导评估一般响应质量和与视觉输入的事实一致性。实验结果表明，与创意任务中的专有模型相比，当前的开源MLLM的表现明显不足。此外，我们的分析表明，视觉微调会对基本LLM的创造力产生负面影响。 Creation-Mmbench为推进MLLM的创造力提供了宝贵的见解，并为多模式生成智能的未来改进奠定了基础。完整的数据和评估代码在此HTTPS URL上发布。

Title: ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

Authors: Yulin Pan, Xiangteng He, Chaojie Mao, Zhen Han, Zeyinzi Jiang, Jingfeng Zhang, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14482
Pdf URL: https://arxiv.org/pdf/2503.14482
Copy Paste: [[2503.14482]] ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing(https://arxiv.org/abs/2503.14482)
Keywords: generation
Abstract: Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images. We further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
摘要：在过去的几年中，图像产生见证了显着的进步。但是，评估图像生成模型的性能仍然是一个巨大的挑战。在本文中，我们提出了Ice-Bench，这是一种统一且全面的基准测试，旨在严格评估图像生成模型。它的全面性可以在以下关键特征中进行总结：（1）粗到十个任务：我们根据存在或不存在源图像和参考图像，系统地将图像生成分为四个任务类别：NO-REF/REF图像创建/编辑。并将它们进一步分解为31个细粒度的任务，涵盖了一系列图像生成要求，最终以全面的基准为基准。（2）多维指标：评估框架评估6个维度的图像产生能力：美学质量，成像质量，及时关注，源一致性，参考一致性和可控性。引入11个指标以支持多维评估。值得注意的是，我们引入了VLLM-QA，这是一种创新的度量，旨在通过利用大型模型来评估图像编辑的成功。（3）混合数据：数据来自真实的场景和虚拟生成，这有效地改善了数据多样性并减轻模型评估中的偏见问题。通过冰台，我们对现有生成模型进行了彻底的分析，揭示了基准的挑战性质以及当前模型功能和现实世界一代需求之间的差距。为了促进该领域的进一步进步，我们将开放源冰台，包括其数据集，评估代码和模型，从而为研究社区提供宝贵的资源。

Title: DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

Authors: Minglei Shi, Ziyang Yuan, Haotian Yang, Xintao Wang, Mingwu Zheng, Xin Tao, Wenliang Zhao, Wenzhao Zheng, Jie Zhou, Jiwen Lu, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14487
Pdf URL: https://arxiv.org/pdf/2503.14487
Copy Paste: [[2503.14487]] DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers(https://arxiv.org/abs/2503.14487)
Keywords: generation
Abstract: Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: this https URL
摘要：扩散模型在各种图像生成任务中都取得了显着的成功，但是它们的性能通常受到跨不同条件和噪声水平的输入的统一处理的限制。为了解决这一局限性，我们提出了一种利用扩散过程固有异质性的新方法。我们的方法diffmoe引入了批处理级的全球令牌池，使专家能够在培训期间访问全球令牌分布，从而促进专业的专家行为。为了释放扩散过程的全部潜力，DIFFMOE结合了一个能力预测因子，该预测因子根据噪声水平和样品复杂性动态分配计算资源。通过全面的评估，DIFFMOE在Imagenet基准上的扩散模型之间实现了最先进的性能，在维持1X激活的参数的同时，具有3倍激活的参数和现有的MOE方法，大大优于两个密集体系结构。我们方法的有效性扩展到了课堂条件的生成，到更具挑战性的任务，例如文本到图像生成，证明了其在不同扩散模型应用中的广泛适用性。项目页面：此HTTPS URL

Title: Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Authors: Jensen (Jinghao) Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14489
Pdf URL: https://arxiv.org/pdf/2503.14489
Copy Paste: [[2503.14489]] Stable Virtual Camera: Generative View Synthesis with Diffusion Models(https://arxiv.org/abs/2503.14489)
Keywords: generative
Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.
摘要：我们提出了稳定的虚拟摄像头（SEVA），这是一种通才扩散模型，它在给定许多输入视图和目标摄像机的情况下创建了场景的新视图。在依靠特定的任务配置的同时，现有作品难以生成大型观点更改或暂时平滑的样本。我们的方法通过简单的模型设计，优化的培训配方和灵活的抽样策略来克服这些限制，这些培训策略在测试时遍布查看综合任务。结果，我们的样品保持较高的一致性，而无需基于3D表示的蒸馏，从而简化了野生中的视图合成。此外，我们表明我们的方法可以生成高质量的视频，并使用无缝循环闭合持续时间长达半分钟。广泛的基准测试表明，SEVA胜过不同数据集和设置的现有方法。

Title: Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

Authors: NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14492
Pdf URL: https://arxiv.org/pdf/2503.14492
Copy Paste: [[2503.14492]] Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control(https://arxiv.org/abs/2503.14492)
Keywords: generation
Abstract: We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
摘要：我们介绍了Cosmos-Transfer，这是一个有条件的世界一代模型，可以基于各种模式的多种空间控制输入（例如分割，深度和边缘）生成世界模拟。在设计中，空间条件方案是自适应和可定制的。它允许在不同的空间位置加权不同的条件输入。这使得高度可控的世界一代，并在包括SIM2Real在内的各种世界到世界转移用例中找到了使用。我们进行了广泛的评估，以分析提出的模型并证明其在物理AI中的应用，包括机器人SIM2REAL和自动驾驶汽车数据富集。我们进一步展示了通过NVIDIA GB200 NVL72机架实现实时世界的推理缩放策略。为了帮助加速该领域的研究开发，我们在此HTTPS URL上开源的模型和代码。
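
The idea of weighting different conditional inputs differently at different spatial locations can be illustrated with a small sketch. This is a toy, per-pixel weighted average over scalar maps, written under stated assumptions: the function name `fuse_controls`, the dict-of-2D-lists layout, and the normalization of weights to sum to one at each location are all hypothetical and are not taken from the Cosmos-Transfer1 code.

```python
def fuse_controls(control_maps, weight_maps, eps=1e-8):
    """Blend per-modality control maps with per-location weights.

    control_maps, weight_maps: dict mapping a modality name (e.g. "depth",
    "edge") to an H x W nested list of floats. Weights are normalized per
    location (an assumption of this sketch); locations whose weights sum
    to (near) zero are left at 0.0.
    """
    names = list(control_maps)
    height = len(control_maps[names[0]])
    width = len(control_maps[names[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = sum(weight_maps[n][i][j] for n in names)
            if total <= eps:
                continue  # no modality is active at this location
            fused[i][j] = sum(
                weight_maps[n][i][j] * control_maps[n][i][j] for n in names
            ) / total
    return fused
```

Setting a modality's weight to zero in a region removes its influence there, while equal weights average the modalities, which is the kind of location-dependent control the abstract describes.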

Title: Deeply Supervised Flow-Based Generative Models

Authors: Inkyu Shin, Chenglin Yang, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14494
Pdf URL: https://arxiv.org/pdf/2503.14494
Copy Paste: [[2503.14494]] Deeply Supervised Flow-Based Generative Models(https://arxiv.org/abs/2503.14494)
Keywords: generation, generative
Abstract: Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
摘要：基于流量的生成模型通过遵守一个简单的原理来绘制多个视觉生成任务的令人印象深刻的路径：线性插值的学习速度表示。但是，我们观察到，训练速度仅来自最终层输出的训练速度，使富层的表示形式不足，可能会阻碍模型收敛。为了解决此限制，我们引入了深流，这是一个新颖的框架，可通过间层通信增强速度表示。 DeepFlow分区将变压器层带入平衡的分支，并具有深度监督，并在相邻分支之间插入带有加速度（VERA）块的轻质速度炼油厂，该分支在变压器块中与中间速度特征对齐。通过内部速度对齐的改进的深度监督支持，DeepFlow在Imagenet上的收敛速度快8倍，并且与没有分类器免费指导的先前基于流动的模型相比，与以前的基于流动的模型相比，训练时间比较减少了2.6。 DeepFlow在文本中的表现也优于图像生成任务，这可以通过对MSCOCO和零射击neneval的评估来证明。
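
The "simple principle" the DeepFlow abstract refers to, learning the velocity of a linear interpolant, can be written out in a few lines. The sketch below shows only this standard flow-matching target, not DeepFlow's branches or its VeRA block. It assumes the convention that `x0` is a noise sample and `x1` a data sample (some papers use the reverse), and all function names are illustrative.

```python
def linear_interpolant(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear interpolant: constant and equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def velocity_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)
```

In training, a network would receive `linear_interpolant(x0, x1, t)` together with `t` and be penalized with `velocity_loss` on its output; the abstract's observation is that this supervision is conventionally applied only at the final layer.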

Title: Advances in 4D Generation: A Survey

Authors: Qiaowei Miao, Kehan Li, Jinsheng Quan, Zhiyuan Min, Shaojie Ma, Yichao Xu, Yi Yang, Yawei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14501
Pdf URL: https://arxiv.org/pdf/2503.14501
Copy Paste: [[2503.14501]] Advances in 4D Generation: A Survey(https://arxiv.org/abs/2503.14501)
Keywords: generation, generative
Abstract: Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at: this https URL.
摘要：近年来，生成的人工智能见证了多个领域的显着进步。以2D和3D内容产生的成功为基础，将时间维度纳入生成任务的4D代基础已成为一个新兴而迅速发展的研究领域。本文对这个新兴领域进行了全面的调查，系统地研究了其理论基础，关键方法和实际应用，以便为读者提供对4D代的当前状态和未来潜力的全面了解。我们首先介绍4D数据表示的核心概念，包括结构化和非结构化格式及其对生成任务的影响。在这个基础的基础上，我们深入研究了驱动4D代的能力技术，包括时空建模的进步，神经代表和生成框架。我们进一步审查了最近的研究，该研究采用各种控制机制和表示策略来产生4D输出，对这些方法进行分类并总结其研究轨迹。此外，我们探讨了4D代技术的广泛应用，跨越动态对象建模，场景生成，数字人类合成，4D内容编辑和自动驾驶。最后，我们分析了4D代固有的关键挑战，例如数据可用性，计算效率和时空一致性，并提出了有希望的未来研究方向。我们的代码可公开可用：\ href {this HTTPS url} {此https url}。

Title: The Power of Context: How Multimodality Improves Image Super-Resolution

Authors: Kangfu Mei, Hossein Talebi, Mojtaba Ardakani, Vishal M. Patel, Peyman Milanfar, Mauricio Delbracio
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14503
Pdf URL: https://arxiv.org/pdf/2503.14503
Copy Paste: [[2503.14503]] The Power of Context: How Multimodality Improves Image Super-Resolution(https://arxiv.org/abs/2503.14503)
Keywords: super-resolution, generative
Abstract: Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.
摘要：由于恢复细粒细节并从低分辨率输入中保留感知质量的固有困难，单片图像超分辨率（SISR）仍然具有挑战性。现有方法通常依赖有限的图像先验，从而导致次优结果。我们提出了一种新颖的方法，该方法利用多种方式可用的丰富上下文信息（包括深度，细分，边缘和文本提示），以在扩散模型框架内学习SISR的强大生成性。我们介绍了一个灵活的网络体系结构，该架构有效地融合了多模式信息，可容纳任意数量的输入方式，而无需对扩散过程进行重大修改。至关重要的是，我们通过使用来自其他模式的空间信息来指导基于区域文本的调节来减轻通常由文本提示引入的幻觉。每种模式的指导强度也可以独立控制，从而使转向输出向不同的方向转向，例如通过深度增加散景或通过分割调整对象突出。广泛的实验表明，我们的模型超过了最先进的生成SISR方法，可实现出色的视觉质量和忠诚度。请参阅此HTTPS URL的项目页面。

Title: MusicInfuser: Making Video Diffusion Listen and Dance

Authors: Susung Hong, Ira Kemelmacher-Shlizerman, Brian Curless, Steven M. Seitz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14505
Pdf URL: https://arxiv.org/pdf/2503.14505
Copy Paste: [[2503.14505]] MusicInfuser: Making Video Diffusion Listen and Dance(https://arxiv.org/abs/2503.14505)
Keywords: generation, generative
Abstract: We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at this https URL.
摘要：我们介绍了MusicInfuser，这是一种生成高质量舞蹈视频的方法，该视频与指定的音乐曲目同步。我们没有尝试设计和训练新的多式联运音频视频模型，而是通过引入轻量级的音乐录像交叉注意力和低级适配器来展示现有的视频扩散模型如何与音乐输入相一致。与需要运动捕获数据的先前工作不同，我们的方法仅在舞蹈视频上进行微调。 MusicInfuser实现了高质量的音乐驱动视频，同时保留了基础模型的灵活性和生成能力。我们介绍了一个使用视频LLM的评估框架来评估舞蹈产生质量的多个维度。该项目页面和代码可在此HTTPS URL上找到。