2025-03-21

Title: GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction

Authors: Tung Sum Thomas Kwok, Chi-Hua Wang, Guang Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15564
Pdf URL: https://arxiv.org/pdf/2503.15564
Copy Paste: [[2503.15564]] GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction(https://arxiv.org/abs/2503.15564)
Keywords: generation
Abstract: Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
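Summary: Tabular data synthesis covers not only multi-table synthesis but also multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. Handling numerical and categorical data separately, however, has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses large language models (LLMs) to encode entire rows, removing the need to partition data by type. Its performance is still constrained by two issues: (1) tabular entries lack sufficient semantic meaning, which limits the LLM's ability to use pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction) addresses both with (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method that establishes efficient relationships across complex tables. Experiments show that GReaTER outperforms the GReaT framework.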
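The row-encoding idea in the abstract above can be illustrated with a small sketch. It assumes a "column is value" sentence format for rows and a plain lookup table for the semantic mapping; the function names, the example columns, and the `label_maps` dictionary are hypothetical and are not taken from the GReaTER code.

```python
import random

def enhance(row, label_maps):
    """Replace opaque codes with readable labels wherever a mapping is known (hypothetical helper)."""
    return {col: label_maps.get(col, {}).get(val, val) for col, val in row.items()}

def serialize_row(row, rng=None):
    """Turn one table row into a sentence such as 'age is 39, sex is female'."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)  # optional random column order
    return ", ".join(f"{col} is {val}" for col, val in items)

def parse_row(text):
    """Invert serialize_row. Values come back as strings; assumes no value contains ', ' or ' is '."""
    pairs = (part.split(" is ", 1) for part in text.split(", "))
    return {col: val for col, val in pairs}

# Illustrative data only: the codes "F" and "D07" carry little meaning for a language model.
row = {"age": 39, "sex": "F", "dept": "D07"}
label_maps = {"sex": {"F": "female", "M": "male"}, "dept": {"D07": "cardiology"}}

enhanced = enhance(row, label_maps)
sentence = serialize_row(enhanced)
print(sentence)
```

Mapping codes to words before serialization is one plausible reading of "semantic enhancement through mapping"; the cross-table connecting method is not sketched here because the abstract gives no detail on it.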

Title: Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling

Authors: Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Kenji Kawaguchi, Tat-Seng Chua, Xiang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15567
Pdf URL: https://arxiv.org/pdf/2503.15567
Copy Paste: [[2503.15567]] Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling(https://arxiv.org/abs/2503.15567)
Keywords: generation
Abstract: 3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer, a general-purpose diffusion model without any molecular inductive bias, for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method establishes new benchmarks in both de novo and conditional 3D molecule generation, achieving leading efficiency and quality.
Summary: 3D molecule generation is crucial for drug discovery and materials science and requires models to handle complex multi-modal data: atom types, chemical bonds, and 3D coordinates. A key challenge is integrating modalities of different shapes while maintaining SE(3) equivariance for the 3D coordinates. Existing approaches typically keep separate latent spaces for invariant and equivariant modalities, which reduces training and sampling efficiency. This work proposes the Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a single unified latent space with near-zero reconstruction error. The unified latent space removes the complexity of handling multi-modality and equivariance during latent diffusion modeling, which the authors demonstrate by using the Diffusion Transformer, a general-purpose diffusion model with no molecular inductive bias, for latent generation. Extensive experiments on GEOM-Drugs and QM9 show new benchmarks in both de novo and conditional 3D molecule generation, with leading efficiency and quality.

Title: CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Authors: Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15617
Pdf URL: https://arxiv.org/pdf/2503.15617
Copy Paste: [[2503.15617]] CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation(https://arxiv.org/abs/2503.15617)
Keywords: generation
Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (approximately 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (approximately 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: this https URL
Summary: Traditional transformer-based semantic segmentation relies on quantized embeddings. The authors' analysis shows, however, that autoencoder accuracy on segmentation masks is 8% lower with quantized embeddings (e.g., VQ-VAE) than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, they propose a continuous-valued embedding framework for semantic segmentation. Reformulating semantic mask generation as a continuous image-to-embedding diffusion process removes the need for discrete latent representations while preserving fine-grained spatial and semantic detail. The key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. The unified architecture combines a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. The continuity of the embedding space enables zero-shot domain adaptation. Experiments on diverse datasets (e.g., Cityscapes and domain-shifted variants) show state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint changes. The model is also resilient to noise: it retains approximately 95% AP relative to baseline under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, and approximately 90% AP relative to baseline under 50% salt-and-pepper noise and saturation and hue shifts. Code available: this https URL

Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Authors: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2503.15621
Pdf URL: https://arxiv.org/pdf/2503.15621
Copy Paste: [[2503.15621]] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning(https://arxiv.org/abs/2503.15621)
Keywords: generation
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs -- including Phi-4, LLaMA-3.1, and Gemma-2 -- to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: this https URL.
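Summary: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. Prior work has focused mainly on scaling these components to billions of parameters, while the trade-offs between model size, architecture, and performance remain underexplored. Inconsistent training data and evaluation protocols have also hindered direct comparisons, making it hard to derive optimal design choices. This paper introduces LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones, trained under a unified protocol applied consistently across all architectures to ensure fair comparison. The analysis systematically covers small- and medium-scale LLMs (including Phi-4, LLaMA-3.1, and Gemma-2) to evaluate multimodal reasoning, generation, and instruction following, and examines the relationship between model size and performance. Beyond the impact of the LLM, the authors study a range of visual encoders, from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate increased image resolution and variations in pre-training datasets. Overall, the results offer insights into designing more effective MLLMs and a reproducible evaluation framework that supports direct comparisons and can guide future model development. Source code and trained models are publicly available at: this https URL.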

Title: Transport-Related Surface Detection with Machine Learning: Analyzing Temporal Trends in Madrid and Vienna

Authors: Miguel Ureña Pliego, Rubén Martínez Marín, Nianfang Shi, Takeru Shibayama, Ulrich Leth, Miguel Marchamalo Sacristán
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15653
Pdf URL: https://arxiv.org/pdf/2503.15653
Copy Paste: [[2503.15653]] Transport-Related Surface Detection with Machine Learning: Analyzing Temporal Trends in Madrid and Vienna(https://arxiv.org/abs/2503.15653)
Keywords: generation
Abstract: This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images that predate the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. Municipal governments can use this technique to gather valuable data at minimal cost.
Summary: This study explores the integration of machine learning into urban aerial image analysis, focusing on identifying infrastructure surfaces for cars and pedestrians and on analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models and their potential for global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, so that semantic segmentation datasets can be built from sources including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The code enables fast dataset generation for training machine learning models from openly available data without manual labelling. Using aerial imagery and vectorial data from the geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city and showed good accuracy. For the historical trend analysis, the trained model was applied to earlier images that predate the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in pedestrian and car infrastructure across different city areas. Municipal governments can use the technique to gather valuable data at minimal cost.

Title: DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

Authors: Yuming Gu, Phong Tran, Yujian Zheng, Hongyi Xu, Heyuan Li, Adilbek Karmanov, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15667
Pdf URL: https://arxiv.org/pdf/2503.15667
Copy Paste: [[2503.15667]] DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis(https://arxiv.org/abs/2503.15667)
Keywords: generation
Abstract: Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.
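Summary: Generating high-quality 360-degree views of human heads from single-view images is essential for accessible immersive telepresence applications and scalable personalized content creation. Cutting-edge methods for full head generation are limited to modeling realistic human heads, while the latest diffusion-based approaches for style-omniscient head synthesis produce only frontal views and struggle with view consistency, which prevents their conversion into true 3D models for rendering from arbitrary angles. The authors introduce an approach that generates fully consistent 360-degree head views and accommodates human, stylized, and anthropomorphic forms, including accessories such as glasses and hats. The method builds on the DiffPortrait3D framework, adding a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. Training on continuous view sequences and integrating a back reference image yields robust, locally continuous view synthesis. The model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, and it outperforms state-of-the-art methods in object synthesis and 360-degree head generation on very challenging input portraits.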

Title: The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

Authors: Benidir Yanis, Gonthier Nicolas, Mallet Clement
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15683
Pdf URL: https://arxiv.org/pdf/2503.15683
Copy Paste: [[2503.15683]] The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation(https://arxiv.org/abs/2503.15683)
Keywords: generation, generative
Abstract: Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data are available here: this https URL.
Summary: Bi-temporal change detection at scale from Very High Resolution (VHR) images is crucial for Earth monitoring, yet it remains poorly addressed: methods either require large volumes of annotated data (the semantic case) or are limited to restricted datasets (binary set-ups). Most approaches lack the versatility needed for temporal and spatial adaptation, namely simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. The paper presents HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real and inpainted VHR images, together with land cover semantic maps at both dates and the change map. Because it is semantically and spatially guided, HySCDG generates realistic images, leading to the comprehensive, hybrid, transfer-proof dataset FSC-180k. FSC-180k is evaluated on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and under low data regime training. Experiments show that pretraining on the hybrid dataset yields a significant performance boost and outperforms SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data are available here: this https URL.

Title: Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Authors: Jiaqi Liu, Jichao Zahng, Paolo Rota, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15686
Pdf URL: https://arxiv.org/pdf/2503.15686
Copy Paste: [[2503.15686]] Multi-focal Conditioned Latent Diffusion for Person Image Synthesis(https://arxiv.org/abs/2503.15686)
Keywords: generation
Abstract: The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at this https URL.
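Summary: The Latent Diffusion Model (LDM) has shown strong capabilities in high-resolution image generation and has been widely used for Pose-Guided Person Image Synthesis (PGPIS), with promising results. The compression process of the LDM, however, often degrades detail, particularly in sensitive areas such as facial features and clothing textures. The paper proposes a Multi-focal Conditioned Latent Diffusion (MCLD) method that addresses these limitations by conditioning the model on disentangled, pose-invariant features from those sensitive regions. The approach uses a multi-focal condition aggregation module that integrates facial identity and texture-specific information, improving the model's ability to produce images with realistic appearance and consistent identity. The method demonstrates consistent identity and appearance generation on the DeepFashion dataset, and its generation consistency enables flexible person image editing. The code is available at this https URL.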

Title: Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

Authors: Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15742
Pdf URL: https://arxiv.org/pdf/2503.15742
Copy Paste: [[2503.15742]] Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes(https://arxiv.org/abs/2503.15742)
Keywords: generative
Abstract: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
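Summary: Reconstructing 3D scenes from a single image is a fundamentally ill-posed task because the problem is severely under-constrained. As a result, existing single-image-to-3D reconstruction methods render incoherent and blurry views from novel camera poses, and the problem worsens when the unseen regions are far from the input camera. This work addresses these inherent limitations of existing single-image-to-3D scene feedforward networks. To compensate for the lack of information beyond the input image's view, the authors use a strong generative prior, a pre-trained latent video diffusion model, to iteratively refine a coarse scene represented by optimizable Gaussian parameters. To keep the style and texture of the generated images aligned with the input image, they apply on-the-fly Fourier-style transfer between the generated images and the input image. They also design a semantic uncertainty quantification module that computes per-pixel entropy and produces uncertainty maps, which guide the refinement from the most confident pixels while discarding the remaining highly uncertain ones. Extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, show more realistic and higher-fidelity novel view synthesis than existing state-of-the-art methods.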
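The per-pixel entropy step mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming each pixel already carries a class probability vector and that "confident" means entropy below a fixed threshold; where the probabilities come from, the threshold value, and the function names are assumptions, not details from the paper.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_mask(prob_map, max_entropy):
    """True where a pixel is confident enough to guide refinement, False where it is discarded."""
    return [[pixel_entropy(p) <= max_entropy for p in row] for row in prob_map]

# A 2x2 "image" with three classes per pixel (illustrative values).
prob_map = [
    [[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3]],
    [[0.9, 0.05, 0.05], [0.5, 0.5, 0.0]],
]

mask = confidence_mask(prob_map, max_entropy=0.5)
for row in mask:
    print(row)
```

A one-hot pixel has zero entropy and a uniform pixel has the maximum, log(3) for three classes, so the threshold separates the two sharp pixels from the two ambiguous ones.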

Title: ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Authors: Venmugil Elango
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2503.15758
Pdf URL: https://arxiv.org/pdf/2503.15758
Copy Paste: [[2503.15758]] ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism(https://arxiv.org/abs/2503.15758)
Keywords: generation
Abstract: Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows the model to capture intricate dependencies within the data. However, the self-attention mechanism also incurs significant computational and memory costs, particularly for long sequences. In this paper, we introduce ATTENTION2D, a novel approach that exploits parallelism along two dimensions - query and key/value - of the self-attention operation. This method enables efficient distribution and parallelization of computations across multiple devices. Our approach facilitates asymptotically faster training and inference phases compared to previous methods, without relying on approximations or incurring additional computational or memory overheads. Furthermore, unlike existing techniques that struggle to scale with an increasing number of processing units, our approach effectively scales with additional processing units. Our experimental results confirm the effectiveness of our method in improving communication efficiency and scalability. Compared to Ring Attention, our approach demonstrated up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.
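Summary: Transformer-based models have become a leading architecture for natural language processing, natural language generation, and image generation. A fundamental element of the transformer is self-attention, which lets the model capture intricate dependencies in the data but also incurs significant computational and memory costs, particularly for long sequences. The paper introduces ATTENTION2D, an approach that exploits parallelism along two dimensions of the self-attention operation, query and key/value, enabling efficient distribution and parallelization of the computation across multiple devices. Compared with previous methods, it gives asymptotically faster training and inference without relying on approximations or incurring additional computational or memory overheads. Unlike existing techniques, which struggle to scale as the number of processing units grows, it scales effectively with additional processing units. Experiments confirm improved communication efficiency and scalability: compared to Ring Attention, the approach showed up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.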
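To make "parallelism along the query and key/value dimensions" concrete, the sketch below tiles one attention computation over a grid of (query block, key/value block) cells and merges the cells exactly. It is a single-process illustration of the arithmetic only: the softmax merge is the standard running-max rescaling trick, and nothing here reproduces the paper's communication scheme, which the abstract does not describe. All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split(rows, parts):
    """Split rows into equal contiguous blocks (len(rows) must be divisible by parts)."""
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]

def partial_attention(q_block, k_block, v_block):
    """Work of one grid cell: unnormalized attention of a query block over one key/value block."""
    scale = 1.0 / math.sqrt(len(q_block[0]))
    cell = []
    for q in q_block:
        scores = [dot(q, k) * scale for k in k_block]
        m = max(scores)  # local max, kept for numerical stability
        weights = [math.exp(s - m) for s in scores]
        acc = [sum(w * v[d] for w, v in zip(weights, v_block))
               for d in range(len(v_block[0]))]
        cell.append((m, sum(weights), acc))
    return cell

def merge(partials):
    """Combine the (max, sum, accumulator) triples of one query row across key/value blocks."""
    m_all = max(m for m, _, _ in partials)
    total = sum(s * math.exp(m - m_all) for m, s, _ in partials)
    dim = len(partials[0][2])
    return [sum(acc[d] * math.exp(m - m_all) for m, _, acc in partials) / total
            for d in range(dim)]

def attention_2d(q, k, v, q_parts, kv_parts):
    """Softmax attention computed cell by cell on a q_parts x kv_parts grid."""
    k_blocks, v_blocks = split(k, kv_parts), split(v, kv_parts)
    out = []
    for q_block in split(q, q_parts):
        cells = [partial_attention(q_block, kb, vb) for kb, vb in zip(k_blocks, v_blocks)]
        for r in range(len(q_block)):
            out.append(merge([cell[r] for cell in cells]))
    return out

rng = random.Random(0)
q = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
k = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
v = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]

full = attention_2d(q, k, v, 1, 1)   # a 1x1 grid is ordinary attention
tiled = attention_2d(q, k, v, 2, 3)  # a 2x3 grid of independent cells
max_diff = max(abs(a - b) for fr, tr in zip(full, tiled) for a, b in zip(fr, tr))
print(max_diff)
```

Each cell depends only on its own query block and key/value block, so the cells could run on separate devices; only the small per-row triples need to be reduced along the key/value dimension. The result matches unpartitioned attention up to floating-point rounding, which is consistent with the abstract's claim of not relying on approximations.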

Title: AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models

Authors: Boshra Khalili, Andrew W. Smyth
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.15778
Pdf URL: https://arxiv.org/pdf/2503.15778
Copy Paste: [[2503.15778]] AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models(https://arxiv.org/abs/2503.15778)
Keywords: generation
Abstract: In autonomous driving, open-ended question answering often suffers from unreliable evaluations because freeform responses require either complex metrics or subjective human judgment. To address this challenge, we introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. This benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. AutoDrive-QA employs an automated pipeline that leverages large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns commonly found in autonomous driving scenarios. To evaluate both general capabilities and generalization performance, we test the benchmark on three public datasets and conduct zero-shot experiments on an unseen dataset. The zero-shot evaluations reveal that GPT-4V leads with 69.57% accuracy -- achieving 74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning -- demonstrating that while all models excel in Perception, they struggle in Prediction. Consequently, AutoDrive-QA establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across various autonomous driving datasets, thereby improving generalization in this field. We release all the codes in the AutoDrive-QA GitHub Repository.
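Summary: In autonomous driving, open-ended question answering often suffers from unreliable evaluation, because free-form responses require either complex metrics or subjective human judgment. To address this, the authors introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. The benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. The pipeline uses large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns common in autonomous driving scenarios. To evaluate both general capability and generalization, the benchmark is tested on three public datasets, with zero-shot experiments on an unseen dataset. In the zero-shot evaluation, GPT-4V leads with 69.57% accuracy (74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning), showing that while all models excel in Perception, they struggle in Prediction. AutoDrive-QA thus establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across autonomous driving datasets, improving generalization in this field. All code is released in the AutoDrive-QA GitHub Repository.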
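The reported GPT-4V figures can be checked with a few lines of arithmetic. The three category scores and the overall score are taken from the abstract above; treating the overall score as the unweighted mean of the three categories is an assumption that happens to reproduce the reported number, not something the abstract states.

```python
# Category accuracies (percent) reported for GPT-4V in the zero-shot evaluation.
category_accuracy = {"Perception": 74.94, "Prediction": 65.33, "Planning": 68.45}

# Unweighted (macro) average over the three task categories.
macro_average = sum(category_accuracy.values()) / len(category_accuracy)

# Category with the lowest accuracy.
weakest = min(category_accuracy, key=category_accuracy.get)

print(round(macro_average, 2), weakest)
```

The mean rounds to the reported 69.57%, and Prediction is the weakest category, in line with the abstract's observation.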

Title: RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

Authors: Parham Saremi, Amar Kumar, Mohammed Mohammed, Zahra TehraniNasab, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15784
Pdf URL: https://arxiv.org/pdf/2503.15784
Copy Paste: [[2503.15784]] RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models(https://arxiv.org/abs/2503.15784)
Keywords: generation
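Abstract: Vision-Language Foundation Models (VLFMs) have shown a tremendous increase in performance in terms of generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset where the generated images exhibit improved generation quality and alignment with the prompt over the fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.
Summary: Vision-Language Foundation Models (VLFMs) have improved greatly at generating high-resolution, photorealistic natural images. Although VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions. This is a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. The authors propose a multi-stage architecture in which a pre-trained VLFM provides a cursory semantic understanding and a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with the synthesized images. On a medical imaging skin dataset, the generated images show better generation quality and prompt alignment than a fine-tuned Stable Diffusion. The synthesized samples can also be used as augmentation to improve disease classifier performance for underrepresented subgroups.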

Title: DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs

Authors: Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, Vikas Yadav
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15793
Pdf URL: https://arxiv.org/pdf/2503.15793
Copy Paste: [[2503.15793]] DNA Bench: When Silence is Smarter -- Benchmarking Over-Reasoning in Reasoning LLMs(https://arxiv.org/abs/2503.15793)
Keywords: generation
Abstract: Test-time scaling has significantly improved large language model performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability also leads to excessive token generation and unnecessary problem-solving attempts. We introduce Don't Answer Bench (DNA Bench), a new benchmark designed to evaluate LLMs' ability to robustly understand tricky reasoning triggers and avoid unnecessary generation. DNA Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly not for many of the recent prominent LLMs. DNA Bench tests models' abilities across different capabilities, such as instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable question recognition. We evaluate reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI O3-mini, and Claude-3.7-sonnet, and compare them against a powerful non-reasoning model, e.g., GPT-4o. Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle efficiently with higher accuracy. Our findings underscore the need for more effective training and inference strategies in RLMs.
Summary: Test-time scaling has significantly improved large language model performance by enabling deeper reasoning on complex problems, but the added reasoning capability also leads to excessive token generation and unnecessary problem-solving attempts. The authors introduce Don't Answer Bench (DNA Bench), a new benchmark that evaluates whether LLMs can robustly recognize tricky reasoning triggers and avoid unnecessary generation. DNA Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly not for many recent prominent LLMs. It tests models across capabilities such as instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable question recognition. Reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI O3-mini, and Claude-3.7-sonnet, are compared against a powerful non-reasoning model, e.g., GPT-4o. The experiments show that RLMs generate up to 70x more tokens than necessary and often fail at tasks that simpler non-reasoning models handle efficiently and more accurately. The findings underscore the need for more effective training and inference strategies in RLMs.

Title: Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

Authors: Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15818
Pdf URL: https://arxiv.org/pdf/2503.15818
Copy Paste: [[2503.15818]] Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection(https://arxiv.org/abs/2503.15818)
Keywords: generative
Abstract: 3D point clouds have been widely used in applications such as self-driving cars, robotics, and CAD models. To the best of our knowledge, these applications raise the issue of privacy leakage in 3D point clouds, which has not been studied well. Unlike 2D image privacy, which is related to texture and 2D geometric structure, the 3D point cloud is texture-less and only relevant to 3D geometric structure. In this work, we define the 3D point cloud privacy problem and propose an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further design a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is randomly orthogonally rotated to further protect the original geometric structure; the class-to-class relationship is preserved after rotation, so the protected point cloud can support the recognition task. We evaluate our model on multiple datasets and achieve comparable recognition results on encrypted point clouds compared to the original point clouds.
Summary: 3D point clouds are widely used in applications such as self-driving cars, robotics, and CAD models. To the authors' knowledge, these applications raise the issue of privacy leakage in 3D point clouds, which has not been well studied. Unlike 2D image privacy, which relates to texture and 2D geometric structure, a 3D point cloud is texture-less and relates only to 3D geometric structure. The work defines the 3D point cloud privacy problem and proposes an efficient privacy-preserving framework, PointFlowGMM, that supports downstream classification and segmentation without seeing the original data. A flow-based generative model projects the point cloud into a latent subspace with a Gaussian mixture distribution. A novel angular similarity loss obfuscates the original geometric structure and reduces the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is then randomly orthogonally rotated to further protect the original geometric structure; the class-to-class relationship is preserved after rotation, so the protected point cloud can still support recognition. On multiple datasets, recognition results on encrypted point clouds are comparable to those on the original point clouds.
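Why a random orthogonal rotation can hide coordinates while keeping angular relationships intact is easy to check numerically. The sketch below builds a random orthogonal matrix by Gram-Schmidt and compares cosine similarity before and after the transform. The latent dimension, the sampling procedure, and the function names are assumptions for illustration; the abstract does not say how PointFlowGMM samples its rotation.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_orthogonal(dim, rng):
    """Random orthogonal matrix (rotation or reflection) with orthonormal rows, via Gram-Schmidt."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0, 1) for _ in range(dim)]
        for b in basis:
            proj = dot(v, b)
            v = [x - proj * y for x, y in zip(v, b)]  # remove the component along b
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis

def rotate(matrix, vec):
    """Matrix-vector product."""
    return [dot(row, vec) for row in matrix]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

rng = random.Random(0)
dim = 4  # illustrative latent dimension
R = random_orthogonal(dim, rng)
a = [rng.gauss(0, 1) for _ in range(dim)]
b = [rng.gauss(0, 1) for _ in range(dim)]

before = cosine(a, b)
after = cosine(rotate(R, a), rotate(R, b))
print(before, after)
```

The individual coordinates change, but inner products and therefore angles between latent vectors do not, which is the property that lets relationships between classes survive the rotation.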

Title: EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Authors: Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15831
Pdf URL: https://arxiv.org/pdf/2503.15831
Copy Paste: [[2503.15831]] EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation(https://arxiv.org/abs/2503.15831)
Keywords: generation
Abstract: Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.
摘要：长期以来，处理复合物或非线性运动模式对视频框架插值提出了挑战。尽管基于扩散的方法的最新进展可改善基于传统的基于光流的方法，但它们仍然很难在运动大型方案中产生锋利的，时间一致的框架。为了解决这一限制，我们介绍了伊甸园，这是高质量的大型视频框架插值的增强扩散。我们的方法首先利用基于变压器的令牌来生成用于扩散模型的中间帧的精制潜图。然后，我们在整个过程中以时间关注来增强扩散变压器，并结合嵌入的起始框架差异以指导动态运动的产生。广泛的实验表明，伊甸园在流行的基准测试中取得了最新的结果，包括戴维斯和SNU-FILM的LPIP近10％，以及Dain-HD的8％提高。

Title: What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?

Authors: Xuanming Cui, Jaiminkumar Ashokbhai Bhoi, Chionh Wei Peng, Adriel Kuek, Ser Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15846
Pdf URL: https://arxiv.org/pdf/2503.15846
Copy Paste: [[2503.15846]] What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?(https://arxiv.org/abs/2503.15846)
Keywords: generation
Abstract: Dynamic Scene Graph Generation (DSGG) for videos is a challenging task in computer vision. While existing approaches often focus on sophisticated architectural design and solely use recall during evaluation, we take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods: severe precision-recall trade-off, lack of awareness on triplet importance, and inappropriate evaluation protocols. On the other hand, recent advances of Large Multimodal Models (LMMs) have shown great capabilities in video understanding, yet they have not been tested on fine-grained, frame-wise understanding tasks like DSGG. In this work, we conduct the first systematic analysis of Video LMMs for performing DSGG. Without relying on sophisticated architectural design, we show that LMMs with simple decoder-only structure can be turned into State-of-the-Art scene graph generators that effectively overcome the aforementioned issues, while requiring little finetuning (5-10% training data).
摘要：在计算机视觉中，用于视频的动态场景图（DSGG）是一项艰巨的任务。尽管现有的方法通常集中在复杂的建筑设计上，并且在评估过程中仅使用召回率，但我们仔细研究了他们的预测场景图，并通过现有DSGG方法发现了三个关键问题：严重的Precision-Precision-Recall Recall折衷，对三胞胎重要性的认识不足以及不当评估协议。另一方面，大型多模型模型（LMM）的最新进展在视频理解中显示出很大的功能，但是尚未对诸如DSGG（例如DSGG）的细粒度，框架理解的任务进行测试。在这项工作中，我们对执行DSGG的视频LMM进行了首次系统分析。在不依赖复杂的体系结构设计的情况下，我们表明具有简单的仅解码器结构的LMM可以将其转变为最新的场景图生成器，这些发电机有效地克服了上述问题，同时几乎不需要填充（5-10％的培训数据）。

Title: Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Authors: Zhou Zhenglin, Ma Fan, Fan Hehe, Chua Tat-Seng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15851
Pdf URL: https://arxiv.org/pdf/2503.15851
Copy Paste: [[2503.15851]] Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion(https://arxiv.org/abs/2503.15851)
Keywords: generation
Abstract: Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: this https URL.
摘要：可动画的头部头像生成通常需要大量的培训数据。为了减少数据需求，一种自然解决方案是利用现有的无数据静态化身生成方法，例如带有评分蒸馏采样（SDS）的预训练扩散模型，该模型与扩散模型中的伪基地面真相输出相结合。但是，由于生成的视频中的空间和时间不一致，直接从视频扩散中蒸馏出4D化身通常会导致过度光滑的结果。为了解决此问题，我们提出了一种稳健的方法零1-A，该方法使用视频扩散模型合成了4D Avatar重建的空间和时间一致性数据集。具体而言，零1-1的迭代构建视频数据集并以渐进的方式优化可动画化的化身，从而确保头像质量在整个学习过程中平稳，一致地增加。这种渐进式学习涉及两个阶段：（1）空间一致性学习修复表达式并从前面视图中学习，以及（2）时间一致性学习修复了视图，并从放松到夸张的表达式中学习，以简单的复杂方式产生4D化身。广泛的实验表明，与现有的基于扩散的方法相比，零1-A可以提高忠诚度，动画质量和渲染速度，从而为栩栩如生的化身创建提供了解决方案。代码可公开可用：此HTTPS URL。

Title: VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Authors: Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15855
Pdf URL: https://arxiv.org/pdf/2503.15855
Copy Paste: [[2503.15855]] VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling(https://arxiv.org/abs/2503.15855)
Keywords: generation, generative
Abstract: We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
摘要：我们提出了VideorFsplat，这是一种直接的文本到3D模型，利用视频生成模型生成现实的3D高斯碎片（3DGS），以实现无界现实世界的场景。为了生成现实世界场景的各种相机姿势和无限的空间范围，同时确保对任意文本提示的概括，以前的方法微调2D生成模型可以共同模型相机姿势和多视图图像。但是，由于模态差距将2D生成模型扩展到关节模型时，这些方法遭受了不稳定的影响，这需要稳定训练和推理的其他模型。在这项工作中，我们提出了一种体系结构和采样策略，以在微调视频生成模型时共同建模多视图图像和相机。我们的核心想法是一种双流式体系结构，该体系结构将专用的姿势生成模型与通过通信块进行预训练的视频生成模型一起，通过单独的流来生成多视图图像和相机姿势。该设计减少了姿势和图像方式之间的干扰。此外，我们提出了一种异步采样策略，该策略比多视图图像更快地姿势姿势，从而可以快速降低姿势，从而调理多视图的生成，减少相互的歧义并增强交叉模式的一致性。经过多个大型现实世界数据集（RealEstate10k，Mvimgnet，DL3DV-10K，Acid），VideorFSPlat的培训优于现有的文本到3D直接生成方法，这些方法在很大程度上取决于通过得分蒸馏采样，无需进行这种细化就可以通过分数蒸馏采样，而无需进行这种细化。
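
The asynchronous sampling idea in the VideoRFSplat abstract, where camera poses are denoised faster than multi-view images, can be pictured as two noise schedules that advance at different rates. The sketch below is a toy illustration only: the linear schedules, the `pose_speedup` parameter, and the function name `async_schedule` are assumptions, not the schedule used in the paper.

```python
def async_schedule(num_steps, pose_speedup=2.0):
    """Return (image_noise, pose_noise) pairs in [0, 1]; poses reach 0 first."""
    pairs = []
    for i in range(num_steps + 1):
        progress = i / num_steps
        image_noise = 1.0 - progress
        # Poses follow the same linear ramp, compressed by pose_speedup.
        pose_noise = max(0.0, 1.0 - pose_speedup * progress)
        pairs.append((image_noise, pose_noise))
    return pairs


schedule = async_schedule(4)
```

With four steps and a speedup of 2, the pose stream is fully denoised at the midpoint, so the second half of image denoising is conditioned on clean poses.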

Title: UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

Authors: Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15868
Pdf URL: https://arxiv.org/pdf/2503.15868
Copy Paste: [[2503.15868]] UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations(https://arxiv.org/abs/2503.15868)
Keywords: restoration
Abstract: Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a smartly designed curriculum learning recipe. Additionally, we also introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations. Project page: this https URL
摘要：图像恢复对于增强跨计算机视觉任务的降级图像至关重要。但是，大多数现有的方法一次仅处理一次单一类型的降解（例如，模糊，噪声或雾霾）一次，从而限制了其现实世界中经常同时发生多种降解的现实适用性。在本文中，我们提出了独角兽，Unicorn是一种统一的图像恢复方法，能够使用多头扩散模型同时处理多种降解类型。具体而言，我们发现了从图像中提取的低级视觉提示的潜力，从而指导一个可控制的扩散模型，以实现现实世界图像恢复，并通过Experters策略设计了一个多头控制网络。我们通过智能设计的课程学习配方训练模型，而无需任何事先假设特定降解。此外，我们还引入了Metarestore，这是一种金属成像基准测试，其中包含具有多种降解和伪影的图像。对包括我们的基准在内的几个具有挑战性的数据集进行了广泛的评估，表明我们的方法可以实现巨大的性能增长，并且可以通过严重降解来稳健地恢复图像。项目页面：此HTTPS URL

Title: MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

Authors: Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, Bing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15875
Pdf URL: https://arxiv.org/pdf/2503.15875
Copy Paste: [[2503.15875]] MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving(https://arxiv.org/abs/2503.15875)
Keywords: generation
Abstract: In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: this https URL.
摘要：近年来，数据驱动的技术具有非常高级的自主驾驶系统，但是对稀有和多样化的培训数据的需求仍然是一个挑战，需要对设备和劳动力进行大量投资。预测和生成未来环境状态的世界模型通过合成带注释的视频数据进行培训提供了有希望的解决方案。但是，现有的方法难以生成长时间，一致的视频而不会累积错误，尤其是在动态场景中。为了解决这个问题，我们提出了米拉（Mila），这是一个新颖的框架，用于产生高保真，长期视频，长达一分钟。 MILA利用一种粗到RE（精细）的方法来稳定视频生成并正确地失真动态对象。此外，我们介绍了一个时间渐进的DeNoising调度程序和联合DeNoising和校正流量模块，以提高生成的视频的质量。 Nuscenes数据集的广泛实验表明，MILA在视频生成质量方面取得了最新的性能。有关更多信息，请访问项目网站：此HTTPS URL。

Title: Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Authors: Tiange Xiang, Kai Li, Chengjiang Long, Christian Häne, Peihong Guo, Scott Delp, Ehsan Adeli, Li Fei-Fei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15877
Pdf URL: https://arxiv.org/pdf/2503.15877
Copy Paste: [[2503.15877]] Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation(https://arxiv.org/abs/2503.15877)
Keywords: generation
Abstract: Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.
摘要：文本到图像扩散模型的最新进展是由配对2D数据的可用性提高所驱动的。但是，高质量3D数据的稀缺性阻碍了3D扩散模型的发展，与2D同行相比，竞争性能较低。为了应对这一挑战，我们建议重新利用3D对象生成的预训练的2D扩散模型。我们介绍了高斯地图集，这是一种利用密集的2D网格的新型表示，使2D扩散模型的微调生成了3D高斯。我们的方法表明，从预先训练的2D扩散模型到从3D结构扁平的2D歧管中的成功转移学习。为了支持模型训练，我们编译了一个大规模数据集，该数据集由各种3D对象组成205K高质量的3D高斯配件。我们的实验结果表明，文本到图像扩散模型可以有效地适合3D内容生成，从而弥合了2D和3D建模之间的差距。

Title: UMIT: Unifying Medical Imaging Tasks via Vision-Language Models

Authors: Haiyang Yu, Siyang Yi, Ke Niu, Minghan Zhuo, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15892
Pdf URL: https://arxiv.org/pdf/2503.15892
Copy Paste: [[2503.15892]] UMIT: Unifying Medical Imaging Tasks via Vision-Language Models(https://arxiv.org/abs/2503.15892)
Keywords: generation
Abstract: With the rapid advancement of deep learning, particularly in the field of medical image analysis, an increasing number of Vision-Language Models (VLMs) are being widely applied to solve complex health and biomedical challenges. However, existing research has primarily focused on specific tasks or single modalities, which limits their applicability and generalization across diverse medical scenarios. To address this challenge, we propose UMIT, a unified multi-modal, multi-task VLM designed specifically for medical imaging tasks. UMIT is able to solve various tasks, including visual question answering, disease detection, and medical report generation. In addition, it is applicable to multiple imaging modalities (e.g., X-ray, CT and PET), covering a wide range of applications from basic diagnostics to complex lesion analysis. Moreover, UMIT supports both English and Chinese, expanding its applicability globally and ensuring accessibility to healthcare services in different linguistic contexts. To enhance the model's adaptability and task-handling capability, we design a unique two-stage training strategy and fine-tune UMIT with designed instruction templates. Through extensive empirical evaluation, UMIT outperforms previous methods in five tasks across multiple datasets. The performance of UMIT indicates that it can significantly enhance diagnostic accuracy and workflow efficiency, thus providing effective solutions for medical imaging applications.
摘要：随着深度学习的快速发展，尤其是在医学图像分析领域，越来越多的视觉模型（VLM）广泛应用于解决复杂的健康和生物医学挑战。但是，现有的研究主要集中在特定的任务或单一模式上，这限制了它们在各种医疗方案中的适用性和概括。为了应对这一挑战，我们提出了UMIT，这是一种专门为医学成像任务设计的统一的多模式的多任务VLM。 UMIT能够解决各种任务，包括视觉问答，疾病检测和医疗报告生成。此外，它适用于多种成像方式（例如X射线，CT和PET），涵盖了从基本诊断到复杂病变分析的广泛应用。此外，UMIT支持英语和中文，在全球范围内扩展其适用性，并确保在不同语言环境中获得医疗服务的可访问性。为了增强模型的适应性和任务处理能力，我们设计了独特的两阶段训练策略，并使用设计的指令模板进行了微调。通过广泛的经验评估，UMIT在多个数据集的五个任务中优于先前的方法。 UMIT的性能表明它可以显着提高诊断准确性和工作流程效率，从而为医学成像应用提供有效的解决方案。

Title: BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

Authors: Hui Zhang, Tingwei Gao, Jie Shao, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15927
Pdf URL: https://arxiv.org/pdf/2503.15927
Copy Paste: [[2503.15927]] BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers(https://arxiv.org/abs/2503.15927)
Keywords: generation
Abstract: Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
摘要：扩散模型表现出了令人印象深刻的发电能力，尤其是在最近利用变压器体系结构来提高视觉和艺术质量的进步时。但是，扩散变压器（DITS）继续遇到与低推理速度相关的挑战，这主要是由于迭代降解过程。为了解决这个问题，我们提出了Blockdance，这是一种无训练的方法，该方法在相邻的时间步骤中探索相似之处以加速DIT。与以前缺乏在不同尺度上缺乏特征的重复使用策略的功能重新使用方法不同，Blockdance优先列出了最结构相似的特征的识别，即在结构上相似的时空（STSS）特征。这些特征主要位于变形金刚的结构块内，后期在变形金刚的后期阶段。封锁缓存和重用这些高度相似的功能，以减轻冗余计算，从而加速dit，同时最大程度地与原始模型的生成结果保持一致性。此外，考虑到生成的内容的多样性和冗余功能的不同分布，我们引入了Blockdance-ADA，这是一个针对特定实例的加速量身定制的轻量级决策网络。 Blockdance-ADA动态分配资源并提供卓越的内容质量。在各种一代任务和模型中，Blockdance和Blockdance-ADA都有效地有效，在维持生成质量的同时，达到25％至50％的加速度。
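
The caching mechanism described in the BlockDance abstract can be sketched as a denoising loop that recomputes part of the network only occasionally during later steps and otherwise reuses the cached features. Everything concrete below is an assumption for illustration: the split into `shallow` and `deep` blocks, the `reuse_from` and `period` parameters, and the scalar toy blocks. The abstract does not specify which blocks are cached or how often they are refreshed.

```python
def run_blocks(h, blocks):
    """Apply a list of blocks in order."""
    for block in blocks:
        h = block(h)
    return h


def denoise_with_reuse(x, shallow, deep, steps, reuse_from, period):
    """Toy loop: from step `reuse_from` on, refresh the cached shallow
    features only every `period` steps and reuse them in between."""
    cached = None
    shallow_calls = 0
    for t in range(steps):
        refresh = cached is None or t < reuse_from or (t - reuse_from) % period == 0
        if refresh:
            cached = run_blocks(x, shallow)
            shallow_calls += 1
        # The remaining blocks always run, on fresh or cached features.
        x = x - 0.1 * run_blocks(cached, deep)
    return x, shallow_calls


shallow = [lambda v: 0.9 * v, lambda v: v + 0.05]  # stand-ins for cached blocks
deep = [lambda v: 0.5 * v]                         # stand-in for the remaining blocks

_, full_calls = denoise_with_reuse(1.0, shallow, deep, steps=10, reuse_from=10, period=1)
_, reuse_calls = denoise_with_reuse(1.0, shallow, deep, steps=10, reuse_from=4, period=3)
```

In this toy setting the cached part runs 6 times instead of 10. The trade-off is that reused features are slightly stale, which is why the abstract restricts reuse to features that are highly similar across adjacent steps.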

Title: UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation

Authors: Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15940
Pdf URL: https://arxiv.org/pdf/2503.15940
Copy Paste: [[2503.15940]] UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation(https://arxiv.org/abs/2503.15940)
Keywords: generation
Abstract: Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing state-of-the-art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at this https URL.
摘要：自动放射学报告的一代旨在加快放射科医生的繁琐且容易出错的报告过程。尽管最近的作品取得了进步，但由于标记的医学数据的相对稀缺性，学习对齐医学图像和文本发现仍然具有挑战性。例如，此任务的数据集比计算机视觉中用于图像字幕的数据集小得多。在这项工作中，我们建议将表示形式从剪贴画（一个大规模训练的视觉模型）转移，以更好地捕获图像和文本之间的跨模式语义。但是，由于自然图像和放射学之间的域间隙，直接应用夹子是次优的。为了启用有效的适应，我们引入了Unicrossadapter，轻巧的适配器模块，这些模块被整合到剪辑中并在目标任务上进行了微调，同时将基本参数固定为固定。这些适配器分布在模态及其相互作用之间，以增强视觉语言对准。两个公共数据集上的实验证明了我们方法的有效性，并在放射学报告生成中推进了最新的。拟议的转移学习框架提供了一种利用大规模预训练模型的语义知识的方法，以解决数据筛查医学视觉语言任务。代码可在此HTTPS URL上找到。

Title: Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

Authors: Kendong Liu, Zhiyu Zhu, Hui Liu, Junhui Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15975
Pdf URL: https://arxiv.org/pdf/2503.15975
Copy Paste: [[2503.15975]] Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation(https://arxiv.org/abs/2503.15975)
Keywords: generation, generative
Abstract: We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a $20\times$ increase in computational efficiency but also yields notable quality improvements compared to the state of the art.
摘要：我们提出ACC3D，以应对加速从单个图像生成3D模型的扩散过程的挑战。为了通过几步推论得出高质量的重建，我们强调了在随机噪声状态下将得分功能的学习正规化的关键问题。为此，我们提出了边缘一致性，即在高信噪比区域之间的一致预测，以增强预训练的扩散模型，从而实现基于蒸馏的端点得分函数的完善。在这些蒸馏扩散模型的基础上，我们提出了一种对抗性增强策略，以进一步丰富生成细节并提高整体发电质量。这两个模块相互补充，相互加强以提高生成性能。广泛的实验表明，与最新的ACC3D相比，我们的ACC3D不仅可以实现计算效率的20美元\ tims $提高，而且还可以产生显着的质量改进。

Title: A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli

Authors: Pengyu Liu, Guohua Dong, Dan Guo, Kun Li, Fengling Li, Xun Yang, Meng Wang, Xiaomin Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15978
Pdf URL: https://arxiv.org/pdf/2503.15978
Copy Paste: [[2503.15978]] A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli(https://arxiv.org/abs/2503.15978)
Keywords: generation
Abstract: In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets, relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field. For more information and resources related to this survey, visit this https URL.
摘要：在日常生活中，我们遇到了各种外部刺激，例如图像，声音和视频。随着多模式刺激和神经科学进展的研究，基于fMRI的大脑解码已成为理解大脑感知及其复杂认知过程的关键工具。解码大脑信号重建刺激不仅揭示了复杂的神经机制，还可以推动AI，疾病治疗和脑部计算机界面的进展。神经影像学和图像产生模型的最新进展已显着改善了基于fMRI的解码。尽管fMRI为精确的大脑活动映射提供了高空间分辨率，但其时间分辨率低和信号噪声构成挑战。同时，诸如gan，vaes和扩散模型之类的技术增强了重建图像质量，并且多模式预训练的模型已增强了跨模式解码任务。这项调查系统地回顾了基于功能磁共振成像的大脑解码的最新进展，重点是被动脑信号的刺激重建。它总结了数据集，相关的大脑区域，并按模型结构对现有方法进行了分类。此外，它评估了模型性能并讨论其有效性。最后，它确定了主要的挑战，并提出了未来的研究方向，为该领域提供了宝贵的见解。有关与此调查有关的更多信息和资源，请访问此HTTPS URL。

Title: DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration

Authors: Suraj Singh, Anastasia Batsheva, Oleg Y. Rogov, Ahmed Bouridane
Subjects: cs.CV, astro-ph.IM, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.15984
Pdf URL: https://arxiv.org/pdf/2503.15984
Copy Paste: [[2503.15984]] DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration(https://arxiv.org/abs/2503.15984)
Keywords: restoration, super-resolution, generation
Abstract: Contemporary image restoration and super-resolution techniques effectively harness deep neural networks, markedly outperforming traditional methods. However, astrophotography presents unique challenges for deep learning due to limited training data. This work explores hybrid strategies, such as the Deep Image Prior (DIP) model, which facilitates blind training but is susceptible to overfitting, artifact generation, and instability when handling noisy images. We propose enhancements to the DIP model's baseline performance through several advanced techniques. First, we refine the model to process multiple frames concurrently, employing the Back Projection method and the TVNet model. Next, we adopt a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique to achieve unbiased estimates with minimal variance and counteract overfitting effectively. Collectively, these modifications reduce the likelihood of noise learning and mitigate loss function fluctuations during training, enhancing result stability. We validated our algorithm across multiple image sets of astronomical and celestial objects, achieving performance that not only mitigates the limitations of Lucky Imaging, a classical computer vision technique that remains a standard in astronomical image reconstruction, but also surpasses the original DIP model and state-of-the-art transformer- and diffusion-based models, underscoring the significance of our improvements.
摘要：当代图像恢复和超分辨率技术有效地利用了深层神经网络，明显优于传统方法。但是，由于培训数据有限，天文学造影对深度学习提出了独特的挑战。这项工作探讨了混合策略，例如深层图像先验（DIP）模型，该模型有助于盲目训练，但在处理嘈杂的图像时容易受到过度拟合，人工制品的产生和不稳定性的影响。我们通过多种高级技术提出了DIP模型基线性能的增强。首先，我们通过使用返回投影方法和TVNet模型并同时处理多个模型，以同时处理多个帧。接下来，我们采用一种马尔可夫的方法，该方法结合了蒙特卡洛估计，兰格文动力学和一种变异输入技术，以实现无偏见的估计，并有效地抵消过度拟合。总的来说，这些修改减少了噪声学习的可能性，并减轻训练过程中损失功能的波动，从而增强了结果稳定性。我们验证了跨天文和天体对象的多个图像集验证我们的算法，这不仅可以减轻幸运成像的局限性，这是一种经典的计算机视觉技术，它仍然是天文学图像重建的标准，而且超过了原始的DIP模型，是艺术变压器的状态，基于ART和扩散的模型的状态，以及我们的改进的意义。

Title: Automating 3D Dataset Generation with Neural Radiance Fields

Authors: P. Schulz, T. Hempel, A. Al-Hamadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15997
Pdf URL: https://arxiv.org/pdf/2503.15997
Copy Paste: [[2503.15997]] Automating 3D Dataset Generation with Neural Radiance Fields(https://arxiv.org/abs/2503.15997)
Keywords: generation
Abstract: 3D detection is a critical task to understand spatial characteristics of the environment and is used in a variety of applications including robotics, augmented reality, and image retrieval. Training performant detection models requires diverse, precisely annotated, and large-scale datasets that involve complex and expensive creation processes. Hence, there are only a few public 3D datasets, which are additionally limited in their range of classes. In this work, we propose a pipeline for automatic generation of 3D datasets for arbitrary objects. By utilizing the universal 3D representation and rendering capabilities of Radiance Fields, our pipeline generates high-quality 3D models for arbitrary objects. These 3D models serve as input for a synthetic dataset generator. Our pipeline is fast, easy to use, and has a high degree of automation. Our experiments demonstrate that 3D pose estimation networks trained with our generated datasets achieve strong performance in typical application scenarios.
摘要：3D检测是了解环境空间特征的关键任务，并用于包括机器人技术，增强现实和图像检索在内的各种应用中。培训性能检测模型需要各种各样的，精确的注释和大规模数据集，涉及复杂且昂贵的创建过程。因此，只有很少的公共3D数据集在其类别范围内受到限制。在这项工作中，我们为自动生成任意对象的3D数据集提供了一条管道。通过利用辐射场的通用3D表示和渲染功能，我们的管道为任意对象生成了高质量的3D模型。这些3D模型是合成数据集发电机的输入。我们的管道快速，易于使用，并且具有高度的自动化。我们的实验表明，通过生成的数据集训练的3D构成估计网络，在典型的应用程序场景中归档了较强的性能。

Title: SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks

Authors: Haojia Gao, Haohua Que, Hoiian Au, Weihao Shan, Mingkai Liu, Yusen Qin, Lei Mu, Rong Zhao, Xinghua Yang, Qi Wei, Fei Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16000
Pdf URL: https://arxiv.org/pdf/2503.16000
Copy Paste: [[2503.16000]] SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks(https://arxiv.org/abs/2503.16000)
Keywords: generative
Abstract: This paper proposes SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, which addresses the limitations of traditional methods in computational overhead and environmental generalization. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), we designed a lightweight prediction model with merely 709k parameters. Our smallest model achieves better performance on the KTH dataset than U-net (24.5M) and LaMa (51M), delivering a PSNR of 9.026 and an SSIM of 0.718, which represents a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing demonstrates its strong generalization capability, with an FID score of 161.55 on the HouseExpo dataset, significantly outperforming comparable methods. Regarding exploration efficiency, on the KTH dataset, SenseExpo achieves approximately a 67.9% reduction in exploration time compared to MapEx. On the MRPB 1.0 dataset, SenseExpo achieves roughly a 77.1% time reduction compared to MapEx. Deployed as a plug-and-play ROS node, the framework seamlessly integrates with existing navigation systems, providing an efficient solution for resource-constrained devices.
摘要：本文提出了基于轻量预测网络的有效自主探索框架SenseExpo，该框架解决了计算开销和环境概括中传统方法的局限性。通过整合生成对抗网络（GAN），变压器和快速傅立叶卷积（FFC），我们设计了一个仅具有709K参数的轻量级预测模型。我们最小的模型在KTH数据集上的性能要比U-NET（24.5m）和LAMA（51m）更好，可提供PSNR 9.026和SSIM 0.718，尤其是代表51m-Caremeter Lama Model的38.7％PSNR改善。跨域测试证明了其强大的概括能力，在HouseexPO数据集上的FID得分为161.55，大大优于可比较的方法。关于勘探效率，在KTH数据集上，SenseExpo与Mapex相比，勘探时间降低了约67.9％。在MRPB 1.0数据集上，SenseExpo与Mapex相比大致达到了77.1％的时间缩短。该框架部署为插件ROS节点，与现有导航系统无缝集成，为资源受限设备提供了有效的解决方案。

Title: Single Image Iterative Subject-driven Generation and Editing

Authors: Yair Shpitzer, Gal Chechik, Idan Schwartz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.16025
Pdf URL: https://arxiv.org/pdf/2503.16025
Copy Paste: [[2503.16025]] Single Image Iterative Subject-driven Generation and Editing(https://arxiv.org/abs/2503.16025)
Keywords: generation
Abstract: Personalizing image generation and editing is particularly challenging when we only have a few images of the subject, or even a single image. A common approach to personalization is concept learning, which can integrate the subject into existing models relatively quickly, but produces images whose quality tends to deteriorate quickly when the number of subject images is small. Quality can be improved by pre-training an encoder, but training restricts generation to the training distribution and is time-consuming. Personalizing image generation and editing from a single image without training remains a hard open challenge. Here, we present SISO, a novel, training-free approach based on optimizing a similarity score with an input subject image. More specifically, SISO iteratively generates images and optimizes the model based on loss of similarity with the given subject image until a satisfactory level of similarity is achieved, allowing plug-and-play optimization to any image generator. We evaluated SISO in two tasks, image editing and image generation, using a diverse data set of personal subjects, and demonstrate significant improvements over existing methods in image quality, subject fidelity, and background preservation.

Title: Closer to Ground Truth: Realistic Shape and Appearance Labeled Data Generation for Unsupervised Underwater Image Segmentation

Authors: Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16051
Pdf URL: https://arxiv.org/pdf/2503.16051
Copy Paste: [[2503.16051]] Closer to Ground Truth: Realistic Shape and Appearance Labeled Data Generation for Unsupervised Underwater Image Segmentation(https://arxiv.org/abs/2503.16051)
Keywords: generation
Abstract: Solving fish segmentation in underwater videos, a real-world problem of great practical value in the marine and aquaculture industries, is a challenging task due to the difficulty of the filming environment, poor visibility and limited existing annotated underwater fish data. In order to overcome these obstacles, we introduce a novel two-stage unsupervised segmentation approach that requires no human annotations and combines artificially created and real images. Our method generates challenging synthetic training data by placing virtual fish in real-world underwater habitats, after performing fish transformations such as Thin Plate Spline shape warping and color Histogram Matching, which realistically integrate synthetic fish into the backgrounds, bringing the generated images closer to real-world data with every stage of our approach. While we validate our unsupervised method on the popular DeepFish dataset, obtaining a performance close to a fully-supervised SoTA model, we further show its effectiveness on the specific case of salmon segmentation in underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature (30 GB). Moreover, on both datasets we prove the capability of our approach to boost the performance of the fully-supervised SoTA model.
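The color Histogram Matching step mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is the generic, textbook CDF-based technique and not the authors' code; the function names `match_histogram` and `match_color_histogram` and the per-channel treatment are assumptions made for illustration.

```python
import numpy as np

def match_histogram(source, reference):
    """Remap `source` intensities so their empirical CDF follows that of `reference`."""
    # Unique intensity values, the index of each pixel's value, and value counts.
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)
    # Empirical cumulative distribution functions of both images.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    # For each source quantile, look up the reference intensity at that quantile.
    matched = np.interp(src_cdf, ref_cdf, ref_values)
    return matched[src_idx].reshape(source.shape)

def match_color_histogram(fish_rgb, background_rgb):
    """Hypothetical helper: match each color channel of a fish crop to a background."""
    channels = [
        match_histogram(fish_rgb[..., c], background_rgb[..., c])
        for c in range(fish_rgb.shape[-1])
    ]
    return np.stack(channels, axis=-1)
```

Matching each channel independently is the simplest choice; it moves the synthetic fish toward the color statistics of the habitat but can shift hues, so a real pipeline might work in a different color space.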

Title: Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution

Authors: Wanshu Fan, Yue Wang, Cong Wang, Yunzhe Zhang, Wei Wang, Dongsheng Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16056
Pdf URL: https://arxiv.org/pdf/2503.16056
Copy Paste: [[2503.16056]] Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution(https://arxiv.org/abs/2503.16056)
Keywords: super-resolution
Abstract: Single-Image Super-Resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy. As a potential remedy, in this paper we propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. Our SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, enhancing image detail quality effectively. Specifically, we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to more adeptly capture and utilize semantic priors, thereby enhancing image details. To further explore both local and non-local interactions for improved detail rendition, we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules, as well as a Hybrid Attention Mechanism, which work together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with a multi-adds reduction of 12.81G compared to state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Code is available at this https URL.

Title: Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16057
Pdf URL: https://arxiv.org/pdf/2503.16057
Copy Paste: [[2503.16057]] Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts(https://arxiv.org/abs/2503.16057)
Keywords: generation
Abstract: Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains and promising scaling properties.
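One way to read "tokens and experts compete together and select the top candidates" is a single global top-k over every (token, expert) score, in contrast to the usual per-token top-k. The sketch below illustrates that reading only; it is an assumption, not the Race-DiT implementation, and the names `expert_race_mask` and `token_choice_mask` are hypothetical.

```python
import numpy as np

def expert_race_mask(scores, k):
    """Keep the k highest (token, expert) pairs across the whole score matrix."""
    flat = scores.ravel()
    # Indices of the k largest entries, in no particular order.
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

def token_choice_mask(scores, k_per_token):
    """Baseline for comparison: every token keeps its own k best experts."""
    top = np.argpartition(scores, -k_per_token, axis=1)[:, -k_per_token:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask
```

With a global budget, a token with uniformly high router scores can receive several experts while a low-scoring token receives none, which is the sense in which experts are assigned dynamically to critical tokens.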

Title: PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

Authors: Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16068
Pdf URL: https://arxiv.org/pdf/2503.16068
Copy Paste: [[2503.16068]] PoseTraj: Pose-Aware Trajectory Control in Video Diffusion(https://arxiv.org/abs/2503.16068)
Keywords: generation
Abstract: Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.

Title: MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

Authors: Lucas Morin, Valéry Weber, Ahmed Nassar, Gerhard Ingmar Meijer, Luc Van Gool, Yawei Li, Peter Staar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16096
Pdf URL: https://arxiv.org/pdf/2503.16096
Copy Paste: [[2503.16096]] MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures(https://arxiv.org/abs/2503.16096)
Keywords: generation
Abstract: The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.

Title: FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Authors: Tianyi Wei, Yifan Zhou, Dongdong Chen, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16153
Pdf URL: https://arxiv.org/pdf/2503.16153
Copy Paste: [[2503.16153]] FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing(https://arxiv.org/abs/2503.16153)
Keywords: generation
Abstract: The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.

Title: Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation

Authors: Soham Roy, Abhishek Mishra, Shirish Karande, Murari Mandal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16171
Pdf URL: https://arxiv.org/pdf/2503.16171
Copy Paste: [[2503.16171]] Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation(https://arxiv.org/abs/2503.16171)
Keywords: generation, generative
Abstract: Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized in their training data, raising serious concerns about potential copyright infringement. We introduce Guardians of Generation, a model-agnostic inference-time framework for dynamic copyright shielding in AI image generation. Our approach requires no retraining or modification of the generative model weights, instead integrating seamlessly with existing diffusion pipelines. It augments the generation process with an adaptive guidance mechanism comprising three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps to identify features indicative of copyrighted content before they manifest in the final output. If such content is detected, the prompt rewriting mechanism dynamically transforms the user's prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt's intended semantics. The adaptive guidance module steers the diffusion process away from flagged content by modulating the model's sampling trajectory. Together, these components form a robust shield that enables a tunable balance between preserving creative fidelity and ensuring copyright compliance. We validate our method on a variety of generative models such as Stable Diffusion, SDXL, and Flux, demonstrating substantial reductions in copyrighted content generation with negligible impact on output fidelity or alignment with user intent. This work provides a practical, plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Source code is available at: this https URL

Title: Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction

Authors: Ziyao Guo, Kaipeng Zhang, Michael Qizhe Shieh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16194
Pdf URL: https://arxiv.org/pdf/2503.16194
Copy Paste: [[2503.16194]] Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction(https://arxiv.org/abs/2503.16194)
Keywords: generation
Abstract: Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods like VQ-VAE. To alleviate the quantization errors in VQ-VAE, recent works tend to use larger codebooks. However, this expands the vocabulary size accordingly, complicating the autoregressive modeling task. This paper aims to find a way to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Through empirical investigation, we discover that tokens with similar codeword representations produce similar effects on the final generated image, revealing significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF), realized by assigning the same coarse label to similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Experiments on ImageNet demonstrate our method's superior performance, achieving an average improvement of 59 points in Inception Score compared to baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
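The idea of giving similar codewords the same coarse label can be sketched by clustering the codebook and mapping each fine token index to its cluster. The abstract above does not say how similarity is grouped, so the plain k-means used here is an assumption, and `cluster_codebook` and `to_coarse` are hypothetical names.

```python
import numpy as np

def cluster_codebook(codebook, num_coarse, iters=20, seed=0):
    """Group codewords with plain k-means; returns one coarse label per codeword."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen codewords as cluster centers.
    centers = codebook[rng.choice(len(codebook), num_coarse, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every codeword to every center.
        dists = ((codebook[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(num_coarse):
            members = codebook[labels == c]
            if len(members):  # leave an empty cluster's center where it is
                centers[c] = members.mean(0)
    return labels

def to_coarse(fine_tokens, coarse_of):
    """Replace each fine token index by the coarse label of its codeword."""
    return coarse_of[fine_tokens]
```

In this sketch the stage-one autoregressive model would predict over `num_coarse` classes rather than `len(codebook)`, and a second model would then recover the fine index within each coarse group.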

Title: VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis

Authors: Chia-Yi Hsu, Jia-You Chen, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16195
Pdf URL: https://arxiv.org/pdf/2503.16195
Copy Paste: [[2503.16195]] VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis(https://arxiv.org/abs/2503.16195)
Keywords: generative
Abstract: Differentially private (DP) synthetic data has become the de facto standard for releasing sensitive data. However, many DP generative models suffer from the low utility of synthetic data, especially for high-resolution images. On the other hand, one of the emerging techniques in parameter efficient fine-tuning (PEFT) is visual prompting (VP), which allows well-trained existing models to be reused for the purpose of adapting to subsequent downstream tasks. In this work, we explore such a phenomenon in constructing captivating generative models with DP constraints. We show that VP in conjunction with DP-NTK, a DP generator that exploits the power of the neural tangent kernel (NTK) in training DP generative models, achieves a significant performance boost, particularly for high-resolution image datasets, with accuracy improving from 0.644$\pm$0.044 to 0.769. Lastly, we perform ablation studies on the effect of different parameters that influence the overall performance of VP-NTK. Our work demonstrates a promising step forward in improving the utility of DP synthetic data, particularly for high-resolution images.

Title: Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts

Authors: Yu Cao, Zengqun Zhao, Ioannis Patras, Shaogang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16218
Pdf URL: https://arxiv.org/pdf/2503.16218
Copy Paste: [[2503.16218]] Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts(https://arxiv.org/abs/2503.16218)
Keywords: generation, generative
Abstract: Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. In our analysis, we identify three distinct phases in the diffusion generative process: Profiling, Mutation, and Refinement. Artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This temporal nature explains why existing methods focusing only on spatial uncertainty of the final output fail at effective artifact localization. Based on these insights, we propose ASCED (Abnormal Score Correction for Enhancing Diffusion), which detects artifacts by monitoring abnormal score dynamics during the diffusion process, with a trajectory-aware on-the-fly mitigation strategy that appropriate generation of noise in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy operates seamlessly within the existing diffusion process. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.

Title: Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

Authors: Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16260
Pdf URL: https://arxiv.org/pdf/2503.16260
Copy Paste: [[2503.16260]] Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data(https://arxiv.org/abs/2503.16260)
Keywords: generation
Abstract: Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leverage (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: it eliminates reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. The fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with ChartCoF achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts.

Title: Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Authors: Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Weinan E, Linfeng Zhang, Guolin Ke
Subjects: cs.LG, cond-mat.mtrl-sci, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.16278
Pdf URL: https://arxiv.org/pdf/2503.16278
Copy Paste: [[2503.16278]] Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens(https://arxiv.org/abs/2503.16278)
Keywords: generation
Abstract: Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at this https URL.
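Why an octree exploits sparsity can be seen from a small occupancy sketch: at each level only the non-empty cells need a token, so a handful of atoms yields a handful of cells even when the grid is fine. This is a simplified illustration and not the Uni-3DAR tokenizer; the function name `octree_occupied_cells` and the assumption that coordinates are normalized to [0, 1) are mine.

```python
import numpy as np

def octree_occupied_cells(points, depth):
    """List, for each octree level, the integer coordinates of non-empty cells.

    Assumes `points` has shape (N, 3) with coordinates normalized to [0, 1).
    Level l splits the unit cube into 2**l cells along each axis.
    """
    levels = []
    for level in range(1, depth + 1):
        # Integer cell coordinates of each point at this resolution.
        cells = np.floor(points * (2 ** level)).astype(int)
        # Only occupied cells are kept, one row per distinct cell.
        levels.append(np.unique(cells, axis=0))
    return levels
```

For two points in opposite corners of the cube, level 3 has 512 possible cells but only 2 occupied ones, so a token sequence built from occupied cells stays short.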

Title: SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

Authors: Inwoo Hwang, Bing Zhou, Young Min Kim, Jian Wang, Chuan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16289
Pdf URL: https://arxiv.org/pdf/2503.16289
Copy Paste: [[2503.16289]] SceneMI: Motion In-betweening for Modeling Human-Scene Interactions(https://arxiv.org/abs/2503.16289)
Keywords: generative
Abstract: Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening -- a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.

Title: Unleashing Vecset Diffusion Model for Fast Shape Generation

Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Fuyun Wang, Huiwen Shi, Xianghui Yang, Qinxiang Lin, Jinwei Huang, Yuhong Liu, Jie Jiang, Chunchao Guo, Xiangyu Yue
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.16302
Pdf URL: https://arxiv.org/pdf/2503.16302
Copy Paste: [[2503.16302]] Unleashing Vecset Diffusion Model for Fast Shape Generation(https://arxiv.org/abs/2503.16302)
Keywords: generation
Abstract: 3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at this https URL.
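Abstract (translated from Chinese): 3D shape generation has flourished through the development of so-called "native" 3D diffusion, particularly the Vecset Diffusion Model (VDM). Although recent advances have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. The difficulty lies not only in accelerating diffusion sampling but also in VAE decoding within VDM, both of which were under-explored in previous work. To address these challenges, we present FlashVDM, a systematic framework for accelerating both the VAE and the DiT in VDM. For the DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps at comparable quality, made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For the VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of the shape surface in the volume, our decoder drastically lowers FLOPs and minimizes the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Systematic evaluation shows that our model significantly outperforms existing fast 3D generation methods, achieving performance comparable to the state of the art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at this https URL.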

Title: Ultra-Resolution Adaptation with Ease

Authors: Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16322
Pdf URL: https://arxiv.org/pdf/2503.16322
Copy Paste: [[2503.16322]] Ultra-Resolution Adaptation with Ease(https://arxiv.org/abs/2503.16322)
Keywords: generation
Abstract: Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Codes are available at this https URL.
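Abstract (translated from Chinese): Text-to-image diffusion models have made remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, especially when training data and computational resources are limited. In this paper, we examine this practical problem from two key perspectives, data efficiency and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation, termed URAE. For data efficiency, we show theoretically and empirically that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning the minor components of the weight matrices outperforms widely used low-rank adapters when synthetic data are unavailable, giving substantial performance gains while maintaining efficiency. In addition, for models that rely on guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments confirm that URAE achieves 2K-generation performance comparable to state-of-the-art closed-source models such as FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Code is available at this https URL.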
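Note (illustrative sketch, not taken from the paper): the remark that a guidance scale of 1 disables classifier-free guidance can be checked against the textbook guidance formula, in which the output is the unconditional prediction pushed toward the conditional one by the scale. The function name `cfg_combine` and the toy vectors are assumptions for this example. A guidance-distilled model such as FLUX consumes the scale differently, and that is not modeled here.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Textbook classifier-free guidance on two noise predictions.

    Returns eps_uncond + scale * (eps_cond - eps_uncond), element by element.
    With scale == 1 this is exactly eps_cond, i.e. no extra guidance.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values, chosen only for illustration).
uncond = [0.0, 1.0]
cond = [0.5, -1.0]
no_guidance = cfg_combine(uncond, cond, 1.0)   # equals cond
amplified = cfg_combine(uncond, cond, 2.0)     # extrapolates past cond
```

At scale 1 the unconditional branch cancels out, which is why the setting is described as turning guidance off.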

Title: UniSync: A Unified Framework for Audio-Visual Synchronization

Authors: Tao Feng, Yifan Xie, Xun Guan, Jiyuan Song, Zhou Liu, Fei Ma, Fei Yu
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.16357
Pdf URL: https://arxiv.org/pdf/2503.16357
Copy Paste: [[2503.16357]] UniSync: A Unified Framework for Audio-Visual Synchronization(https://arxiv.org/abs/2503.16357)
Keywords: generation
Abstract: Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
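Abstract (translated from Chinese): Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made considerable progress on this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, which may limit their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach that evaluates audio-visual synchronization using embedding similarities. UniSync is broadly compatible with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), and effectively handles their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving its discriminative capability. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks improves synchronization quality in both natural and AI-generated content.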
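Note (illustrative sketch, not taken from the paper): scoring synchronization by embedding similarity with a margin-based loss can be pictured as a hinge on cosine similarities, where a synchronized audio-visual pair should score at least `margin` higher than an unsynchronized one. The abstract does not give the exact loss, so the hinge form, the function names, and the default margin of 0.2 are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def margin_sync_loss(audio, video_sync, video_unsync, margin=0.2):
    """Hinge loss: zero once the synchronized pair beats the unsynchronized
    pair by at least `margin` in cosine similarity."""
    gap = cosine(audio, video_sync) - cosine(audio, video_unsync)
    return max(0.0, margin - gap)
```

In this reading, a cross-speaker unsynchronized pair would simply be passed as `video_unsync`. The loss stays positive until the two similarities are separated by the margin.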

Title: NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Authors: Han-Hung Lee, Qinghong Han, Angel X. Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16375
Pdf URL: https://arxiv.org/pdf/2503.16375
Copy Paste: [[2503.16375]] NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes(https://arxiv.org/abs/2503.16375)
Keywords: generation
Abstract: In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.
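Abstract (translated from Chinese): In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation poses unique challenges, including wide variation in scene heights and the need for a method that can rapidly produce large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared with prior resampling-based inpainting schemes and also speeds up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.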

Title: LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Authors: Leyang Wang, Joice Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16376
Pdf URL: https://arxiv.org/pdf/2503.16376
Copy Paste: [[2503.16376]] LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images(https://arxiv.org/abs/2503.16376)
Keywords: generation
Abstract: The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.
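Abstract (translated from Chinese): The success of modern machine learning, particularly in facial translation networks, depends heavily on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and by advances in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). The framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method consists of three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while preserving identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.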

Title: SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Authors: Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16396
Pdf URL: https://arxiv.org/pdf/2503.16396
Copy Paste: [[2503.16396]] SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation(https://arxiv.org/abs/2503.16396)
Keywords: generation
Abstract: We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D. Project page: this https URL.
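Abstract (translated from Chinese): We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared with its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this through key improvements in several aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing the quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments show that SV4D 2.0 brings significant gains both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, as well as in 4D optimization (-12% LPIPS and -24% FV4D), compared with SV4D. Project page: this https URL.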

Title: Scale-wise Distillation of Diffusion Models

Authors: Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16397
Pdf URL: https://arxiv.org/pdf/2503.16397
Copy Paste: [[2503.16397]] Scale-wise Distillation of Diffusion Models(https://arxiv.org/abs/2503.16397)
Keywords: generation
Abstract: We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators. In more detail, SwD is inspired by the recent insights relating diffusion processes to the implicit spectral autoregression. We suppose that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. Also, we enrich the family of distribution matching approaches by introducing a novel patch loss enforcing finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference times of two full resolution steps and significantly outperforms the counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.
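Abstract (translated from Chinese): We present SwD, a scale-wise distillation framework for diffusion models (DMs) that effectively applies next-scale prediction ideas to diffusion-based few-step generators. In more detail, SwD is inspired by recent insights relating diffusion processes to implicit spectral autoregression. We hypothesize that DMs can start generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance, while significantly reducing computational cost. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. We also enrich the family of distribution matching approaches by introducing a novel patch loss that enforces finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference time of two full-resolution steps and significantly outperforms its counterparts under the same computation budget, as shown by automated metrics and human preference studies.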

Title: ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos

Authors: Haolin Yang, Feilong Tang, Ming Hu, Yulong Li, Junjie Guo, Yexin Liu, Zelin Peng, Junjun He, Zongyuan Ge, Imran Razzak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.16400
Pdf URL: https://arxiv.org/pdf/2503.16400
Copy Paste: [[2503.16400]] ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos(https://arxiv.org/abs/2503.16400)
Keywords: generation
Abstract: Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves the high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently possess flexible adjustments of computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on the observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
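Abstract (translated from Chinese): Video diffusion models (VDMs) facilitate the generation of high-quality videos, and current research concentrates mainly on scaling during training through improvements in data quality, computational resources, and model complexity. Inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the inference-time search of VDMs toward better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves high-level object features by referencing the anchor frame from previous chunks, thereby delivering long-term value. Our analysis shows that diffusion models inherently allow flexible adjustment of computation by varying the number of denoising steps, and that even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and then evaluate its long-term value using a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets show that the proposed ScalingNoise effectively improves long video generation.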

Title: DreamTexture: Shape from Virtual Texture with Analysis by Augmentation

Authors: Ananta R. Bhattarai, Xingzhe He, Alla Sheffer, Helge Rhodin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16412
Pdf URL: https://arxiv.org/pdf/2503.16412
Copy Paste: [[2503.16412]] DreamTexture: Shape from Virtual Texture with Analysis by Augmentation(https://arxiv.org/abs/2503.16412)
Keywords: generative
Abstract: DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues -- a novel monocular reconstruction paradigm that we call Analysis by Augmentation.
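Abstract (translated from Chinese): DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, together with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the deformation of the virtual texture with a new conformal map optimization, which avoids memory-intensive volumetric representations. Our experiments show that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues, a novel monocular reconstruction paradigm that we call Analysis by Augmentation.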

Title: InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

Authors: Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.16418
Pdf URL: https://arxiv.org/pdf/2503.16418
Copy Paste: [[2503.16418]] InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity(https://arxiv.org/abs/2503.16418)
Keywords: generation
Abstract: Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
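Abstract (translated from Chinese): Achieving flexible, high-fidelity identity-preserved image generation remains a formidable challenge, particularly with advanced Diffusion Transformers (DiTs) such as FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks to leverage DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, improves image quality, and alleviates face copy-pasting. Extensive experiments show that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.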

Title: SynCity: Training-Free Generation of 3D Worlds

Authors: Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16420
Pdf URL: https://arxiv.org/pdf/2503.16420
Copy Paste: [[2503.16420]] SynCity: Training-Free Generation of 3D Worlds(https://arxiv.org/abs/2503.16420)
Keywords: generation, generative
Abstract: We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
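Abstract (translated from Chinese): We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach that leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and appearance of scenes. The world is generated tile by tile, and each new tile is generated within its world context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.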

Title: MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

Authors: Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.16421
Pdf URL: https://arxiv.org/pdf/2503.16421
Copy Paste: [[2503.16421]] MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance(https://arxiv.org/abs/2503.16421)
Keywords: generation
Abstract: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at this https URL.
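Abstract (translated from Chinese): Recent advances in video generation have brought remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and degraded visual quality. Furthermore, these methods support trajectory control in only a single format, which limits their applicability in diverse scenarios. In addition, there is no publicly available dataset or benchmark tailored specifically to trajectory-controllable video generation, which hinders robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along the defined trajectories while maintaining object consistency and visual quality. We also present MagicData, a large-scale trajectory-controlled video dataset, together with an automated pipeline for annotation and filtering, and MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments show that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at this https URL.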

Title: Tokenize Image as a Set

Authors: Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16425
Pdf URL: https://arxiv.org/pdf/2503.16425
Copy Paste: [[2503.16425]] Tokenize Image as a Set(https://arxiv.org/abs/2503.16425)
Keywords: generation
Abstract: This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion--the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance--enabling effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at this https URL.
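Abstract (translated from Chinese): This paper proposes a fundamentally new paradigm for image generation based on set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation that dynamically allocates coding capacity according to regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Furthermore, we propose Fixed-Sum Discrete Diffusion, the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance, which enables effective set distribution modeling. Experiments demonstrate our method's superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at this https URL.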
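Note (illustrative sketch, not taken from the paper): one natural reading of "bijectively converts sets into fixed-length integer sequences with summation constraints" is a count vector over the codebook. Entry i stores how many times token i occurs, so the vector length equals the codebook size and its entries sum to the number of tokens. This reading, the function names, and the toy codebook size are assumptions, since the abstract does not spell out the transformation.

```python
def set_to_counts(tokens, codebook_size):
    """Map an unordered collection of token ids to a count vector.

    The result always has length `codebook_size` and sums to len(tokens),
    regardless of the order in which the tokens are listed.
    """
    counts = [0] * codebook_size
    for t in tokens:
        counts[t] += 1
    return counts


def counts_to_set(counts):
    """Inverse map: expand a count vector back to token ids, in sorted order
    (any order represents the same unordered set)."""
    return [t for t, c in enumerate(counts) for _ in range(c)]
```

Because permuting the input leaves the count vector unchanged, the representation is order-free by construction. The fixed sum is the kind of invariant a fixed-sum diffusion process would need to preserve.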

Title: Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Authors: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.16430
Pdf URL: https://arxiv.org/pdf/2503.16430
Copy Paste: [[2503.16430]] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation(https://arxiv.org/abs/2503.16430)
Keywords: generation
Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: this https URL.
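Abstract (translated from Chinese): Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. Token representation faces a fundamental dilemma: discrete tokens allow straightforward modeling with a standard cross-entropy loss but suffer from information loss and unstable tokenizer training, whereas continuous tokens preserve visual details better but require complex distribution modeling, which complicates the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by keeping the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from tokenizer training through post-training quantization, which obtains discrete tokens directly from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that discretizes each feature dimension independently, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work shows that bridging discrete and continuous paradigms can effectively harness the strengths of both, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: this https URL.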
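Note (illustrative sketch, not taken from the paper): dimension-wise post-training quantization can be pictured as binning each feature dimension of a continuous token on its own, with no learned codebook. The uniform bins, the clipping range [-1, 1], and the function names are assumptions for this example. The abstract does not specify the actual quantizer.

```python
def quantize_dimwise(features, levels, lo=-1.0, hi=1.0):
    """Quantize each feature dimension independently into `levels` uniform bins.

    Values outside [lo, hi] are clipped first. Returns one bin index per
    dimension, each in range(levels).
    """
    step = (hi - lo) / levels
    indices = []
    for x in features:
        x = min(max(x, lo), hi)
        # The upper edge hi would map to index `levels`, so clamp it down.
        indices.append(min(int((x - lo) / step), levels - 1))
    return indices


def dequantize_dimwise(indices, levels, lo=-1.0, hi=1.0):
    """Map bin indices back to bin centers."""
    step = (hi - lo) / levels
    return [lo + (i + 0.5) * step for i in indices]
```

With `levels` bins per dimension and d dimensions, a token takes one of levels**d joint values. This illustrates the "large token space" that the abstract says is handled by a lightweight autoregressive predictor rather than a single flat classifier.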