2025-03-27

Title: Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation

Authors: Zhiyao Ren, Yibing Zhan, Baosheng Yu, Dacheng Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19937
Pdf URL: https://arxiv.org/pdf/2503.19937
Copy Paste: [[2503.19937]] Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation(https://arxiv.org/abs/2503.19937)
Keywords: generation
Abstract: Text-to-image generation has become increasingly popular, but achieving the desired images often requires extensive prompt engineering. In this paper, we explore how to decode textual prompts from reference images, a process we refer to as image reverse prompt engineering. This technique enables us to gain insights from reference images, understand the creative processes of great artists, and generate impressive new images. To address this challenge, we propose a method known as automatic reverse prompt optimization (ARPO). Specifically, our method refines an initial prompt into a high-quality prompt through an iterative, imitative gradient prompt optimization process: 1) generating a recreated image from the current prompt to instantiate its guidance capability; 2) producing textual gradients, which are candidate prompts intended to reduce the difference between the recreated image and the reference image; 3) updating the current prompt with textual gradients using a greedy search method to maximize the CLIP similarity between the prompt and the reference image. We compare ARPO with several baseline methods, including handcrafted techniques, gradient-based prompt tuning methods, image captioning, and data-driven selection methods. Both quantitative and qualitative results demonstrate that ARPO converges quickly to generate high-quality reverse prompts. More importantly, we can easily create novel images with diverse styles and content by directly editing these reverse prompts. Code will be made publicly available.
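The greedy update in step 3 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `toy_similarity` is a bag-of-words cosine score standing in for CLIP similarity, `propose` is a hypothetical generator standing in for the textual gradients of step 2, and image generation (step 1) is omitted entirely. It is not the authors' implementation.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_similarity(prompt: str, reference: Counter) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for CLIP similarity."""
    words = Counter(prompt.lower().split())
    dot = sum(count * reference[word] for word, count in words.items())
    norm = math.sqrt(sum(c * c for c in words.values())) * math.sqrt(
        sum(c * c for c in reference.values())
    )
    return dot / norm if norm else 0.0


def greedy_step(current: str, candidates: List[str],
                score: Callable[[str], float]) -> str:
    """Keep the current prompt unless some candidate scores strictly higher."""
    return max([current, *candidates], key=score)


def optimize(prompt: str, propose: Callable[[str], List[str]],
             score: Callable[[str], float], steps: int = 3) -> str:
    """Repeat propose-then-select until no candidate improves the score."""
    for _ in range(steps):
        updated = greedy_step(prompt, propose(prompt), score)
        if updated == prompt:
            break  # no candidate beat the current prompt
        prompt = updated
    return prompt
```

In a real pipeline the score would come from an image-text model and the candidates from a language model comparing the recreated and reference images; the selection logic itself stays this simple.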

Title: Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals

Authors: Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Jiajun Wu, Daniel LK Yamins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19953
Pdf URL: https://arxiv.org/pdf/2503.19953
Copy Paste: [[2503.19953]] Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals(https://arxiv.org/abs/2503.19953)
Keywords: generation
Abstract: Estimating motion in videos is an essential computer vision problem with many downstream applications, including controllable video generation and robotics. Current solutions are primarily trained using synthetic data or require tuning of situation-specific heuristics, which inherently limits these models' capabilities in real-world contexts. Despite recent developments in large-scale self-supervised learning from videos, leveraging such representations for motion estimation remains relatively underexplored. In this work, we develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. Opt-CWM works by learning to optimize counterfactual probes that extract motion information from a base video model, avoiding the need for fixed heuristics while training on unrestricted video inputs. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.

Title: The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs

Authors: Jonathan Sauder, Viktor Domazetoski, Guilhem Banc-Prandi, Gabriela Perna, Anders Meibom, Devis Tuia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20000
Pdf URL: https://arxiv.org/pdf/2503.20000
Copy Paste: [[2503.20000]] The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs(https://arxiv.org/abs/2503.20000)
Keywords: restoration
Abstract: Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools have been impeded by the lack of large, high-quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new, challenging domain that requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

Title: Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang (Dennis) Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20084
Pdf URL: https://arxiv.org/pdf/2503.20084
Copy Paste: [[2503.20084]] Can Multi-modal (reasoning) LLMs work as deepfake detectors?(https://arxiv.org/abs/2503.20084)
Keywords: generative
Abstract: Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, including OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 sonnet. We benchmark the 12 latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising zero-shot generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the rest of the LLM families perform extremely disappointingly, with some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not contribute to performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

Title: Look Before Leap: Look-Ahead Planning with Uncertainty in Reinforcement Learning

Authors: Yongshuai Liu, Xin Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20139
Pdf URL: https://arxiv.org/pdf/2503.20139
Copy Paste: [[2503.20139]] Look Before Leap: Look-Ahead Planning with Uncertainty in Reinforcement Learning(https://arxiv.org/abs/2503.20139)
Keywords: generation
Abstract: Model-based reinforcement learning (MBRL) has demonstrated superior sample efficiency compared to model-free reinforcement learning (MFRL). However, the presence of inaccurate models can introduce biases during policy learning, resulting in misleading trajectories. The challenge lies in obtaining accurate models due to limited diverse training data, particularly in regions with limited visits (uncertain regions). Existing approaches passively quantify uncertainty after sample generation, failing to actively collect uncertain samples that could enhance state coverage and improve model accuracy. Moreover, MBRL often faces difficulties in making accurate multi-step predictions, thereby impacting overall performance. To address these limitations, we propose a novel framework for uncertainty-aware policy optimization with model-based exploratory planning. In the model-based planning phase, we introduce an uncertainty-aware k-step lookahead planning approach to guide action selection at each step. This process involves a trade-off analysis between model uncertainty and value function approximation error, effectively enhancing policy performance. In the policy optimization phase, we leverage an uncertainty-driven exploratory policy to actively collect diverse training samples, resulting in improved model accuracy and overall performance of the RL agent. Our approach offers flexibility and applicability to tasks with varying state/action spaces and reward structures. We validate its effectiveness through experiments on challenging robotic manipulation tasks and Atari games, surpassing state-of-the-art methods with fewer interactions, thereby leading to significant performance improvements.

Title: AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions

Authors: Xianke Qiang, Zheng Chang, Ying-Chang Liang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20166
Pdf URL: https://arxiv.org/pdf/2503.20166
Copy Paste: [[2503.20166]] AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions(https://arxiv.org/abs/2503.20166)
Keywords: generative
Abstract: Federated learning (FL) can fully leverage large-scale terminal data while ensuring privacy and security, and is considered a distributed alternative to centralized machine learning. However, data heterogeneity limits FL's performance. To address this challenge, artificial intelligence-generated content (AIGC), an innovative data synthesis technique, emerges as one potential solution. In this article, we first provide an overview of the system architecture, performance metrics, and challenges associated with AIGC-assisted FL system design. We then propose the Generative federated learning (GenFL) architecture and present its workflow, including the design of aggregation and weight policy. Finally, using the CIFAR10 and CIFAR100 datasets, we employ diffusion models to generate datasets and improve FL performance. Experiments conducted under various non-independent and identically distributed (non-IID) data distributions demonstrate the effectiveness of GenFL in overcoming the bottlenecks in FL caused by data heterogeneity. Open research directions in AIGC-assisted FL are also discussed.
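The abstract mentions an aggregation and weight policy without specifying it. As a point of reference only, the sketch below shows the standard sample-count-weighted averaging used in federated averaging (FedAvg), written in plain Python. The function name, the dictionary-of-lists parameter layout, and the choice of weights are assumptions for illustration; GenFL's actual policy may differ.

```python
from typing import Dict, List, Sequence


def weighted_average(
    client_params: Sequence[Dict[str, List[float]]],
    client_weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Average per-client parameters, weighting each client by its share.

    client_weights are typically local sample counts (an assumption here);
    they are normalized so the shares sum to 1.
    """
    total = float(sum(client_weights))
    if total <= 0:
        raise ValueError("client weights must sum to a positive value")
    shares = [w / total for w in client_weights]

    aggregated: Dict[str, List[float]] = {}
    for name in client_params[0]:
        # Walk the same parameter position across all clients.
        columns = zip(*(params[name] for params in client_params))
        aggregated[name] = [
            sum(share * value for share, value in zip(shares, column))
            for column in columns
        ]
    return aggregated
```

A client holding three times as many samples as another pulls the global parameters three times as hard, which is one reason skewed (non-IID) clients can dominate the aggregate.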

Title: Guiding Human-Object Interactions with Rich Geometry and Relations

Authors: Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, Changxing Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20172
Pdf URL: https://arxiv.org/pdf/2503.20172
Copy Paste: [[2503.20172]] Guiding Human-Object Interactions with Rich Geometry and Relations(https://arxiv.org/abs/2503.20172)
Keywords: generation
Abstract: Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.

Title: Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Authors: Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20174
Pdf URL: https://arxiv.org/pdf/2503.20174
Copy Paste: [[2503.20174]] Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration(https://arxiv.org/abs/2503.20174)
Keywords: restoration
Abstract: Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently on uniformly split subspaces, which triggers a redundancy issue that hinders the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.

Title: Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector

Authors: Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, Xiaoming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20188
Pdf URL: https://arxiv.org/pdf/2503.20188
Copy Paste: [[2503.20188]] Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector(https://arxiv.org/abs/2503.20188)
Keywords: generation
Abstract: Deepfake detection is a long-established research topic vital for mitigating the spread of malicious misinformation. Unlike prior methods that provide either binary classification results or textual explanations separately, we introduce a novel method capable of generating both simultaneously. Our method harnesses the multi-modal learning capability of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and explainability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs tailored face forgery prompt learning, incorporating the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates an LLM to provide detailed textual explanations of its detection decisions, enhancing interpretability by bridging the gap between natural language and subtle cues of facial forgeries. Empirically, we evaluate M2F2-Det on both detection and explanation generation tasks, where it achieves state-of-the-art performance, demonstrating its effectiveness in identifying and explaining diverse forgeries.

Title: Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Authors: Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20198
Pdf URL: https://arxiv.org/pdf/2503.20198
Copy Paste: [[2503.20198]] Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models(https://arxiv.org/abs/2503.20198)
Keywords: generation, generative
Abstract: Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generating quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that our model significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, our model opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generating.

Title: Video Motion Graphs

Authors: Haiyang Liu, Zhan Xu, Fa-Ting Hong, Hsin-Ping Huang, Yi Zhou, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20218
Pdf URL: https://arxiv.org/pdf/2503.20218
Copy Paste: [[2503.20218]] Video Motion Graphs(https://arxiv.org/abs/2503.20218)
Keywords: generation, generative
Abstract: We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found at this https URL

Title: DINeMo: Learning Neural Mesh Models with no 3D Annotations

Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20220
Pdf URL: https://arxiv.org/pdf/2503.20220
Copy Paste: [[2503.20220]] DINeMo: Learning Neural Mesh Models with no 3D Annotations(https://arxiv.org/abs/2503.20220)
Keywords: generation
Abstract: Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence using both local appearance features and global context information. Experimental results on car datasets demonstrate that DINeMo outperforms previous zero- and few-shot 3D pose estimation methods by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, demonstrating its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at this https URL.

Title: Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20240
Pdf URL: https://arxiv.org/pdf/2503.20240
Copy Paste: [[2503.20240]] Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models(https://arxiv.org/abs/2503.20240)
Keywords: generation
Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
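The replacement described in the abstract can be sketched in a few lines of Python. The sketch assumes the common guidance form eps = eps_uncond + w * (eps_cond - eps_uncond); the function and argument names are hypothetical, the noise predictions are plain lists of floats rather than tensors, and the paper's exact formulation may differ.

```python
from typing import List


def guided_noise(
    eps_cond_finetuned: List[float],
    eps_uncond_base: List[float],
    guidance_scale: float,
) -> List[float]:
    """Classifier-free guidance combination of two noise predictions.

    The conditional prediction comes from the fine-tuned model, while the
    unconditional prediction is taken from the base model instead of the
    fine-tuned one (the substitution the abstract describes).
    """
    if len(eps_cond_finetuned) != len(eps_uncond_base):
        raise ValueError("noise predictions must have the same length")
    return [
        uncond + guidance_scale * (cond - uncond)
        for cond, uncond in zip(eps_cond_finetuned, eps_uncond_base)
    ]
```

With a guidance scale of 1 the result reduces to the conditional prediction alone, so the quality of the unconditional branch only matters once the scale moves away from 1, which is the regime where guidance is normally used.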

Title: ViLBench: A Suite for Vision-Language Process Reward Modeling

Authors: Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.20271
Pdf URL: https://arxiv.org/pdf/2503.20271
Copy Paste: [[2503.20271]] ViLBench: A Suite for Vision-Language Process Reward Modeling(https://arxiv.org/abs/2503.20271)
Keywords: generation
Abstract: Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks, revealing that neither ORM nor PRM consistently outperforms the other across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating how challenging the benchmark is for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at this https URL with our code, model, and data.

Title: RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process

Authors: Kaifan Sun, Bingchen Yang, Peter Wonka, Jun Xiao, Haiyong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20289
Pdf URL: https://arxiv.org/pdf/2503.20289
Copy Paste: [[2503.20289]] RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process(https://arxiv.org/abs/2503.20289)
Keywords: generation, generative
Abstract: The generation of indoor furniture layouts has significant applications in augmented reality, smart homes, and architectural design. Successful furniture arrangement requires proper physical relationships (e.g., collision avoidance) and spacing relationships between furniture and their functional zones to be respected. However, manually defined relationships are almost always incomplete and can produce unrealistic layouts. This work instead extracts spacing relationships automatically based on a hierarchical analysis and adopts Delaunay triangulation to produce important triple relationships. Compared to pairwise relationship modeling, triple relationships account for interactions and space utilization among multiple objects. To this end, we introduce RelTriple, a novel approach that enhances furniture distribution by learning spacing relationships between objects and regions. We formulate triple relationships as object-to-object (O2O) losses and object-to-region (O2R) losses and integrate them directly into the training process of generative diffusion. Our approach consistently improves over existing state-of-the-art methods in visual results and evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, achieving at least 12% on the introduced spatial relationship metric and superior spatial coherence and practical usability.

Title: Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

Authors: Yuhan Wang, Suzhi Bi, Ying-Jun Angela Zhang, Xiaojun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20297
Pdf URL: https://arxiv.org/pdf/2503.20297
Copy Paste: [[2503.20297]] Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model(https://arxiv.org/abs/2503.20297)
Keywords: restoration, generative
Abstract: The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.
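
A minimal sketch of the tradeoff in the simplest possible setting, assuming a scalar signal x ~ N(0, 1) observed as y = x + n with n ~ N(0, noise_var). The estimator adds scaled posterior noise to the posterior mean, x_hat = mean + scale * sqrt(var) * eps, so scale = 0 gives the MMSE estimate and scale = 1 gives a posterior sample. The function name `tradeoff_point`, the `scale` knob, and the choice of squared Wasserstein-2 distance as the perception measure are illustrative assumptions; this is not the paper's variance-scaled reverse diffusion process and makes no claim about its optimality result.

```python
import math


def tradeoff_point(scale: float, noise_var: float) -> tuple[float, float]:
    """Closed-form (MSE, squared W2 gap) for x_hat = mean + scale * sqrt(var) * eps.

    Assumes x ~ N(0, 1), y = x + n, n ~ N(0, noise_var), eps ~ N(0, 1).
    """
    # Posterior variance of x given y for the scalar Gaussian model.
    post_var = noise_var / (1.0 + noise_var)
    # Distortion: MMSE error plus the variance of the injected noise.
    mse = post_var * (1.0 + scale ** 2)
    # Marginal variance of the estimate: variance of the posterior mean
    # plus the injected noise variance.
    est_var = 1.0 / (1.0 + noise_var) + scale ** 2 * post_var
    # Squared Wasserstein-2 distance between N(0, 1) and N(0, est_var).
    w2_sq = (1.0 - math.sqrt(est_var)) ** 2
    return mse, w2_sq
```

With noise_var = 1, scale = 0 yields an MSE of 0.5 with a nonzero distribution gap, while scale = 1 doubles the MSE to 1.0 and closes the gap; intermediate values of scale trace points between these two ends.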

Title: Wan: Open and Advanced Large-Scale Video Generative Models

Authors: WanTeam: Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20314
Pdf URL: https://arxiv.org/pdf/2503.20314
Copy Paste: [[2503.20314]] Wan: Open and Advanced Large-Scale Video Generative Models(https://arxiv.org/abs/2503.20314)
Keywords: generation, generative
Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at this https URL.

Title: Progressive Focused Transformer for Single Image Super-Resolution

Authors: Wei Long, Xingyu Zhou, Leheng Zhang, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20337
Pdf URL: https://arxiv.org/pdf/2503.20337
Copy Paste: [[2503.20337]] Progressive Focused Transformer for Single Image Super-Resolution(https://arxiv.org/abs/2503.20337)
Keywords: super-resolution
Abstract: Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.
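
The idea of filtering out irrelevant tokens before computing similarities can be sketched as follows. This is one plausible reading of the abstract, not the PFT implementation: the attention weights from a previous layer are assumed to be available, only the `keep` highest-weighted keys are retained, and dot-product similarities are computed for those keys alone. The function name `focused_attention` and all parameters are hypothetical.

```python
import math


def focused_attention(query, keys, values, prev_weights, keep):
    """Single-query attention restricted to the keys a previous layer ranked highest."""
    # Indices of the `keep` keys with the largest previous-layer weights.
    kept = sorted(range(len(keys)), key=lambda i: prev_weights[i], reverse=True)[:keep]
    # Similarities are computed only for the retained keys.
    scores = {i: sum(q * k for q, k in zip(query, keys[i])) for i in kept}
    # Numerically stable softmax over the retained keys.
    top = max(scores.values())
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(keys))]
    # Weighted sum of the retained value vectors.
    dim = len(values[0])
    output = [sum(weights[i] * values[i][d] for i in kept) for d in range(dim)]
    return output, weights
```

Under this sketch the cost of the similarity step scales with `keep` rather than with the total number of keys, which is the kind of saving the abstract describes.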

Title: Consistency Trajectory Matching for One-Step Generative Super-Resolution

Authors: Weiyi You, Mingyang Zhang, Leheng Zhang, Kexuan Shi, Xingyu Zhou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20349
Pdf URL: https://arxiv.org/pdf/2503.20349
Copy Paste: [[2503.20349]] Consistency Trajectory Matching for One-Step Generative Super-Resolution(https://arxiv.org/abs/2503.20349)
Keywords: super-resolution, generative
Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model to that of the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the need for a pre-trained diffusion model. To further enhance the performance and better leverage the ground truth during the training process, we aim to align the distribution of SR results more closely with that of natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

Title: FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies

Authors: Tianqi He, Xiaohan Huang, Yi Du, Qingqing Long, Ziyue Qiao, Min Wu, Yanjie Fu, Yuanchun Zhou, Meng Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20394
Pdf URL: https://arxiv.org/pdf/2503.20394
Copy Paste: [[2503.20394]] FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies(https://arxiv.org/abs/2503.20394)
Keywords: generative
Abstract: Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflows by minimizing human involvement. However, three challenges remain in those frameworks: (1) They predominantly depend on downstream task performance metrics, whose assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations can hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning processes or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies. We first decouple the feature transformation evaluation from the outcomes of the generated datasets via a performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.

Title: Active Data Sampling and Generation for Bias Remediation

Authors: Antonio Maratea, Rita Perna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20414
Pdf URL: https://arxiv.org/pdf/2503.20414
Copy Paste: [[2503.20414]] Active Data Sampling and Generation for Bias Remediation(https://arxiv.org/abs/2503.20414)
Keywords: generation
Abstract: Adequate sampling space coverage is the keystone to effectively train trustworthy Machine Learning models. Unfortunately, real data do carry several inherent risks due to the many potential biases they exhibit when gathered without a proper random sampling over the reference population, and most of the time such sampling is far too expensive or time-consuming to be a viable option. Depending on how training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder large-scale applicability of pre-trained models and undermine their truthfulness or fairness expectations. In this paper, a mixed active sampling and data generation strategy -- called samplation -- is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling schema. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated and finally a number of reversely-biased artificial samples are generated for the fine-tuning of the model. Using deep models for visual semantic role labeling as a case study, the proposed method has been able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and with a minor effect on accuracy.

Title: Latent Beam Diffusion Models for Decoding Image Sequences

Authors: Guilherme Fernandes, Vasco Ramos, Regev Cohen, Idan Szpektor, João Magalhães
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20429
Pdf URL: https://arxiv.org/pdf/2503.20429
Copy Paste: [[2503.20429]] Latent Beam Diffusion Models for Decoding Image Sequences(https://arxiv.org/abs/2503.20429)
Keywords: generation
Abstract: While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
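
The search-and-prune loop described above can be sketched with a generic beam search. This is a minimal illustration, not the paper's decoder: `candidates` stands in for a pool of latent choices available at each step, and `score` is a hypothetical callable that rates how well a candidate continues a partial sequence (the paper uses a cross-attention mechanism for this role).

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]


def beam_search(
    num_steps: int,
    candidates: Sequence[int],
    score: Callable[[Path, int], float],
    beam_width: int,
) -> List[Tuple[Path, float]]:
    """Keep the `beam_width` best partial sequences after each step."""
    beams: List[Tuple[Path, float]] = [((), 0.0)]
    for _ in range(num_steps):
        # Expand every surviving path with every candidate.
        expanded = [
            (path + (cand,), total + score(path, cand))
            for path, total in beams
            for cand in candidates
        ]
        # Prune: retain only the highest-scoring paths.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


def toy_score(path: Path, cand: int) -> float:
    """Toy coherence score: prefer each pick to sit 3 above the previous one."""
    target = path[-1] + 3 if path else 3
    return -abs(cand - target)
```

Calling `beam_search(3, [0, 3, 6, 9], toy_score, beam_width=2)` keeps two partial sequences alive per step and ranks `(3, 6, 9)` first, since it is the only sequence that matches the toy preference at every step.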

Title: Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Authors: Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20483
Pdf URL: https://arxiv.org/pdf/2503.20483
Copy Paste: [[2503.20483]] Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability(https://arxiv.org/abs/2503.20483)
Keywords: generation
Abstract: Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Authors: Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20491
Pdf URL: https://arxiv.org/pdf/2503.20491
Copy Paste: [[2503.20491]] VPO: Aligning Text-to-Video Generation Models with Prompt Optimization(https://arxiv.org/abs/2503.20491)
Keywords: generation
Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at this https URL.

Title: Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

Authors: Fanhu Zeng, Zhen Cheng, Fei Zhu, Xu-Yao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20492
Pdf URL: https://arxiv.org/pdf/2503.20492
Copy Paste: [[2503.20492]] Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models(https://arxiv.org/abs/2503.20492)
Keywords: generation
Abstract: Reliable prediction by classifiers is crucial for their deployment in high-security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision-language models (VLMs), leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLMs, we construct FSMisD, a Few-Shot prompt learning framework for MisD that avoids training from scratch and thereby improves tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.

Title: Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications

Authors: Mahya Nikouei, Bita Baroutian, Shahabedin Nabavi, Fateme Taraghi, Atefe Aghaei, Ayoob Sajedi, Mohsen Ebrahimi Moghaddam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20516
Pdf URL: https://arxiv.org/pdf/2503.20516
Copy Paste: [[2503.20516]] Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications(https://arxiv.org/abs/2503.20516)
Keywords: super-resolution, generation
Abstract: Small object detection (SOD) is a critical yet challenging task in computer vision, with applications spanning surveillance, autonomous systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, making accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advancements in SOD using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyzed challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advancements in deep learning have introduced innovative solutions, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Additionally, improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation issues. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained environments like Unmanned Aerial Vehicles (UAV)-based surveillance and edge computing. We also review widely used datasets, along with standard evaluation metrics such as mean Average Precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture. Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.
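
The size-specific AP scores mentioned above rely on assigning each object to a size bucket before AP is computed per bucket. A minimal sketch, assuming the COCO convention of area thresholds at 32² and 96² pixels; the function names are hypothetical, box area is used here as a stand-in for the segmentation area that COCO annotations carry, and the half-open boundary handling is a simplification.

```python
from collections import Counter
from typing import Iterable, Tuple

SMALL_MAX_AREA = 32 ** 2   # COCO convention: "small" objects fall below 32x32 pixels
MEDIUM_MAX_AREA = 96 ** 2  # "medium" objects fall below 96x96 pixels


def size_bucket(width: float, height: float) -> str:
    """Assign a box to the small / medium / large bucket by pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def bucket_counts(boxes: Iterable[Tuple[float, float]]) -> Counter:
    """Count how many (width, height) boxes land in each size bucket."""
    return Counter(size_bucket(w, h) for w, h in boxes)
```

Counting buckets this way on a dataset's ground-truth boxes is a quick check of how much of the benchmark actually exercises small-object performance.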

Title: MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Authors: Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20519
Pdf URL: https://arxiv.org/pdf/2503.20519
Copy Paste: [[2503.20519]] MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation(https://arxiv.org/abs/2503.20519)
Keywords: generation, generative
Abstract: Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

Title: GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Authors: Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, Gianluca Corrado
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20523
Pdf URL: https://arxiv.org/pdf/2503.20523
Copy Paste: [[2503.20523]] GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving(https://arxiv.org/abs/2503.20523)
Keywords: generation, generative
Abstract: Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at this https URL.

Title: TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration

Authors: Ziying Zhang, Xiang Gao, Zhixin Wang, Qiang Hu, Xiaoyun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20537
Pdf URL: https://arxiv.org/pdf/2503.20537
Copy Paste: [[2503.20537]] TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration(https://arxiv.org/abs/2503.20537)
Keywords: restoration, generation, generative
Abstract: Diffusion-based methodologies have shown significant potential in blind face restoration (BFR), leveraging their robust generative capabilities. However, they are often criticized for two significant problems: 1) slow training and inference speed, and 2) inadequate recovery of fine-grained facial details. To address these problems, we propose a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR), a three-stage paradigm tailored for the progressive resolution of degraded images. Specifically, TD-BFR utilizes an innovative truncated sampling method, starting from low-quality (LQ) images at low resolution to enhance sampling speed, and then introduces an adaptive degradation removal module to handle unknown degradations and connect the generation processes across different resolutions. Additionally, we further adapt the priors of pre-trained diffusion models to recover rich facial details. Our method efficiently restores high-quality images in a coarse-to-fine manner, and experimental results demonstrate that TD-BFR is, on average, 4.75× faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive quality.
摘要：基于扩散的方法论表现出在盲人恢复（BFR）中的巨大潜力，利用了其强大的生成能力。但是，他们经常因两个重大问题而受到批评：1）缓慢的训练和推理速度，以及2）恢复精细的面部细节不足。为了解决这些问题，我们提出了一个新型的截断扩散模型，以实现有效的盲人面部恢复（TD-BFR），这是一种量身定制的三阶段范式，用于逐步解决降级图像。具体而言，TD-BFR利用了一种创新的截短采样方法，从低分辨率的低质量（LQ）图像开始，以提高采样速度，然后引入一个自适应退化拆卸模块，以处理未知的降解，然后连接不同分辨率的生成过程。此外，我们进一步调整了预训练的扩散模型的先验，以恢复丰富的面部细节。我们的方法以粗略的方式有效地恢复了高质量的图像，实验结果表明，TD-BFR平均\ textbf {4.75 $ \ times $}比当前基于先进的基于原始扩散的BFR方法快，同时保持竞争性质量。

Title: Diffusion Counterfactuals for Image Regressors

Authors: Trung Duc Ha, Sidney Bender
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20595
Pdf URL: https://arxiv.org/pdf/2503.20595
Copy Paste: [[2503.20595]] Diffusion Counterfactuals for Image Regressors(https://arxiv.org/abs/2503.20595)
Keywords: generative
Abstract: Counterfactual explanations have been successfully applied to create human-interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and revealing spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel-space counterfactuals are more sparse, while latent-space counterfactuals are of higher quality and allow bigger semantic changes.
摘要：反事实解释已成功地用于为各种黑盒模型创建人类可解释的解释。它们适合图像域中的任务，其中解释的质量受益于生成模型的最新进展。尽管反事实解释已被广泛应用于分类模型，但它们在回归任务中的应用仍未得到充满反感。我们提出了两种方法，以使用基于扩散的生成模型来为图像回归任务创建反事实解释，以解决稀疏性和质量中的挑战：1）基于一种基于Deo的扩散概率模型，该模型直接在像素空间中运行，而2）另一个基于潜在的扩散自动配置器在潜在的空间中运行的扩散自动配置器。两者都在Celeba-HQ和合成数据集上产生现实，语义和平滑的反事实，并为回归模型的决策过程提供了易于解释的见解，并揭示了虚假的相关性。我们发现，对于回归反事实，特征的变化取决于预测值的区域。对于预测值的重大变化，需要进行大规模的语义变化，这使得与分类器相比，很难找到稀疏的反事实。此外，像素空间反事实更稀疏，而潜在空间反事实则具有更高的质量，并且可以更大的语义变化。

Title: MMGen: Unified Multi-modal Image Generation and Understanding in One Go

Authors: Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20644
Pdf URL: https://arxiv.org/pdf/2503.20644
Copy Paste: [[2503.20644]] MMGen: Unified Multi-modal Image Generation and Understanding in One Go(https://arxiv.org/abs/2503.20644)
Keywords: generation, generative
Abstract: A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.
摘要：多模式生成和理解的统一扩散框架具有实现无缝和可控图像扩散和其他跨模式任务的变革潜力。在本文中，我们引入了MMGEN，这是一个将多个生成任务集成到单个扩散模型中的统一框架。这包括：（1）多模式类别条件生成，其中多模式输出是通过单个推理过程同时生成的，给定的类别信息；（2）多模式的视觉理解，可以准确预测RGB图像的深度，表面正态和分割图；（3）多模式条件的生成，该生成基于特定的模态条件和其他对齐方式产生相应的RGB图像。我们的方法开发了一种新颖的扩散变压器，该变压器灵活地支持多模式输出，以及一个简单的模态解码策略，以统一各种任务。广泛的实验和应用表明，MMGER在各种任务和条件下的有效性和优势，强调了其需要同时产生和理解的应用潜力。

Title: Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification

Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20652
Pdf URL: https://arxiv.org/pdf/2503.20652
Copy Paste: [[2503.20652]] Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification(https://arxiv.org/abs/2503.20652)
Keywords: generation
Abstract: The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.
摘要：计算机断层扫描（CT）扫描检查数量的迅速增加，迫切需要自动化工具，例如器官分割，异常分类和报告生成，以帮助放射学家的工作量不断增长。三维（3D）CT扫描的多标签分类是一项具有挑战性的任务，因为数据的体积性质和要检测到的异常情况。基于卷积神经网络（CNN）的现有深度学习方法难以有效地捕获长期依赖性，而视觉变形金刚则需要广泛的预训练，对实际使用构成挑战。此外，这些现有方法在滚动CT扫描切片时，并未明确地对放射科医生的导航行为进行建模，这既需要全球上下文理解和本地细节意识。在这项研究中，我们提出了CT-Scroll，这是一种新型的全球环境注意模型，专门旨在模仿放射学家在3D CT扫描分析过程中的滚动行为。我们的方法在两个公共数据集上进行了评估，并通过全面的实验和消融研究表明了其功效，该研究突出了每个模型组件的贡献。

Title: AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports

Authors: Xiangwen Zhang, Qian Zhang, Longfei Han, Qiang Qu, Xiaoming Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20654
Pdf URL: https://arxiv.org/pdf/2503.20654
Copy Paste: [[2503.20654]] AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports(https://arxiv.org/abs/2503.20654)
Keywords: generation
Abstract: Collecting real-world vehicle accident videos for autonomous driving research is challenging due to their rarity and complexity. While existing driving video generation methods may produce visually realistic videos, they often fail to deliver physically realistic simulations because they lack the capability to generate accurate post-collision trajectories. In this paper, we introduce AccidentSim, a novel framework that generates physically realistic vehicle collision videos by extracting and utilizing the physical clues and contextual information available in real-world vehicle accident reports. Specifically, AccidentSim leverages a reliable physical simulator to replicate post-collision vehicle trajectories from the physical and contextual information in the accident reports and to build a vehicle collision trajectory dataset. This dataset is then used to fine-tune a language model, enabling it to respond to user prompts and predict physically consistent post-collision trajectories across various driving scenarios based on user descriptions. Finally, we employ Neural Radiance Fields (NeRF) to render high-quality backgrounds, merging them with the foreground vehicles that exhibit physically realistic trajectories to generate vehicle collision videos. Experimental results demonstrate that the videos produced by AccidentSim excel in both visual and physical authenticity.
摘要：由于其稀有性和复杂性，收集现实世界中的车祸视频进行自主驾驶研究是具有挑战性的。尽管现有的驾驶视频生成方法可能会产生视觉上现实的视频，但它们通常无法提供物理逼真的模拟，因为它们缺乏产生准确的碰撞后轨迹的能力。在本文中，我们介绍了事故，这是一个新颖的框架，该框架通过提取和利用现实世界中车辆事故报告中可用的物理线索和上下文信息来生成物理逼真的车辆碰撞视频。具体而言，事故限制在事故报告中的物理和上下文信息中利用可靠的物理模拟器来复制碰撞后车辆轨迹，并构建车辆碰撞轨迹数据集。然后，该数据集用于微调语言模型，使其能够响应用户提示，并根据用户描述在各种驾驶场景中预测物理一致的碰撞后轨迹。最后，我们采用神经辐射场（NERF）来渲染高质量的背景，并将它们与前景车辆合并，这些车辆表现出物理逼真的轨迹以产生车辆碰撞视频。实验结果表明，事故中excel产生的视频在视觉和物理真实性中均具有出色的作用。

Title: ARMO: Autoregressive Rigging for Multi-Category Objects

Authors: Mingze Sun, Shiwei Mao, Keyi Chen, Yurun Chen, Shunlin Lu, Jingbo Wang, Junting Dong, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20663
Pdf URL: https://arxiv.org/pdf/2503.20663
Copy Paste: [[2503.20663]] ARMO: Autoregressive Rigging for Multi-Category Objects(https://arxiv.org/abs/2503.20663)
Keywords: generation, generative
Abstract: Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.
摘要：大规模生成模型的最新进展显着提高了3D形状生成的质量和多样性。但是，大多数现有的方法主要侧重于产生静态3D模型，忽视某些形状的潜在动态性质，例如人形，动物和昆虫。为了解决这一差距，我们专注于索具，这是动画中的一项基本任务，它为3D模型建立了骨骼结构和皮肤。在本文中，我们介绍了Omnirig，这是第一个大型索具数据集，其中包括79,499个网眼，带有详细的骨骼和皮肤信息。与依赖于预定义标准姿势（例如A-Pose，T-Pose）的传统基准分析不同，我们的数据集包含各种形状类别，样式和姿势。利用这个丰富的数据集，我们提出了Armo，这是一个新型的操纵框架，它利用自回归模型以统一的方式预测关节位置和连接关系。通过将骨骼结构视为完整的图形并将其离散到令牌中，我们使用自动编码器对关节进行编码，以获得潜在的嵌入和自动回忆模型以预测令牌。网格条件的潜扩散模型用于预测条件骨架产生的潜在嵌入。我们的方法解决了基于回归的方法的局限性，这些方法通常会遭受误差积累和次优连接估计的影响。通过对Omnirig数据集的广泛实验，我们的方法在骨骼预测中实现了最新的性能，证明了各种对象类别的概括提高了。该代码和数据集将在接受后公开供学术使用。

Title: BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Authors: Yuyang Peng, Shishi Xiao, Keming Wu, Qisheng Liao, Bohan Chen, Kevin Lin, Danqing Huang, Ji Li, Yuhui Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20672
Pdf URL: https://arxiv.org/pdf/2503.20672
Copy Paste: [[2503.20672]] BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation(https://arxiv.org/abs/2503.20672)
Keywords: generation
Abstract: Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made significant progress in sentence-level visual text rendering. In this paper, we focus on the more challenging scenarios of article-level visual text rendering and address a novel task of generating high-quality business content, including infographics and slides, based on user-provided article-level descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly longer context lengths and the scarcity of high-quality business content data. In contrast to most previous works that focus on a limited number of sub-regions and sentence-level prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in business content is far more challenging. We make two key technical contributions: (i) the construction of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region flexibly during inference using a layout-conditional CFG. We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage the broader community to advance the progress of business content generation.
摘要：最近，最先进的文本到图像生成模型，例如Flux和Isex 2.0，在句子级的视觉文本渲染方面取得了重大进展。在本文中，我们关注文章级视觉文本渲染的更具挑战性的场景，并根据用户提供的文章级描述提示和超密集的布局来解决生成高质量业务内容的新任务，包括信息图表和幻灯片。基本挑战是双重的：明显更长的上下文长度和高质量业务内容数据的稀缺性。与大多数以前的作品相反，该作品着重于有限的子区域和句子级别的提示，从而确保精确地遵守业务内容中具有数十个甚至数百个子区域的超密集布局，这更具挑战性。我们做出了两个关键的技术贡献：（i）构建可扩展的高质量业务内容数据集，即Infopraphics-650k，配备了超密集的布局和提示，并通过实施层面的检索图表生成方案；（ii）一个布局引导的交叉注意方案，该方案根据超密集的布局注入数十个区域的促进区域潜在空间，并在推理过程中使用有条件的CFG在推理过程中灵活地改进每个子区域。与以前的SOTA系统（例如Flux和SD3）相比，我们证明了系统的强劲结果。此外，我们进行了彻底的消融实验，以验证每个组件的有效性。我们希望我们的构建的信息图表-650k和Bizeval能够鼓励更广泛的社区推进业务内容产生的进步。

Title: Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

Authors: Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20673
Pdf URL: https://arxiv.org/pdf/2503.20673
Copy Paste: [[2503.20673]] Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy(https://arxiv.org/abs/2503.20673)
Keywords: quality assessment
Abstract: The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.
摘要：多模式大语言模型的快速发展导致视觉感知和理解方面的显着进步，将多个任务巩固为单个视觉问题缠绕框架。但是，这些模型容易幻觉，这限制了它们作为人工智能系统的可靠性。尽管此问题在自然语言处理和图像字幕上进行了广泛的研究，但仍缺乏对低级视觉感知和理解（HLPU）幻觉的研究（HLPU），尤其是在图像质量评估任务的背景下。我们认为这些幻觉是由于模型中没有明显的自我意识而产生的。为了解决此问题，我们首先介绍HLPU指令数据库，这是第一个指令数据库，专门针对低级视觉任务中的幻觉。该数据库包含大约200k的问答对，并包含四个子集，每个子集涵盖了不同类型的指令。随后，我们提出了消除自我意识故障（SAFEQA）模型，该模型利用图像功能，显着区域特征和质量特征来提高低级视觉任务中模型的感知和理解能力。此外，我们提出了增强的自我意识偏好优化（ESA-PO）框架，以提高模型对知识边界的认识，从而减轻幻觉的发生率。最后，我们对低级视力任务进行了全面的实验，结果表明，我们提出的方法显着增强了这些任务中模型的自我意识并减少了幻觉。值得注意的是，我们提出的方法在各种评估指标方面提高了所提出的模型的准确性和自我意识，并优于封闭式模型。

Title: GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection

Authors: Xingyu Peng, Si Liu, Chen Gao, Yan Bai, Beipeng Mu, Xiaofei Wang, Huaxia Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20682
Pdf URL: https://arxiv.org/pdf/2503.20682
Copy Paste: [[2503.20682]] GLRD: Global-Local Collaborative Reason and Debate with PSL for 3D Open-Vocabulary Detection(https://arxiv.org/abs/2503.20682)
Keywords: generation
Abstract: The task of LiDAR-based 3D Open-Vocabulary Detection (3D OVD) requires the detector to learn to detect novel objects from point clouds without off-the-shelf training labels. Previous methods focus on the learning of object-level representations and ignore the scene-level information, so it is hard to distinguish objects with similar classes. In this work, we propose a Global-Local Collaborative Reason and Debate with PSL (GLRD) framework for the 3D OVD task, considering both local object-level information and global scene-level information. Specifically, an LLM is utilized to perform common sense reasoning based on object-level and scene-level information, where the detection result is refined accordingly. To further boost the LLM's ability of precise decisions, we also design a probabilistic soft logic solver (OV-PSL) to search for the optimal solution, and a debate scheme to confirm the class of confusable objects. In addition, to alleviate the uneven distribution of classes, a static balance scheme (SBC) and a dynamic balance scheme (DBC) are designed. Furthermore, to reduce the influence of noise in data and training, we propose Reflected Pseudo Labels Generation (RPLG) and Background-Aware Object Localization (BAOL). Extensive experiments conducted on ScanNet and SUN RGB-D demonstrate the superiority of GLRD, where absolute improvements in mean average precision are +2.82% on SUN RGB-D and +3.72% on ScanNet in the partial open-vocabulary setting. In the full open-vocabulary setting, the absolute improvements in mean average precision are +4.03% on ScanNet and +14.11% on SUN RGB-D.
摘要：基于激光雷达的3D开放式摄影检测（3D OVD）的任务要求检测器学会从没有现成的训练标签的点云中检测新对象。先前的方法着眼于学习对象级表示形式并忽略场景级信息，因此很难区分具有相似类别的对象。在这项工作中，我们考虑了3D OVD任务的全球本地协作原因，并与PSL（GLRD）框架进行了辩论，考虑了本地对象级信息和全球场景级信息。具体而言，LLM用于基于对象级别和场景级信息执行常识推理，其中检测结果得到了相应的改进。为了进一步提高LLM的精确决策能力，我们还设计了一个概率软逻辑求解器（OV-PSL）来搜索最佳解决方案，并设计了一个辩论方案，以确认可混淆的对象类别。此外，为了减轻类别的分布，设计了静态平衡方案（SBC）和动态平衡方案（DBC）。此外，为了减少噪声在数据和训练中的影响，我们进一步提出了反映的伪标签生成（RPLG）和背景感知对象定位（BAOL）。在扫描厂和Sun RGB-D上进行的广泛实验证明了GLRD的优势，在Sun RGB-D上，平均平均精度的绝对改进为$+2.82 \％$，在部分开放式vocabulary设置中，扫描仪上的扫描仪上的$+3.72 \％$。在完整的开放式视频计环境中，扫描仪的平均平均精度的绝对改进为$+4.03 \％$，Sun RGB-D的$++14.11 \％$ $。

Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Authors: Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20685
Pdf URL: https://arxiv.org/pdf/2503.20685
Copy Paste: [[2503.20685]] Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound(https://arxiv.org/abs/2503.20685)
Keywords: generation
Abstract: Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves performance comparable to fully-supervised learning algorithms.
摘要：在2D乳房超声（BUS）和3D自动乳房超声（ABUS）中，对结节的准确分割对于临床诊断和治疗计划至关重要。因此，开发用于结节分割的自动化系统可以增强用户独立性并加快临床分析。与完全监督的学习不同，弱监督的分割（WSS）可以简化辛苦而复杂的注释过程。但是，当前的WSS方法在实现精确的结节分割方面面临挑战，因为其中许多依赖于不准确的激活图或效率低下的伪掩膜生成算法。在这项研究中，我们介绍了一种新型的基于多代理学习的WSS框架，称为Flip Learne，该框架仅依赖于2D/3D框以进行准确的分割。具体而言，采用多种代理从框中擦除目标以促进分类标签翻转，而基于擦除的区域则用作预测的分割面罩。这项研究的主要贡献如下：（1）采用基于超级像素/超级氧化的方法来编码标准化环境，捕获边界先验并加快学习过程。（2）引入三个精心设计的奖励，包括分类得分奖励和两个强度分布奖励，以精确地引导代理的擦除过程，从而避免了不足和过度细分。（3）实施渐进的课程学习策略，以使代理商能够以逐渐具有挑战性的方式与环境互动，从而提高学习效率。在大型内部巴士和ABU数据集上，我们的翻转学习方法在大型内部公共汽车和ABUS数据集上进行了广泛的验证，优于最先进的WSS方法和基础模型，并且可以实现与完全监督的学习算法相当的性能。

Title: Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization

Authors: Yankai Chen, Taotao Wang, Yixiang Fang, Yunyu Xiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20697
Pdf URL: https://arxiv.org/pdf/2503.20697
Copy Paste: [[2503.20697]] Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization(https://arxiv.org/abs/2503.20697)
Keywords: generation
Abstract: Node importance estimation, a classical problem in network analysis, underpins various web applications. Previous methods either exploit intrinsic topological characteristics, e.g., graph centrality, or leverage additional information, e.g., data heterogeneity, for node feature enhancement. However, these methods follow the supervised learning setting, overlooking the fact that ground-truth node-importance data are usually partially labeled in practice. In this work, we propose the first semi-supervised node importance estimation framework, i.e., EASING, to improve learning quality for unlabeled data in heterogeneous graphs. Different from previous approaches, EASING explicitly captures uncertainty to reflect the confidence of model predictions. To jointly estimate the importance values and uncertainties, EASING incorporates DJE, a deep encoder-decoder neural architecture. DJE introduces distribution modeling for graph nodes, where the distribution representations derive both importance and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label generation for the unlabeled data to enrich the training samples. Based on labeled and pseudo-labeled data, EASING develops effective semi-supervised heteroscedastic learning with varying node uncertainty regularization. Extensive experiments on three real-world datasets highlight the superior performance of EASING compared to competing methods. Codes are available via this https URL.
摘要：节点重要性估计是网络分析中的一个经典问题，是各种Web应用程序的基础。以前的方法可以利用固有的拓扑特性，例如图形中心性，或利用其他信息，例如数据异质性，以增强节点特征。但是，这些方法遵循监督的学习设置，忽略了一个事实，即通常在实践中会部分标记地面图形效率数据。在这项工作中，我们提出了第一个半监督节点重要性估计框架，即放松，以提高异质图中未标记数据的学习质量。与以前的方法不同，放松明确捕获不确定性，以反映模型预测的置信度。为了共同估计重要性和不确定性，放松融合了DJE，这是一个深层编码器神经架构。 DJE引入了图节点的分布建模，其中分布表示既得出重要性估计，又得出了不确定性的估计。此外，DJE促进了未标记数据的有效伪标记生成，以丰富训练样本。基于标记和伪标记的数据，通过不同的节点不确定性正则化开发有效的半监督异质学习。与竞争方法相比，在三个现实世界数据集上进行了广泛的实验突出了宽松的出色表现。可以通过此HTTPS URL获得代码。

Title: Learning Straight Flows by Learning Curved Interpolants

Authors: Shiv Shankar, Tomas Geffner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20719
Pdf URL: https://arxiv.org/pdf/2503.20719
Copy Paste: [[2503.20719]] Learning Straight Flows by Learning Curved Interpolants(https://arxiv.org/abs/2503.20719)
Keywords: generation, generative
Abstract: Flow matching models typically use linear interpolants to define the forward/noise addition process. This, together with the independent coupling between noise and target distributions, yields a vector field which is often non-straight. Such curved fields lead to a slow inference/generation process. In this work, we propose to learn flexible (potentially curved) interpolants in order to learn straight vector fields to enable faster generation. We formulate this via a multi-level optimization problem and propose an efficient approximate procedure to solve it. Our framework provides an end-to-end and simulation-free optimization procedure, which can be leveraged to learn straight line generative trajectories.
摘要：流匹配模型通常使用线性插值来定义正向/噪声添加过程。这与噪声和目标分布之间的独立耦合一起产生了通常不是连续的矢量场。这种弯曲场导致了缓慢的推理/生成过程。在这项工作中，我们建议学习灵活的（潜在弯曲的）插值，以学习直接矢量场以实现更快的生成。我们通过多级优化问题对此进行了制定，并提出了一个有效的近似程序来解决它。我们的框架提供了一个端到端和无模拟的优化过程，可以利用它来学习直线生成轨迹。
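
The linear interpolant that the abstract above takes as its starting point can be sketched as follows. This is a minimal illustration of the standard linear-interpolant flow-matching setup only, not the curved-interpolant method the paper proposes. The function names, the convention that `x0` is noise and `x1` is data, and the use of plain Python lists are all assumptions made for the example.

```python
import random


def linear_interpolant(x0, x1, t):
    """Point at time t on the straight line from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear interpolant; constant in t and equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def make_training_example(x1, rng):
    """Build one (t, x_t, velocity) regression target for a data point x1.

    Noise x0 is drawn independently of x1 (the "independent coupling"
    mentioned in the abstract); t is drawn uniformly from [0, 1).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    return t, linear_interpolant(x0, x1, t), target_velocity(x0, x1)


if __name__ == "__main__":
    rng = random.Random(0)
    t, x_t, v = make_training_example([1.0, -2.0, 0.5], rng)
    print(t, x_t, v)
```

Each individual path is straight here, but a model trained to regress these velocities learns their average over all noise/data pairs passing through a point, and that averaged vector field is generally curved. That is the slow-sampling issue the abstract describes.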

Title: RecTable: Fast Modeling Tabular Data with Rectified Flow

Authors: Masane Fuchi, Tomohiro Takagi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20731
Pdf URL: https://arxiv.org/pdf/2503.20731
Copy Paste: [[2503.20731]] RecTable: Fast Modeling Tabular Data with Rectified Flow(https://arxiv.org/abs/2503.20731)
Keywords: generation
Abstract: Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses rectified flow modeling, as applied in areas such as text-to-image generation and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at this https URL.
摘要：基于分数或扩散模型会生成高质量的表格数据，超过基于GAN的模型和基于VAE的模型。但是，这些方法需要大量的培训时间。在本文中，我们介绍了使用矫正的矫正，该校正使用了在文本到图像生成和文本对视频生成等中应用的整流流建模。可矫正功能具有简单的体系结构，该体系结构由一些堆叠的封闭式线性单元块组成。此外，我们的培训策略也很简单，结合了混合型噪声分布和logit-orpormal时间段分布。我们的实验表明，与几种最新的扩散和基于得分的模型相比，可矫正的竞争性能在减少所需的训练时间的同时。我们的代码可在此HTTPS URL上找到。
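
A logit-normal timestep distribution, as named in the abstract above, can be sketched as follows: draw a Gaussian sample and pass it through the logistic sigmoid, which yields a timestep in (0, 1). This is a minimal sketch of the general technique, not RecTable's implementation. The function name and the default `mean` and `std` values are assumptions, since the abstract does not state the parameters used.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) such that logit(t) is distributed as Normal(mean, std).

    mean and std are illustrative defaults, not values taken from the paper.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))  # logistic sigmoid maps the real line to (0, 1)


if __name__ == "__main__":
    rng = random.Random(0)
    timesteps = [sample_logit_normal(rng) for _ in range(5)]
    print(timesteps)
```

With `mean=0.0` the samples concentrate around t = 0.5 and thin out toward 0 and 1, so training spends more updates on intermediate timesteps than uniform sampling would.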

Title: High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching

Authors: Guoqiang Zhang, Kenta Niwa, J.P. Lewis, Cedric Mesnage, W. Bastiaan Kleijn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20744
Pdf URL: https://arxiv.org/pdf/2503.20744
Copy Paste: [[2503.20744]] High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching(https://arxiv.org/abs/2503.20744)
Keywords: generation
Abstract: We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g., 8-64) and large batch sizes (e.g., 128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batch size of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces FID scores comparable to those of the best method with 1 timestep under very limited computational resources.
摘要：我们介绍了相对和绝对位置匹配（RAPM），这是一种扩散蒸馏方法，导致高质量生成，可以在单个GPU上有效地训练。最近的扩散蒸馏研究已通过诸如分阶段一致性模型（PCM）和改进的分布匹配蒸馏（DMD2）等方法为高分辨率的文本对图像生成取得了良好的结果（DMD2）。但是，这些方法通常需要许多GPU（例如〜8-64）和在培训期间进行大量批次（例如〜128-2048），从而导致记忆和计算要求超出了一些研究人员的资源。 RAPM提供有效的单GPU扩散蒸馏训练，其批次化为1。新方法尝试通过匹配相对和绝对位置来模仿教师模型的采样轨迹。相对位置的设计灵感来自PCM。相应地在RAPM中引入了两个歧视因子，一个用于匹配的相对位置，另一个用于绝对位置。关于稳定率（SD）v1.5和SDXL的实验结果表明，在非常有限的计算资源下，具有4个时间段的RAPM作为最佳方法可与1个时间段的最佳方法产生可比的FID分数。

Title: MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study

Authors: Shakiba Rahimiaghdam, Hande Alemdar
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.20758
Pdf URL: https://arxiv.org/pdf/2503.20758
Copy Paste: [[2503.20758]] MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study(https://arxiv.org/abs/2503.20758)
Keywords: generation
Abstract: Ensuring transparency in machine learning decisions is critically important, especially in sensitive sectors such as healthcare, finance, and justice. Despite this, some popular explainable algorithms, such as Local Interpretable Model-agnostic Explanations (LIME), often produce unstable explanations due to the random generation of perturbed samples. Random perturbation introduces small changes or noise to modified instances of the original data, leading to inconsistent explanations. Even slight variations in the generated samples significantly affect the explanations provided by such models, undermining trust and hindering the adoption of interpretable models. To address this challenge, we propose MindfulLIME, a novel algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling. MindfulLIME substantially improves the consistency of visual explanations compared to random sampling approaches. Our experimental evaluation, conducted on a widely recognized chest X-ray dataset, confirms MindfulLIME's stability with a 100% success rate in delivering reliable explanations under identical conditions. Additionally, MindfulLIME improves the localization precision of visual explanations by reducing the distance between the generated explanations and the actual local annotations compared to LIME. We also performed comprehensive experiments considering various segmentation algorithms and sample numbers, focusing on stability, quality, and efficiency. The results demonstrate the outstanding performance of MindfulLIME across different segmentation settings, generating fewer high-quality samples within a reasonable processing time. By addressing the stability limitations of LIME in image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in specific medical imaging applications, a critical domain.
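The instability described in this abstract can be illustrated with a small sketch. It does not implement MindfulLIME (the graph-based pruning and uncertainty sampling are not reproduced here). It only contrasts a LIME-style linear surrogate fitted on randomly sampled perturbation masks with one fitted on a fixed, deterministic sample set. The black-box function, the four segments, and all function names are assumptions made for illustration.

```python
import numpy as np

def black_box(masks):
    # Hypothetical model score for each on/off mask over 4 image segments.
    # The interaction term makes it non-linear, so a linear surrogate
    # depends on which masks happen to be sampled.
    return masks[:, 0] * masks[:, 1] + 0.5 * masks[:, 2]

def fit_surrogate(masks):
    # Least-squares linear surrogate: intercept plus one weight per segment.
    X = np.hstack([np.ones((len(masks), 1)), masks])
    coef, *_ = np.linalg.lstsq(X, black_box(masks), rcond=None)
    return coef[1:]  # drop the intercept, keep the per-segment weights

def random_masks(seed, n=12, d=4):
    # Random perturbation sampling: the sample set changes with the seed.
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n, d)).astype(float)

def all_masks(d=4):
    # Deterministic sample set: every on/off combination of the d segments.
    return np.array(
        [[(i >> j) & 1 for j in range(d)] for i in range(2 ** d)], dtype=float
    )

random_runs = [fit_surrogate(random_masks(seed)) for seed in (0, 1)]
fixed_runs = [fit_surrogate(all_masks()) for _ in range(2)]

print("random sampling:", np.round(random_runs, 3))
print("fixed sampling: ", np.round(fixed_runs, 3))
```

The two fixed-sample runs return identical segment weights, while the two random-sample runs will generally disagree with each other. That disagreement under identical conditions is the kind of inconsistency the abstract attributes to random perturbation.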

Title: Reliable algorithm selection for machine learning-guided design

Authors: Clara Fannjiang, Ji Won Park
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20767
Pdf URL: https://arxiv.org/pdf/2503.20767
Copy Paste: [[2503.20767]] Reliable algorithm selection for machine learning-guided design(https://arxiv.org/abs/2503.20767)
Keywords: generative
Abstract: Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task -- for example, to design novel proteins with high binding affinity to a therapeutic target -- one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion -- for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
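As a rough illustration of the idea of combining designs' predicted property values with held-out labeled data, the sketch below computes a prediction-powered point estimate of the fraction of designs whose labels exceed a threshold. It is a minimal sketch under stated assumptions, not the paper's selection procedure: it omits the statistical guarantee and the choice among design algorithms. The function name, the argument names, and the optional `weights` argument (standing in for known density ratios between the design and labeled data distributions) are hypothetical.

```python
import numpy as np

def ppi_success_rate(pred_design, pred_labeled, y_labeled, threshold, weights=None):
    """Prediction-powered estimate of P(label > threshold) over the designs.

    weights: assumed density ratios p_design(x) / p_labeled(x) for the
    held-out labeled points; defaults to 1 (no distribution shift).
    """
    pred_design = np.asarray(pred_design, dtype=float)
    pred_labeled = np.asarray(pred_labeled, dtype=float)
    y_labeled = np.asarray(y_labeled, dtype=float)
    if weights is None:
        weights = np.ones_like(y_labeled)
    weights = np.asarray(weights, dtype=float)

    # Naive estimate: fraction of designs whose predicted value clears the threshold.
    naive = np.mean(pred_design > threshold)

    # Rectifier: how much the predictor over- or under-counts successes on the
    # held-out labeled data, importance-weighted toward the design distribution.
    err = (pred_labeled > threshold).astype(float) - (y_labeled > threshold).astype(float)
    rectifier = np.mean(weights * err)

    return naive - rectifier

estimate = ppi_success_rate(
    pred_design=[1.2, 0.4, 0.9, 1.5],
    pred_labeled=[1.1, 0.8, 1.3, 0.2],
    y_labeled=[0.9, 0.7, 1.4, 0.1],
    threshold=1.0,
)
print(estimate)  # 0.25
```

In this toy input, half of the designs are predicted to succeed, but the predictor over-counts one of four held-out points, so the estimate is corrected from 0.5 down to 0.25. A success criterion such as "at least ten percent of designs' labels exceed a threshold" would then be checked against an estimate of this kind, with uncertainty quantification that this sketch leaves out.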

Title: Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

Authors: Masoumeh Sharafi, Emma Ollivier, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20771
Pdf URL: https://arxiv.org/pdf/2503.20771
Copy Paste: [[2503.20771]] Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data(https://arxiv.org/abs/2503.20771)
Keywords: generation
Abstract: Facial Expression Recognition (FER) from videos is a crucial task in various application areas, such as human-computer interaction and health monitoring (e.g., pain, depression, fatigue, and stress). Beyond the challenges of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable variability of expressions among subjects. Source-free domain adaptation (SFDA) methods are employed to adapt a pre-trained source model using only unlabeled target domain data, thereby avoiding data privacy and storage issues. Typically, SFDA methods adapt to a target domain dataset corresponding to an entire population and assume it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (displaying only neutral expressions) for target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled Source-Free Domain Adaptation (DSFDA) method to address the SFDA challenge posed by missing target expression data. DSFDA leverages data from a neutral target control video for end-to-end generation and adaptation of target data with missing non-neutral data. Our method learns to disentangle features related to expressions and identity while generating the missing non-neutral target data, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression.

Title: FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Authors: Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang, Li Yi, Yao Yao, Jingwei Zhao, Hongyang Li, Yikai Wang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20784
Pdf URL: https://arxiv.org/pdf/2503.20784
Copy Paste: [[2503.20784]] FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks(https://arxiv.org/abs/2503.20784)
Keywords: generation
Abstract: With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.

Title: Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20785
Pdf URL: https://arxiv.org/pdf/2503.20785
Copy Paste: [[2503.20785]] Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency(https://arxiv.org/abs/2503.20785)
Keywords: generation
Abstract: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.