2024-12-13

Title: ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes

Authors: Yuxi Wei, Jingbo Wang, Yuwen Du, Dingju Wang, Liang Pan, Chenxin Xu, Yao Feng, Bo Dai, Siheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08685
Pdf URL: https://arxiv.org/pdf/2412.08685
Copy Paste: [[2412.08685]] ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes(https://arxiv.org/abs/2412.08685)
Keywords: generation
Abstract: Generating realistic and interactive dynamics of traffic participants according to specific instruction is critical for street scene simulation. However, there is currently a lack of a comprehensive method that generates realistic dynamics of different types of participants including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi-LLM-agent role-playing approach, which utilizes natural language inputs to plan the trajectories and behaviors for different traffic participants. To generate realistic fine-grained dynamics based on the planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi-task executor that generates realistic pedestrian dynamics under different task plannings; and the VehExecutor, a physical transition-based policy that generates physically plausible vehicle dynamics. Extensive experiments show that ChatDyn can generate realistic driving scene dynamics with multiple vehicles and pedestrians, and significantly outperforms previous methods on subtasks. Code and model will be available at this https URL.

Title: ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder

Authors: Jungho Kim, Changwon Kang, Dongyoung Lee, Sehwan Choi, Jun Won Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08774
Pdf URL: https://arxiv.org/pdf/2412.08774
Copy Paste: [[2412.08774]] ProtoOcc: Accurate, Efficient 3D Occupancy Prediction Using Dual Branch Encoder-Prototype Query Decoder(https://arxiv.org/abs/2412.08774)
Keywords: generation
Abstract: In this paper, we introduce ProtoOcc, a novel 3D occupancy prediction model designed to predict the occupancy states and semantic classes of 3D voxels through a deep semantic understanding of scenes. ProtoOcc consists of two main components: the Dual Branch Encoder (DBE) and the Prototype Query Decoder (PQD). The DBE produces a new 3D voxel representation by combining 3D voxel and BEV representations across multiple scales through a dual branch structure. This design enhances both performance and computational efficiency by providing a large receptive field for the BEV representation while maintaining a smaller receptive field for the voxel representation. The PQD introduces Prototype Queries to accelerate the decoding process. Scene-Adaptive Prototypes are derived from the 3D voxel features of the input sample, while Scene-Agnostic Prototypes are computed by applying Scene-Adaptive Prototypes to an Exponential Moving Average during the training phase. By using these prototype-based queries for decoding, we can directly predict 3D occupancy in a single step, eliminating the need for iterative Transformer decoding. Additionally, we propose Robust Prototype Learning, which injects noise into the prototype generation process and trains the model to denoise during the training phase. ProtoOcc achieves state-of-the-art performance with 45.02% mIoU on the Occ3D-nuScenes benchmark. As a single-frame method, it reaches 39.56% mIoU with an inference speed of 12.83 FPS on an NVIDIA RTX 3090. Our code can be found at this https URL.
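
The ProtoOcc abstract says Scene-Agnostic Prototypes come from feeding Scene-Adaptive Prototypes into an Exponential Moving Average during training. A minimal sketch of one such EMA update follows. The function name `ema_update`, the momentum value, and the array shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def ema_update(agnostic, adaptive, momentum=0.99):
    """One EMA step: blend the running prototypes with the current sample's.

    `momentum` is a hypothetical value; the abstract does not state one.
    """
    return momentum * agnostic + (1.0 - momentum) * adaptive


# Hypothetical sizes: 4 semantic classes, 8-dimensional prototype vectors.
rng = np.random.default_rng(0)
num_classes, dim = 4, 8
agnostic = np.zeros((num_classes, dim))

for _ in range(100):
    # Stand-in for per-sample (scene-adaptive) prototypes from voxel features.
    adaptive = rng.normal(loc=1.0, scale=0.1, size=(num_classes, dim))
    agnostic = ema_update(agnostic, adaptive)

print(agnostic.shape)
```

The running average changes slowly, so it summarizes many training scenes rather than any single input, which is presumably what makes the resulting prototypes scene-agnostic.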

Title: Generative Modeling with Explicit Memory

Authors: Yi Tang, Peng Sun, Zhenglin Cheng, Tao Lin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08781
Pdf URL: https://arxiv.org/pdf/2412.08781
Copy Paste: [[2412.08781]] Generative Modeling with Explicit Memory(https://arxiv.org/abs/2412.08781)
Keywords: generation, generative
Abstract: Recent studies indicate that the denoising process in deep generative diffusion models implicitly learns and memorizes semantic information from the data distribution. These findings suggest that capturing more complex data distributions requires larger neural networks, leading to a substantial increase in computational demands, which in turn become the primary bottleneck in both training and inference of diffusion models. To this end, we introduce Generative Modeling with Explicit Memory (GMem), leveraging an external memory bank in both training and sampling phases of diffusion models. This approach preserves semantic information from data distributions, reducing reliance on neural network capacity for learning and generalizing across diverse datasets. The results are significant: our GMem improves training efficiency, sampling efficiency, and generation quality. For instance, on ImageNet at 256×256 resolution, GMem accelerates SiT training by over 46.7×, achieving the performance of a SiT model trained for 7M steps in fewer than 150K steps. Compared to the most efficient existing method, REPA, GMem still offers a 16× speedup, attaining an FID score of 5.75 within 250K steps, whereas REPA requires over 4M steps. Additionally, our method achieves state-of-the-art generation quality, with an FID score of 3.56 without classifier-free guidance on ImageNet 256×256. Our code is available at this https URL.
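
The speedup figures in the GMem abstract can be checked with simple arithmetic, assuming speedup is measured as the ratio of training steps needed to reach the same quality. That reading is an assumption; the abstract does not define the metric. The helper name `speedup` is hypothetical.

```python
def speedup(baseline_steps, method_steps):
    """Ratio of training steps: how many times fewer steps the method needs."""
    return baseline_steps / method_steps


# SiT at 7M steps vs. GMem at 150K steps (the abstract says "fewer than 150K",
# so the true ratio is slightly above this value).
print(round(speedup(7_000_000, 150_000), 1))  # 46.7

# REPA at 4M steps vs. GMem at 250K steps.
print(speedup(4_000_000, 250_000))  # 16.0
```

Both ratios agree with the quoted 46.7× and 16× figures under this reading.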

Title: DALI: Domain Adaptive LiDAR Object Detection via Distribution-level and Instance-level Pseudo Label Denoising

Authors: Xiaohu Lu, Hayder Radha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08806
Pdf URL: https://arxiv.org/pdf/2412.08806
Copy Paste: [[2412.08806]] DALI: Domain Adaptive LiDAR Object Detection via Distribution-level and Instance-level Pseudo Label Denoising(https://arxiv.org/abs/2412.08806)
Keywords: generation
Abstract: Object detection using LiDAR point clouds relies on a large amount of human-annotated samples when training the underlying detectors' deep neural networks. However, generating 3D bounding box annotations for a large-scale dataset can be costly and time-consuming. Alternatively, unsupervised domain adaptation (UDA) enables a given object detector to operate on novel data with an unlabeled training dataset, by transferring the knowledge learned from labeled source-domain data to the new unlabeled target domain. Pseudo label strategies, which involve training the 3D object detector using target-domain predicted bounding boxes from a pre-trained model, are commonly used in UDA. However, these pseudo labels often introduce noise, impacting performance. In this paper, we introduce the Domain Adaptive LIdar (DALI) object detection framework to address noise at both distribution and instance levels. First, a post-training size normalization (PTSN) strategy is developed to mitigate bias in the pseudo label size distribution by identifying an unbiased scale after network training. To address instance-level noise between pseudo labels and corresponding point clouds, two pseudo point cloud generation (PPCG) strategies, ray-constrained and constraint-free, are developed to generate pseudo point clouds for each instance, ensuring consistency between pseudo labels and pseudo points during training. We demonstrate the effectiveness of our method on the publicly available and popular datasets KITTI, Waymo, and nuScenes. We show that the proposed DALI framework achieves state-of-the-art results and outperforms leading approaches on most of the domain adaptation tasks. Our code is available at this https URL.

Title: ViUniT: Visual Unit Tests for More Robust Visual Programming

Authors: Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08859
Pdf URL: https://arxiv.org/pdf/2412.08859
Copy Paste: [[2412.08859]] ViUniT: Visual Unit Tests for More Robust Visual Programming(https://arxiv.org/abs/2412.08859)
Keywords: generation
Abstract: Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers and image synthesis to produce corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and varying the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.
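
One of the four applications listed in the ViUniT abstract is best program selection. The sketch below shows the general idea under strong simplifications: candidate programs are plain Python callables, each unit test is an (image, expected answer) pair, and the "image" is a dictionary stand-in instead of a synthesized picture. All names (`pass_count`, `select_best`, the two toy programs) are hypothetical, and the scoring rule (most tests passed) is an assumption, not the paper's exact formulation.

```python
from typing import Any, Callable, Sequence, Tuple

UnitTest = Tuple[Any, Any]  # (image stand-in, expected answer)


def pass_count(program: Callable[[Any], Any], tests: Sequence[UnitTest]) -> int:
    """Count how many unit tests a candidate program answers correctly."""
    passed = 0
    for image, expected in tests:
        try:
            if program(image) == expected:
                passed += 1
        except Exception:
            pass  # a crashing program simply fails that test
    return passed


def select_best(programs, tests):
    """Return the candidate that passes the most unit tests."""
    return max(programs, key=lambda p: pass_count(p, tests))


# Toy query: "Are there more than two dogs?"
def program_a(image):
    return "yes"  # can be right for the wrong reason


def program_b(image):
    return "yes" if image["dogs"] > 2 else "no"


tests = [({"dogs": 3}, "yes"), ({"dogs": 1}, "no")]
best = select_best([program_a, program_b], tests)
print(best.__name__)  # program_b
```

Here `program_a` answers the first test correctly but fails the second, which is the "right for the wrong reasons" case that the unit tests are meant to expose.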

Title: Radiology Report Generation via Multi-objective Preference Optimization

Authors: Ting Xiao, Lei Shi, Peng Liu, Zhe Wang, Chenjia Bai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08901
Pdf URL: https://arxiv.org/pdf/2412.08901
Copy Paste: [[2412.08901]] Radiology Report Generation via Multi-objective Preference Optimization(https://arxiv.org/abs/2412.08901)
Keywords: generation
Abstract: Automatic Radiology Report Generation (RRG) is an important topic for alleviating the substantial workload of radiologists. Existing RRG approaches rely on supervised regression based on different architectures or additional knowledge injection, while the generated report may not align optimally with radiologists' preferences. In particular, radiologists' preferences are inherently heterogeneous and multidimensional: some may prioritize report fluency, while others emphasize clinical accuracy. To address this problem, we propose a new RRG method via Multi-objective Preference Optimization (MPO) to align the pre-trained RRG model with multiple human preferences, which can be formulated by multi-dimensional reward functions and optimized by multi-objective reinforcement learning (RL). Specifically, we use a preference vector to represent the weight of preferences and use it as a condition for the RRG model. Then, a linearly weighted reward is obtained via a dot product between the preference vector and the multi-dimensional reward. The RRG model is then optimized to align with the preference vector by optimizing this reward via RL. In the training stage, we randomly sample diverse preference vectors from the preference space and align the model by optimizing the weighted multi-objective rewards, which leads to an optimal policy on the entire preference space. At inference, our model can generate reports aligned with specific preferences without further fine-tuning. Extensive experiments on two public datasets show that the proposed method can generate reports that cater to different preferences in a single model and achieve state-of-the-art performance.
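
The MPO abstract describes a scalar reward formed as the dot product of a preference vector and a multi-dimensional reward, with preference vectors sampled at random during training. A minimal sketch of that scalarization follows. The two reward dimensions, their values, and the use of a Dirichlet distribution for sampling are assumptions made for illustration; the abstract only says vectors are sampled from the preference space.

```python
import numpy as np


def scalarize(preference, rewards):
    """Linearly weighted reward: dot product of preference weights and rewards."""
    return float(np.dot(preference, rewards))


rng = np.random.default_rng(0)

# Hypothetical per-report rewards: [fluency score, clinical accuracy score].
rewards = np.array([0.8, 0.6])

# Assumed sampling scheme: a Dirichlet draw gives non-negative weights summing to 1.
preference = rng.dirichlet(np.ones(2))

print(preference, scalarize(preference, rewards))
```

Because the preference vector is also fed to the model as a condition, the same model can in principle be steered at inference time by choosing a different vector, for example one that puts most of its weight on clinical accuracy.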

Title: Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression

Authors: Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.08912
Pdf URL: https://arxiv.org/pdf/2412.08912
Copy Paste: [[2412.08912]] Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression(https://arxiv.org/abs/2412.08912)
Keywords: restoration
Abstract: In this paper, we introduce DiQP, a novel Transformer-Diffusion model for restoring 8K video quality degraded by codec compression. To the best of our knowledge, our model is the first to consider restoring the artifacts introduced by various codecs (AV1, HEVC) by Denoising Diffusion without considering additional noise. This approach allows us to model the complex, non-Gaussian nature of compression artifacts, effectively learning to reverse the degradation. Our architecture combines the power of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatiotemporal context within groups of pixels across frames. To further enhance restoration, the model incorporates auxiliary "Look Ahead" and "Look Around" modules, providing both future and surrounding frame information to aid in reconstructing fine details and enhancing overall visual quality. Extensive experiments on different datasets demonstrate that our model outperforms state-of-the-art methods, particularly for high-resolution videos such as 4K and 8K, showcasing its effectiveness in restoring perceptually pleasing videos from highly compressed sources.

Title: Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration

Authors: Yunshuai Zhou, Junbo Qiao, Jincheng Liao, Wei Li, Simiao Li, Jiao Xie, Yunhang Shen, Jie Hu, Shaohui Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08939
Pdf URL: https://arxiv.org/pdf/2412.08939
Copy Paste: [[2412.08939]] Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration(https://arxiv.org/abs/2412.08939)
Keywords: restoration
Abstract: Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during the distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type loss struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space using contrastive learning. Additionally, we also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms the state-of-the-art KD methods across various image restoration tasks and backbones.

Title: Selective Visual Prompting in Vision Mamba

Authors: Yifeng Yao, Zichen Liu, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.08947
Pdf URL: https://arxiv.org/pdf/2412.08947
Copy Paste: [[2412.08947]] Selective Visual Prompting in Vision Mamba(https://arxiv.org/abs/2412.08947)
Keywords: generation
Abstract: Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at this https URL.

Title: Mojito: Motion Trajectory and Intensity Control for Video Generation

Authors: Xuehai He, Shuohang Wang, Jianwei Yang, Xiaoxia Wu, Yiping Wang, Kuan Wang, Zheng Zhan, Olatunji Ruwase, Yelong Shen, Xin Eric Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.08948
Pdf URL: https://arxiv.org/pdf/2412.08948
Copy Paste: [[2412.08948]] Mojito: Motion Trajectory and Intensity Control for Video Generation(https://arxiv.org/abs/2412.08948)
Keywords: generation
Abstract: Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. This paper introduces Mojito, a diffusion model that incorporates both Motion trajectory and intensity control for text-to-video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to efficiently direct the generated object's motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.

Title: Elevating Flow-Guided Video Inpainting with Reference Generation

Authors: Suhwan Cho, Seoung Wug Oh, Sangyoun Lee, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.08975
Pdf URL: https://arxiv.org/pdf/2412.08975
Copy Paste: [[2412.08975]] Elevating Flow-Guided Video Inpainting with Reference Generation(https://arxiv.org/abs/2412.08975)
Keywords: generation, generative
Abstract: Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.

Title: Enhancing Facial Consistency in Conditional Video Generation via Facial Landmark Transformation

Authors: Lianrui Mu, Xingze Zhou, Wenjie Zheng, Jiangnan Ye, Xiaoyu Liang, Yuchen Yang, Jianhong Bai, Jiedong Zhuang, Haoji Hu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.08976
Pdf URL: https://arxiv.org/pdf/2412.08976
Copy Paste: [[2412.08976]] Enhancing Facial Consistency in Conditional Video Generation via Facial Landmark Transformation(https://arxiv.org/abs/2412.08976)
Keywords: generation
Abstract: Landmark-guided character animation generation is an important field. Generating character animations with facial features consistent with a reference image remains a significant challenge in conditional video generation, especially when complex motions such as dancing are involved. Existing methods often fail to maintain facial feature consistency due to mismatches between the facial landmarks extracted from source videos and the target facial features in the reference image. To address this problem, we propose a facial landmark transformation method based on the 3D Morphable Model (3DMM). We obtain transformed landmarks that align with the target facial features by reconstructing 3D faces from the source landmarks and adjusting the 3DMM parameters to match the reference image. Our method improves the facial consistency between the generated videos and the reference images, effectively mitigating the facial feature mismatch problem.

Title: MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments

Authors: Yuqi Tong, Yue Qiu, Ruiyang Li, Shi Qiu, Pheng-Ann Heng
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2412.09008
Pdf URL: https://arxiv.org/pdf/2412.09008
Copy Paste: [[2412.09008]] MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments(https://arxiv.org/abs/2412.09008)
Keywords: generation, generative
Abstract: We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline that enables users to create realistic 3D objects in extended reality (XR) environments using hand-drawn sketches assisted by voice inputs. Specifically, users can intuitively sketch objects using natural hand movements in mid-air within a virtual environment. By integrating voice inputs, we devise ControlNet to infer realistic images based on the drawn sketches and interpreted text prompts. Users can then review and select their preferred image, which is subsequently reconstructed into a detailed 3D mesh using the Convolutional Reconstruction Model. In particular, our proposed pipeline can generate a high-quality 3D mesh in less than 20 seconds, allowing for immersive visualization and manipulation in run-time XR scenes. We demonstrate the practicability of our pipeline through two use cases in XR settings. By leveraging natural user inputs and cutting-edge generative AI capabilities, our approach can significantly facilitate XR-based creative production and enhance user experiences. Our code and demo will be available at: this https URL

Title: Arbitrary-steps Image Super-resolution via Diffusion Inversion

Authors: Zongsheng Yue, Kang Liao, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09013
Pdf URL: https://arxiv.org/pdf/2412.09013
Copy Paste: [[2412.09013]] Arbitrary-steps Image Super-resolution via Diffusion Inversion(https://arxiv.org/abs/2412.09013)
Keywords: super-resolution
Abstract: This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point. Central to our approach is a deep noise predictor to estimate the optimal noise maps for the forward diffusion process. Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result. Compared to existing approaches, our method offers a flexible and efficient sampling mechanism that supports an arbitrary number of sampling steps, ranging from one to five. Even with a single sampling step, our method demonstrates superior or comparable performance to recent state-of-the-art approaches. The code and model are publicly available at this https URL.

Title: Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

Authors: Hang Zhou, Jiale Cai, Yuteng Ye, Yonghui Feng, Chenxing Gao, Junqing Yu, Zikai Song, Wei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09026
Pdf URL: https://arxiv.org/pdf/2412.09026
Copy Paste: [[2412.09026]] Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model(https://arxiv.org/abs/2412.09026)
Keywords: generation
Abstract: A recent endeavor in one class of video anomaly detection is to leverage diffusion models and posit the task as a generation problem, where the diffusion model is trained to recover normal patterns exclusively, thus reporting abnormal patterns as outliers. Yet, existing attempts neglect the various formations of anomaly and predict normal samples at the feature level, even though abnormal objects in surveillance videos are often relatively small. To address this, a novel patch-based diffusion model is proposed, specifically engineered to capture fine-grained local information. We further observe that anomalies in videos manifest themselves as deviations in both appearance and motion. Therefore, we argue that a comprehensive solution must consider both of these aspects simultaneously to achieve accurate frame prediction. To address this, we introduce innovative motion and appearance conditions that are seamlessly integrated into our patch diffusion model. These conditions are designed to guide the model in generating coherent and contextually appropriate predictions for both semantic content and motion relations. Experimental results on four challenging video anomaly detection datasets empirically substantiate the efficacy of our proposed approach, demonstrating that it consistently outperforms most existing methods in detecting abnormal behaviors.

Title: An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques

Authors: Chunxiao Li, Xiaoxiao Wang, Boming Miao, Chuanlong Xie, Zizhe Wang, Yao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09063
Pdf URL: https://arxiv.org/pdf/2412.09063
Copy Paste: [[2412.09063]] An Efficient Framework for Enhancing Discriminative Models via Diffusion Techniques(https://arxiv.org/abs/2412.09063)
Keywords: generative
Abstract: Image classification serves as the cornerstone of computer vision, traditionally achieved through discriminative models based on deep neural networks. Recent advancements have introduced classification methods derived from generative models, which offer the advantage of zero-shot classification. However, these methods suffer from two main drawbacks: high computational overhead and inferior performance compared to discriminative models. Inspired by the coordinated cognitive processes of rapid-slow pathway interactions in the human brain during visual signal recognition, we propose the Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This framework seamlessly integrates discriminative and generative models in a training-free manner, leveraging discriminative models for initial predictions and endowing deep neural networks with rethinking capabilities via diffusion models. Consequently, DBMEF can effectively enhance the classification accuracy and generalization capability of discriminative models in a plug-and-play manner. We have conducted extensive experiments across 17 prevalent deep model architectures with different training methods, including both CNN-based models such as ResNet and Transformer-based models like ViT, to demonstrate the effectiveness of the proposed DBMEF. Specifically, the framework yields a 1.51% performance improvement for ResNet-50 on the ImageNet dataset and 3.02% on the ImageNet-A dataset. In conclusion, our research introduces a novel paradigm for image classification, demonstrating stable improvements across different datasets and neural networks.
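
The "initial prediction, then rethinking" idea can be pictured as a two-stage re-ranking. The sketch below is a hypothetical illustration, not the paper's procedure: `rethink`, the top-`k` shortlist, and the `generative_score` callback (standing in for any diffusion-based class score) are all assumptions made here.

```python
def rethink(probs, generative_score, k=3):
    """Two-stage classification sketch.

    probs: class probabilities from a discriminative model (fast pathway).
    generative_score: hypothetical callable mapping a class index to a score
        from a generative model (slow pathway); higher means more plausible.
    k: number of shortlisted candidates (an assumed hyperparameter).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Fast pathway: shortlist the k most probable classes.
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    candidates = ranked[:k]
    # Slow pathway: re-rank only the shortlist with the generative score.
    return max(candidates, key=generative_score)


# Usage: the classifier prefers class 1, the generative score prefers class 2.
scores = {0: 9.0, 1: 0.2, 2: 0.7}
print(rethink([0.1, 0.5, 0.4], scores.get, k=2))
```

Restricting the expensive generative scoring to a short candidate list is one plausible way to limit the computational overhead the abstract attributes to purely generative classifiers.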

Title: Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Authors: Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09082
Pdf URL: https://arxiv.org/pdf/2412.09082
Copy Paste: [[2412.09082]] Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method(https://arxiv.org/abs/2412.09082)
Keywords: generation
Abstract: Existing Vision-Language Navigation (VLN) methods primarily focus on single-stage navigation, limiting their effectiveness in multi-stage and long-horizon tasks within complex and dynamic environments. To address these limitations, we propose a novel VLN task, named Long-Horizon Vision-Language Navigation (LH-VLN), which emphasizes long-term planning and decision consistency across consecutive subtasks. Furthermore, to support LH-VLN, we develop an automated data generation platform NavGen, which constructs datasets with complex task structures and improves data utility through a bidirectional, multi-granularity generation approach. To accurately evaluate complex tasks, we construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark consisting of 3,260 tasks with an average of 150 task steps, serving as the first dataset specifically designed for the long-horizon vision-language navigation task. Furthermore, we propose Independent Success Rate (ISR), Conditional Success Rate (CSR), and CSR weight by Ground Truth (CGT) metrics, to provide fine-grained assessments of task completion. To improve model adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic Memory (MGDM) module that integrates short-term memory blurring with long-term memory retrieval to enable flexible navigation in dynamic environments. Our platform, benchmark and method supply LH-VLN with a robust data generation pipeline, comprehensive model evaluation dataset, reasonable metrics, and a novel VLN model, establishing a foundational framework for advancing LH-VLN.

Title: LVMark: Robust Watermark for latent video diffusion models

Authors: MinHyuk Jang, Youngdong Jang, JaeHyeok Lee, Kodai Kawamura, Feng Yang, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09122
Pdf URL: https://arxiv.org/pdf/2412.09122
Copy Paste: [[2412.09122]] LVMark: Robust Watermark for latent video diffusion models(https://arxiv.org/abs/2412.09122)
Keywords: generative
Abstract: Rapid advancements in generative models have made it possible to create hyper-realistic videos. As their applicability increases, their unauthorized use has raised significant concerns, leading to the growing demand for techniques to protect the ownership of the generative model itself. While existing watermarking methods effectively embed watermarks into image-generative models, they fail to account for temporal information, resulting in poor performance when applied to video-generative models. To address this issue, we introduce a novel watermarking method called LVMark, which embeds watermarks into video diffusion models. A key component of LVMark is a selective weight modulation strategy that efficiently embeds watermark messages into the video diffusion model while preserving the quality of the generated videos. To accurately decode messages in the presence of malicious attacks, we design a watermark decoder that leverages spatio-temporal information in the 3D wavelet domain through a cross-attention module. To the best of our knowledge, our approach is the first to highlight the potential of video-generative model watermarking as a valuable tool for enhancing the effectiveness of ownership protection in video-generative models.

Title: Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation

Authors: Kirill Sirotkin, Marcos Escudero-Viñolo, Pablo Carballeira, Mayug Maniparambil, Catarina Barata, Noel E. O'Connor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09160
Pdf URL: https://arxiv.org/pdf/2412.09160
Copy Paste: [[2412.09160]] Pinpoint Counterfactuals: Reducing social bias in foundation models via localized counterfactual generation(https://arxiv.org/abs/2412.09160)
Keywords: generation
Abstract: Foundation models trained on web-scraped datasets propagate societal biases to downstream tasks. While counterfactual generation enables bias analysis, existing methods introduce artifacts by modifying contextual elements like clothing and background. We present a localized counterfactual generation method that preserves image context by constraining counterfactual modifications to specific attribute-relevant regions through automated masking and guided inpainting. When applied to the Conceptual Captions dataset for creating gender counterfactuals, our method results in higher visual and semantic fidelity than state-of-the-art alternatives, while maintaining the performance of models trained using only real data on non-human-centric tasks. Models fine-tuned with our counterfactuals demonstrate measurable bias reduction across multiple metrics, including a decrease in gender classification disparity and balanced person preference scores, while preserving ImageNet zero-shot performance. The results establish a framework for creating balanced datasets that enable both accurate bias profiling and effective mitigation.
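
Constraining an edit to an attribute-relevant region amounts to mask-based compositing: edited pixels are used inside the mask and original pixels everywhere else. The following is a minimal sketch of that blending step only. The function name `composite` and the nested-list image format are assumptions for illustration, and the automated masking and guided inpainting mentioned in the abstract are not modelled.

```python
def composite(original, edited, mask):
    """Blend two images of equal size, row by row.

    mask values are 1 where the edited pixel is used and 0 where the
    original pixel is kept, so context outside the mask is untouched.
    """
    return [
        [m * e + (1 - m) * o for o, e, m in zip(row_o, row_e, row_m)]
        for row_o, row_e, row_m in zip(original, edited, mask)
    ]


original = [[1, 2], [3, 4]]
edited = [[9, 9], [9, 9]]
mask = [[0, 1], [0, 0]]  # only the top-right pixel may change
print(composite(original, edited, mask))
```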

Title: RAD: Region-Aware Diffusion Models for Image Inpainting

Authors: Sora Kim, Sungho Suh, Minsik Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09191
Pdf URL: https://arxiv.org/pdf/2412.09191
Copy Paste: [[2412.09191]] RAD: Region-Aware Diffusion Models for Image Inpainting(https://arxiv.org/abs/2412.09191)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, i.e., conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrated that RAD provides state-of-the-art results both qualitatively and quantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.
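
To make "a different noise schedule for each pixel" concrete, the sketch below applies the usual forward-diffusion formula with a per-pixel timestep, so pixels assigned step 0 stay clean while others are noised to varying degrees. It is an illustration under assumptions: the names `noise_per_pixel`, `t_map` and `alpha_bar`, the flattened image, and the toy schedule values are all hypothetical, and the paper's actual schedule design is not reproduced.

```python
import math


def noise_per_pixel(x0, eps, t_map, alpha_bar):
    """Noise each pixel to its own timestep.

    x0: clean pixel values (flattened image).
    eps: one noise sample per pixel.
    t_map: per-pixel timestep index into alpha_bar.
    alpha_bar: cumulative schedule with alpha_bar[0] == 1.0 (no noise).
    """
    noisy = []
    for x, e, t in zip(x0, eps, t_map):
        a = alpha_bar[t]
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * e)
    return noisy


alpha_bar = [1.0, 0.25, 0.0]  # toy schedule: clean, partly noised, pure noise
print(noise_per_pixel([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], [0, 1, 2], alpha_bar))
```

In an inpainting setting one would expect known pixels to sit at low-noise steps and the missing region at high-noise steps, which is how local regions can be denoised asynchronously within a single reverse pass.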

Title: ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring

Authors: Zhongbao Yang, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09193
Pdf URL: https://arxiv.org/pdf/2412.09193
Copy Paste: [[2412.09193]] ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring(https://arxiv.org/abs/2412.09193)
Keywords: restoration
Abstract: Removing blur caused by moving objects is challenging, as the moving objects are usually significantly blurry while the static background remains clear. Existing methods that rely on local blur detection often suffer from inaccuracies and cannot generate satisfactory results when focusing solely on blurred regions. To overcome these problems, we first design a context-based local blur detection module that incorporates additional contextual information to improve the identification of blurry regions. Considering that modern smartphones are equipped with cameras capable of providing short-exposure images, we develop a blur-aware guided image restoration method that utilizes sharp structural details from short-exposure images, facilitating accurate reconstruction of heavily blurred regions. Furthermore, to restore realistic and visually pleasing images, we develop a short-exposure guided diffusion model that explores useful features from short-exposure images and blurred regions to better constrain the diffusion process. Finally, we formulate the above components into a simple yet effective network, named ExpRDiff. Experimental results show that ExpRDiff performs favorably against state-of-the-art methods.

Title: eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction

Authors: Jad Mansour, Hayat Rajani, Rafael Garcia, Nuno Gracias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09209
Pdf URL: https://arxiv.org/pdf/2412.09209
Copy Paste: [[2412.09209]] eCARLA-scenes: A synthetically generated dataset for event-based optical flow prediction(https://arxiv.org/abs/2412.09209)
Keywords: generation
Abstract: The joint use of event-based vision and Spiking Neural Networks (SNNs) is expected to have a large impact in robotics in the near future, in tasks such as visual odometry and obstacle avoidance. While researchers have used real-world event datasets for optical flow prediction (mostly captured with Unmanned Aerial Vehicles (UAVs)), these datasets are limited in diversity and scalability and are challenging to collect. Thus, synthetic datasets offer a scalable alternative by bridging the gap between reality and simulation. In this work, we address the lack of datasets by introducing eWiz, a comprehensive library for processing event-based data. It includes tools for data loading, augmentation, visualization, encoding, and generation of training data, along with loss functions and performance metrics. We further present synthetic event-based datasets and data generation pipelines for optical flow prediction tasks. Built on top of eWiz, eCARLA-scenes makes use of the CARLA simulator to simulate self-driving car scenarios. The ultimate goal of this dataset is the depiction of diverse environments while laying a foundation for advancing event-based camera applications in autonomous field vehicle navigation, paving the way for using SNNs on neuromorphic hardware such as the Intel Loihi.

Title: LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

Authors: Chunyu Li, Chao Zhang, Weikai Xu, Jinghui Xie, Weiguo Feng, Bingyue Peng, Weiwei Xing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09262
Pdf URL: https://arxiv.org/pdf/2412.09262
Copy Paste: [[2412.09262]] LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync(https://arxiv.org/abs/2412.09262)
Keywords: generation
Abstract: We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.

Title: InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Authors: Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09283
Pdf URL: https://arxiv.org/pdf/2412.09283
Copy Paste: [[2412.09283]] InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption(https://arxiv.org/abs/2412.09283)
Keywords: generation
Abstract: Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

Title: Transfer Learning of RSSI to Improve Indoor Localisation Performance

Authors: Thanaphon Suwannaphong, Ryan McConville, Ian Craddock
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09292
Pdf URL: https://arxiv.org/pdf/2412.09292
Copy Paste: [[2412.09292]] Transfer Learning of RSSI to Improve Indoor Localisation Performance(https://arxiv.org/abs/2412.09292)
Keywords: generative
Abstract: With the growing demand for health monitoring systems, in-home localisation is essential for tracking patient conditions. The unique spatial characteristics of each house require annotated data for Bluetooth Low Energy (BLE) Received Signal Strength Indicator (RSSI)-based monitoring systems. However, collecting annotated training data is time-consuming, particularly for patients with limited health conditions. To address this, we propose Conditional Generative Adversarial Networks (ConGAN)-based augmentation, combined with our transfer learning framework (T-ConGAN), to enable the transfer of generic RSSI information between different homes, even when data is collected using different experimental protocols. This enhances the performance and scalability of such intelligent systems by reducing the need for annotation in each home. We are the first to demonstrate that BLE RSSI data can be shared across different homes, and that shared information can improve the indoor localisation performance. Our T-ConGAN enhances the macro F1 score of room-level indoor localisation by up to 12.2%, with a remarkable 51% improvement in challenging areas such as stairways or outside spaces. This state-of-the-art RSSI augmentation model significantly enhances the robustness of in-home health monitoring systems.
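
Since the headline result is reported as macro F1 over room-level predictions, a short reference computation may help: macro F1 is the unweighted mean of the per-class F1 scores, so a rarely visited room such as a stairway counts as much as a frequently visited one. The room labels and predictions below are made-up illustrative data, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical room labels for four RSSI windows.
y_true = ["kitchen", "kitchen", "stairs", "lounge"]
y_pred = ["kitchen", "lounge", "stairs", "lounge"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```

Because every class contributes equally, large gains on small classes (the abstract mentions stairways and outside spaces) can move macro F1 more than they would move plain accuracy.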

Title: GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression

Authors: Ziqi Zhou, Weize Quan, Hailin Shi, Wei Li, Lili Wang, Dong-ming Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09296
Pdf URL: https://arxiv.org/pdf/2412.09296
Copy Paste: [[2412.09296]] GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression(https://arxiv.org/abs/2412.09296)
Keywords: generation
Abstract: Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.

Title: T-SVG: Text-Driven Stereoscopic Video Generation

Authors: Qiao Jin, Xiaodong Chen, Wu Liu, Tao Mei, Yongdong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09323
Pdf URL: https://arxiv.org/pdf/2412.09323
Copy Paste: [[2412.09323]] T-SVG: Text-Driven Stereoscopic Video Generation(https://arxiv.org/abs/2412.09323)
Keywords: generation
Abstract: The advent of stereoscopic videos has opened new horizons in multimedia, particularly in extended reality (XR) and virtual reality (VR) applications, where immersive content captivates audiences across various platforms. Despite its growing popularity, producing stereoscopic videos remains challenging due to the technical complexities involved in generating stereo parallax. This refers to the positional differences of objects viewed from two distinct perspectives and is crucial for creating depth perception. This complex process poses significant challenges for creators aiming to deliver convincing and engaging presentations. To address these challenges, this paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system. This innovative, model-agnostic, zero-shot approach streamlines video generation by using text prompts to create reference videos. These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences, achieving a natural stereoscopic effect. T-SVG represents a significant advancement in stereoscopic content creation by integrating state-of-the-art, training-free techniques in text-to-video generation, depth estimation, and video inpainting. Its flexible architecture ensures high efficiency and user-friendliness, allowing seamless updates with newer models without retraining. By simplifying the production pipeline, T-SVG makes stereoscopic video generation accessible to a broader audience, demonstrating its potential to revolutionize the field.
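
The stereo parallax described above can be illustrated with an ideal pinhole camera pair: a 3D point projects to slightly different horizontal positions in the left and right views, and that difference shrinks with depth. The sketch assumes two parallel cameras separated by `baseline` along the x-axis with a shared focal length in pixels. The function names and numeric values are illustrative and are not taken from T-SVG.

```python
def project_stereo(point, focal, baseline):
    """Project a 3D point (x, y, z) into left and right pinhole views.

    The cameras are assumed parallel, centred at x = -baseline/2 (left)
    and x = +baseline/2 (right), looking down the +z axis.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the cameras (z > 0)")
    u_left = focal * (x + baseline / 2) / z
    u_right = focal * (x - baseline / 2) / z
    v = focal * y / z
    return (u_left, v), (u_right, v)


def disparity(point, focal, baseline):
    """Horizontal parallax between the two views; equals focal*baseline/z."""
    (u_left, _), (u_right, _) = project_stereo(point, focal, baseline)
    return u_left - u_right


near = disparity((0.0, 0.0, 2.0), focal=500.0, baseline=0.06)
far = disparity((0.0, 0.0, 4.0), focal=500.0, baseline=0.06)
print(near > far)  # nearer points shift more between the two views
```

Rendering a point cloud from two such viewpoints produces image pairs whose per-pixel offsets encode depth, which is the cue a viewer's visual system uses for depth perception.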

Title: Are Conditional Latent Diffusion Models Effective for Image Restoration?

Authors: Yunchen Yuan, Junyuan Xiao, Xinjie Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09324
Pdf URL: https://arxiv.org/pdf/2412.09324
Copy Paste: [[2412.09324]] Are Conditional Latent Diffusion Models Effective for Image Restoration?(https://arxiv.org/abs/2412.09324)
Keywords: restoration, generation
Abstract: Recent advancements in image restoration increasingly employ conditional latent diffusion models (CLDMs). While these models have demonstrated notable performance improvements in recent years, this work questions their suitability for image restoration (IR) tasks. CLDMs excel in capturing high-level semantic correlations, making them effective for tasks like text-to-image generation with spatial conditioning. However, in IR, where the goal is to enhance image perceptual quality, these models have difficulty modeling the relationship between degraded images and ground truth images using a low-level representation. To support our claims, we compare state-of-the-art CLDMs with traditional image restoration models through extensive experiments. Results reveal that despite the scaling advantages of CLDMs, they suffer from high distortion and semantic deviation, especially in cases with minimal degradation, where traditional methods outperform them. Additionally, we perform empirical studies to examine the impact of various CLDM design elements on their restoration performance. We hope this finding inspires a reexamination of current CLDM-based IR solutions, opening up more opportunities in this field.

Title: Auto-Regressive Moving Diffusion Models for Time Series Forecasting

Authors: Jiaxin Gao, Qinglong Cao, Yuntian Chen
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2412.09328
Pdf URL: https://arxiv.org/pdf/2412.09328
Copy Paste: [[2412.09328]] Auto-Regressive Moving Diffusion Models for Time Series Forecasting(https://arxiv.org/abs/2412.09328)
Keywords: generation
Abstract: Time series forecasting (TSF) is essential in various domains, and recent advancements in diffusion-based TSF models have shown considerable promise. However, these models typically adopt traditional diffusion patterns, treating TSF as a noise-based conditional generation task. This approach neglects the inherent continuous sequential nature of time series, leading to a fundamental misalignment between diffusion mechanisms and the TSF objective, thereby severely impairing performance. To bridge this misalignment, and inspired by the classic Auto-Regressive Moving Average (ARMA) theory, which views time series as continuous sequential progressions evolving from previous data points, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model to first achieve the continuous sequential diffusion-based TSF. Unlike previous methods that start from white Gaussian noise, our model employs chain-based diffusion with priors, accurately modeling the evolution of time series and leveraging intermediate state information to improve forecasting accuracy and stability. Specifically, our approach reinterprets the diffusion process by considering future series as the initial state and historical series as the final state, with intermediate series generated using a sliding-based technique during the forward process. This design aligns the diffusion model's sampling procedure with the forecasting objective, resulting in an unconditional, continuous sequential diffusion TSF model. Extensive experiments conducted on seven widely used datasets demonstrate that our model achieves state-of-the-art performance, significantly outperforming existing diffusion-based TSF models. Our code is available on GitHub: this https URL.
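
The reinterpreted forward process (future series as the initial state, historical series as the final state, intermediate series obtained by sliding) can be sketched with plain window slicing. This is a simplified reading under stated assumptions: history and future have the same length `horizon`, the window moves back by exactly one time step per diffusion step, and the function name `sliding_states` is hypothetical. The actual model may interpolate or weight states differently.

```python
def sliding_states(series, horizon):
    """Return the chain of states x^0 ... x^T for one training sample.

    series: history followed by future, length 2 * horizon.
    State t is the window shifted t steps back from the future window,
    so state 0 is the future series and state `horizon` is the history.
    """
    if len(series) != 2 * horizon:
        raise ValueError("series must contain history and future of equal length")
    return [series[horizon - t: 2 * horizon - t] for t in range(horizon + 1)]


states = sliding_states([1, 2, 3, 4, 5, 6], horizon=3)
print(states[0])   # [4, 5, 6]  future series (initial state)
print(states[-1])  # [1, 2, 3]  historical series (final state)
```

Running the chain in reverse then starts from the observed history and ends at the forecast, which is how the sampling procedure lines up with the forecasting objective without starting from Gaussian noise.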
摘要：时间序列预测 (TSF) 在各个领域都至关重要，基于扩散的 TSF 模型的最新进展显示出了巨大的潜力。然而，这些模型通常采用传统的扩散模式，将 TSF 视为基于噪声的条件生成任务。这种方法忽略了时间序列固有的连续顺序性质，导致扩散机制与 TSF 目标之间存在根本性的不一致，从而严重损害了性能。为了弥合这种不一致，我们受到经典自回归移动平均 (ARMA) 理论的启发，该理论将时间序列视为从先前数据点演变而来的连续顺序进程，提出了一种新颖的自回归移动扩散 (ARMD) 模型，首先实现基于连续顺序扩散的 TSF。与以前从白高斯噪声开始的方法不同，我们的模型采用基于链的先验扩散，准确模拟时间序列的演变并利用中间状态信息来提高预测准确性和稳定性。具体而言，我们的方法重新解释了扩散过程，将未来序列视为初始状态，将历史序列视为最终状态，并在前向过程中使用基于滑动的技术生成中间序列。这种设计将扩散模型的采样过程与预测目标相结合，从而形成了一个无条件的连续顺序扩散 TSF 模型。在七个广泛使用的数据集上进行的大量实验表明，我们的模型实现了最先进的性能，明显优于现有的基于扩散的 TSF 模型。我们的代码可在 GitHub 上找到：此 https URL。
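
The reinterpretation described in the abstract (future series as the initial state, historical series as the final state, intermediate series produced by sliding) can be illustrated with a small sketch. This is not the authors' implementation: the function name `sliding_forward_states`, the equal-length assumption for history and future, and the one-point shift per step are all assumptions made here for illustration, and any learned or noise-related components of ARMD are omitted.

```python
def sliding_forward_states(history, future):
    """Return the list of window states from the future series (step 0)
    to the historical series (last step), shifting one point per step.

    Assumption for this sketch: history and future have the same length.
    """
    if len(history) != len(future):
        raise ValueError("this sketch assumes equal-length history and future")
    series = list(history) + list(future)
    n = len(future)
    # Step t looks at the window that starts t points earlier than the future.
    return [series[n - t: 2 * n - t] for t in range(n + 1)]


states = sliding_forward_states([1, 2, 3], [4, 5, 6])
for step, window in enumerate(states):
    print(step, window)
```

Under these assumptions, reading the list in reverse moves from the historical window toward the future window, which is the direction the abstract associates with sampling.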

Title: Causal Graphical Models for Vision-Language Compositional Understanding

Authors: Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.09353
Pdf URL: https://arxiv.org/pdf/2412.09353
Copy Paste: [[2412.09353]] Causal Graphical Models for Vision-Language Compositional Understanding(https://arxiv.org/abs/2412.09353)
Keywords: generative
Abstract: Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Unlike standard autoregressive or parallel predictions, our decoder's generative process is partially ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence, discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets.
摘要：最近的研究通过实证研究显示，视觉语言模型 (VLM) 很难完全理解人类语言的组成特性，通常将图像标题建模为“词袋”。因此，它们在组成任务上表现不佳，因为组成任务需要更深入地理解句子的不同实体（主语、动词等）及其相互关系才能解决。在本文中，我们使用因果图模型 (CGM) 来建模文本和视觉标记之间的依赖关系，该模型使用依赖解析器构建，并训练由 VLM 视觉编码器调节的解码器。与标准自回归或并行预测不同，我们的解码器的生成过程遵循 CGM 结构进行部分排序。这种结构鼓励解码器仅学习句子中的主要因果依赖关系，丢弃虚假相关性。通过在五个组合基准上进行大量实验，我们表明，我们的方法显著优于所有最先进的组合方法，并且比使用更大的数据集训练的方法有所改进。

Title: UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer

Authors: Delong Liu, Zhaohui Hou, Mingjie Zhan, Shihao Han, Zhicheng Zhao, Fei Su
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09389
Pdf URL: https://arxiv.org/pdf/2412.09389
Copy Paste: [[2412.09389]] UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer(https://arxiv.org/abs/2412.09389)
Keywords: generation
Abstract: Recently, diffusion-based video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks. The code will be publicly available at this https URL.
摘要：最近，基于扩散的视频生成模型取得了显著的成功。然而，现有的模型往往存在一致性弱、图像质量随时间下降等问题。为了克服这些挑战，受美学原则的启发，我们提出了一种非侵入式插件，称为统一帧组织器（UFO），它与任何基于扩散的视频生成模型兼容。UFO 包含一系列可调强度的自适应适配器，可以显著增强视频前景和背景之间的一致性，并在集成时不改变原始模型参数的情况下提高图像质量。UFO 的训练简单、高效、所需资源最少，并支持风格化训练。其模块化设计允许多个 UFO 组合，从而实现个性化视频生成模型的定制。此外，UFO 还支持在同一规格的不同模型之间直接迁移，而无需进行特定的再训练。实验结果表明，UFO 有效地提高了视频生成质量，并在公共视频生成基准中展示了其优越性。代码将在此 https URL 上公开提供。

Title: Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Authors: Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.09428
Pdf URL: https://arxiv.org/pdf/2412.09428
Copy Paste: [[2412.09428]] Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation(https://arxiv.org/abs/2412.09428)
Keywords: generation
Abstract: Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application in multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks, along with experiments on controllability. The results demonstrate that VMB significantly enhances music quality, modality, and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at this https URL.
摘要：多模态音乐生成旨在从多种输入模态（包括文本、视频和图像）中生成音乐。现有方法使用通用嵌入空间进行多模态融合。尽管它们在其他模态中很有效，但它们在多模态音乐生成中的应用面临着数据稀缺、跨模态对齐薄弱和可控性有限的挑战。本文通过使用文本和音乐的显式桥梁进行多模态对齐来解决这些问题。我们介绍了一种名为“视觉音乐桥梁”（VMB）的新方法。具体来说，多模态音乐描述模型将视觉输入转换为详细的文本描述以提供文本桥梁；双轨音乐检索模块结合了广泛和有针对性的检索策略来提供音乐桥梁并允许用户控制。最后，我们设计了一个显式条件音乐生成框架，以基于这两个桥梁生成音乐。我们对视频到音乐、图像到音乐、文本到音乐和可控音乐生成任务进行了实验，并对可控性进行了实验。结果表明，与以前的方法相比，VMB 显著提高了音乐质量、模态和定制一致性。VMB 为可解释和富有表现力的多模态音乐生成树立了新标准，可应用于各种多媒体领域。演示和代码可在此 https URL 上找到。

Title: Search Strategy Generation for Branch and Bound Using Genetic Programming

Authors: Gwen Maudet, Grégoire Danoy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.09444
Pdf URL: https://arxiv.org/pdf/2412.09444
Copy Paste: [[2412.09444]] Search Strategy Generation for Branch and Bound Using Genetic Programming(https://arxiv.org/abs/2412.09444)
Keywords: generation
Abstract: Branch-and-Bound (B&B) is an exact method in integer programming that recursively divides the search space into a tree. During the resolution process, determining the next subproblem to explore within the tree, known as the search strategy, is crucial. Hand-crafted heuristics are commonly used, but none are effective over all problem classes. Recent approaches utilizing neural networks claim to make more intelligent decisions but are computationally expensive. In this paper, we introduce GP2S (Genetic Programming for Search Strategy), a novel machine learning approach that automatically generates a B&B search strategy heuristic, aiming to make intelligent decisions while being computationally lightweight. We define a policy as a function that evaluates the quality of a B&B node by combining features from the node and the problem; the search strategy policy is then defined by a best-first search based on this node ranking. The policy space is explored using a genetic programming algorithm, and the policy that achieves the best performance on a training set is selected. We compare our approach with the standard method of the SCIP solver, a recent graph neural network-based method, and handcrafted heuristics. Our first evaluation includes three types of primal hard problems, tested on instances similar to the training set and on larger instances. Our method is at most 2% slower than the best baseline and consistently outperforms SCIP, achieving an average speedup of 11.3%. Additionally, GP2S is tested on the MIPLIB 2017 dataset, generating multiple heuristics from different subsets of instances. It exceeds SCIP's average performance in 7 out of 10 cases across 15 times more instances and under a time limit 15 times longer, with some GP2S methods leading on most experiments in terms of the number of feasible solutions or optimality gap.
摘要：分支定界 (B\&B) 是整数规划中的一种精确方法，它以递归方式将搜索空间划分为一棵树。在解决过程中，确定树中要探索的下一个子问题（称为搜索策略）至关重要。通常使用手工制作的启发式方法，但没有一种对所有问题类别都有效。最近利用神经网络的方法声称可以做出更智能的决策，但计算成本高昂。在本文中，我们介绍了 GP2S（搜索策略的遗传编程），这是一种新颖的机器学习方法，可自动生成 B\&B 搜索策略启发式方法，旨在做出智能决策，同时计算量小。我们将策略定义为一个函数，该函数通过结合节点和问题的特征来评估 B\&B 节点的质量；然后通过基于此节点排名的最佳优先搜索来定义搜索策略策略。使用遗传编程算法探索策略空间，并选择在训练集上实现最佳性能的策略。我们将我们的方法与 SCIP 求解器的标准方法、一种最新的基于图神经网络的方法以及手工启发式方法进行了比较。我们的首次评估包括三种类型的原始难题，在与训练集相似的实例和更大的实例上进行了测试。我们的方法最多比最佳基线慢 2%，并且始终优于 SCIP，平均加速 11.3%。此外，GP2S 在 MIPLIB 2017 数据集上进行了测试，从不同的实例子集生成了多个启发式方法。在 15 倍的实例和 15 倍的时间限制内，它在 10 个案例中有 7 个超过了 SCIP 的平均性能，一些 GP2S 方法在可行解决方案数量或最优性差距方面在大多数实验中处于领先地位。
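
The abstract defines the search strategy as a best-first search driven by a policy that scores each B&B node. A minimal sketch of that node-selection loop follows. It is not the GP2S code: the class name `BestFirstQueue`, the node fields `lower_bound` and `depth`, and the example scoring expression are hypothetical stand-ins for whatever features and evolved expression the paper uses, and the convention that a higher score means "explore sooner" is an assumption.

```python
import heapq
from itertools import count


class BestFirstQueue:
    """Open list that always returns the node with the highest policy score."""

    def __init__(self, policy):
        self._policy = policy   # function: node -> float (higher is better)
        self._heap = []
        self._tie = count()     # insertion counter so nodes are never compared

    def push(self, node):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-self._policy(node), next(self._tie), node))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def example_policy(node):
    """Hypothetical hand-written score standing in for an evolved expression."""
    return -node["lower_bound"] + 0.1 * node["depth"]


queue = BestFirstQueue(example_policy)
queue.push({"name": "a", "lower_bound": 5.0, "depth": 1})
queue.push({"name": "b", "lower_bound": 3.0, "depth": 2})
queue.push({"name": "c", "lower_bound": 4.0, "depth": 0})
order = [queue.pop()["name"] for _ in range(len(queue))]
print(order)
```

In a genetic programming setup, `example_policy` would be replaced by each candidate expression in turn, and the solver's performance on the training instances would serve as that candidate's fitness.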

Title: OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

Authors: Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09465
Pdf URL: https://arxiv.org/pdf/2412.09465
Copy Paste: [[2412.09465]] OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs(https://arxiv.org/abs/2412.09465)
Keywords: restoration, super-resolution, generative
Abstract: Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for the same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on challenging datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code and pre-trained models are available at this https URL and this https URL, respectively.
摘要：扩散和基于流的生成模型的最新进展在图像恢复任务中取得了显著的成功，与传统的深度学习方法相比，它们实现了卓越的感知质量。然而，这些方法要么需要大量的采样步骤来生成高质量的图像，从而产生大量的计算开销，要么依赖于模型蒸馏，这通常会施加固定的保真度-真实度权衡，因此缺乏灵活性。在本文中，我们介绍了 OFTSR，这是一种基于流的新型单步图像超分辨率框架，可以产生具有可调保真度和真实度的输出。我们的方法首先训练一个条件基于流的超分辨率模型作为教师模型。然后，我们通过应用专门的约束来提炼这个教师模型。具体来说，我们强制我们的单步学生模型对相同输入的预测位于教师模型的相同采样 ODE 轨迹上。这种对齐确保学生模型从初始状态的单步预测与教师从更接近的中间状态的预测相匹配。通过在 FFHQ (256$\times$256)、DIV2K 和 ImageNet (256$\times$256) 等具有挑战性的数据集上进行大量实验，我们证明了 OFTSR 在一步图像超分辨率方面实现了最先进的性能，同时能够灵活地调整保真度与真实度之间的权衡。代码和预训练模型分别在此 https URL 和此 https URL 上提供。

Title: SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing

Authors: Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09545
Pdf URL: https://arxiv.org/pdf/2412.09545
Copy Paste: [[2412.09545]] SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing(https://arxiv.org/abs/2412.09545)
Keywords: generation, generative
Abstract: We introduce SimAvatar, a framework designed to generate simulation-ready clothed 3D human avatars from a text prompt. Current text-driven human avatar generation methods either model hair, clothing, and the human body using a unified geometry or produce hair and garments that are not easily adaptable for simulation within existing simulation pipelines. The primary challenge lies in representing the hair and garment geometry in a way that allows leveraging established prior knowledge from foundational image diffusion models (e.g., Stable Diffusion) while being simulation-ready using either physics or neural simulators. To address this task, we propose a two-stage framework that combines the flexibility of 3D Gaussians with simulation-ready hair strands and garment meshes. Specifically, we first employ three text-conditioned 3D generative models to generate garment mesh, body shape and hair strands from the given text prompt. To leverage prior knowledge from foundational diffusion models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair strands and learn the avatar appearance through optimization. To drive the avatar given a pose sequence, we first apply physics simulators onto the garment meshes and hair strands. We then transfer the motion onto 3D Gaussians through carefully designed mechanisms for each body part. As a result, our synthesized avatars have vivid texture and realistic dynamic motion. To the best of our knowledge, our method is the first to produce highly realistic, fully simulation-ready 3D avatars, surpassing the capabilities of current approaches.
摘要：我们介绍了 SimAvatar，这是一个旨在根据文本提示生成可用于模拟的穿着衣服的 3D 人体头像的框架。当前基于文本的人体头像生成方法要么使用统一的几何图形来建模头发、衣服和人体，要么生成不易适应现有模拟流程模拟的头发和服装。主要挑战在于以一种允许利用基础图像扩散模型（例如稳定扩散）中已建立的先验知识的方式来表示头发和服装的几何形状，同时可以使用物理或神经模拟器进行模拟。为了完成这项任务，我们提出了一个两阶段框架，将 3D 高斯的灵活性与可用于模拟的发束和服装网格相结合。具体来说，我们首先采用三个文本条件 3D 生成模型，根据给定的文本提示生成服装网格、体形和发束。为了利用基础扩散模型中的先验知识，我们将 3D 高斯附加到身体网格、服装网格以及发束上，并通过优化来学习头像外观。为了根据给定的姿势序列驱动虚拟角色，我们首先将物理模拟器应用到服装网格和发束上。然后，我们通过为每个身体部位精心设计的机制将运动转移到 3D 高斯上。因此，我们合成的虚拟角色具有生动的纹理和逼真的动态运动。据我们所知，我们的方法是第一个能够生成高度逼真、完全可模拟的 3D 虚拟角色的方法，超越了当前方法的能力。

Title: Video Creation by Demonstration

Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09551
Pdf URL: https://arxiv.org/pdf/2412.09551
Copy Paste: [[2412.09551]] Video Creation by Demonstration(https://arxiv.org/abs/2412.09551)
Keywords: generation
Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential for interactive world simulation. Sampled video generation results are available at this https URL.
摘要：我们探索了一种新颖的视频创作体验，即通过演示进行视频创作。给定一个演示视频和一个来自不同场景的上下文图像，我们生成一个物理上可信的视频，该视频自然地延续了上下文图像并执行了演示中的动作概念。为了实现此功能，我们提出了 $\delta$-Diffusion，这是一种自监督训练方法，通过条件未来帧预测从未标记的视频中学习。与大多数基于显式信号的现有视频生成控制不同，我们采用隐式潜在控制的形式，以实现一般视频所需的最大灵活性和表现力。通过利用顶部具有外观瓶颈设计的视频基础模型，我们从演示视频中提取动作潜伏，以最小的外观泄漏来调节生成过程。从经验上看，$\delta$-Diffusion 在人类偏好和大规模机器评估方面都优于相关基线，并展示了交互式世界模拟的潜力。采样视频生成结果可在此 https URL 上获得。

Title: LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors

Authors: Yabo Chen, Chen Yang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Wei Shen, Wenrui Dai, Hongkai Xiong, Qi Tian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.09597
Pdf URL: https://arxiv.org/pdf/2412.09597
Copy Paste: [[2412.09597]] LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors(https://arxiv.org/abs/2412.09597)
Keywords: generation, generative
Abstract: Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e. MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, i.e. LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
摘要：由于固有的几何模糊性和有限的视点信息，单幅图像 3D 重建仍然是计算机视觉领域的一项基本挑战。潜在视频扩散模型 (LVDM) 的最新进展提供了从大规模视频数据中学习到的有希望的 3D 先验。然而，有效利用这些先验面临三个关键挑战：(1) 大相机运动导致质量下降，(2) 难以实现精确的相机控制，以及 (3) 扩散过程固有的几何扭曲会损害 3D 一致性。我们通过提出 LiftImage3D 来解决这些挑战，这是一个有效释放 LVDM 的生成先验同时确保 3D 一致性的框架。具体来说，我们设计了一种铰接轨迹策略来生成视频帧，将具有大相机运动的视频序列分解为具有可控小运动的视频序列。然后我们使用强大的神经匹配模型，即 MASt3R，来校准生成帧的相机姿势并生成相应的点云。最后，我们提出了一种失真感知的 3D 高斯 splatting 表示，它可以学习帧之间的独立失真并输出未失真的正则高斯。大量实验表明，LiftImage3D 在两个具有挑战性的数据集（即 LLFF、DL3DV 和 Tanks and Temples）上实现了最先进的性能，并且可以很好地推广到各种自然图像，从卡通插图到复杂的现实世界场景。

Title: Owl-1: Omni World Model for Consistent Long Video Generation

Authors: Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09600
Pdf URL: https://arxiv.org/pdf/2412.09600
Copy Paste: [[2412.09600]] Owl-1: Omni World Model for Consistent Long Video Generation(https://arxiv.org/abs/2412.09600)
Keywords: generation
Abstract: Video generation models (VGMs) have received extensive attention recently and serve as promising candidates for general-purpose large vision models. While they can only generate short videos each time, existing methods achieve long video generation by iteratively calling the VGMs, using the last-frame output as the condition for the next-round generation. However, the last frame only contains short-term fine-grained information about the scene, resulting in inconsistency in the long horizon. To address this, we propose an Omni World modeL (Owl-1) to produce long-term coherent and comprehensive conditions for consistent long video generation. As videos are observations of the underlying evolving world, we propose to model the long-term developments in a latent space and use VGMs to film them into videos. Specifically, we represent the world with a latent state variable which can be decoded into explicit video observations. These observations serve as a basis for anticipating temporal dynamics which in turn update the state variable. The interaction between evolving dynamics and persistent state enhances the diversity and consistency of the long videos. Extensive experiments show that Owl-1 achieves comparable performance with SOTA methods on VBench-I2V and VBench-Long, validating its ability to generate high-quality video observations. Code: this https URL.
摘要：视频生成模型 (VGM) 最近受到广泛关注，并成为通用大型视觉模型的有希望的候选者。虽然它们每次只能生成短视频，但现有方法通过迭代调用 VGM 来实现长视频生成，使用最后一帧的输出作为下一轮生成的条件。然而，最后一帧只包含关于场景的短期细粒度信息，导致长期不一致。为了解决这个问题，我们提出了一个 Omni World 模型 (Owl-1)，以产生长期连贯和全面的条件，以实现一致的长视频生成。由于视频是对底层不断发展的世界的观察，我们建议在潜在空间中对长期发展进行建模，并使用 VGM 将它们拍摄成视频。具体来说，我们用一个潜在状态变量来表示世界，该变量可以解码为明确的视频观察。这些观察是预测时间动态的基础，进而更新状态变量。不断发展的动态和持久状态之间的相互作用增强了长视频的多样性和一致性。大量实验表明，Owl-1 在 VBench-I2V 和 VBench-Long 上实现了与 SOTA 方法相当的性能，验证了其生成高质量视频观测的能力。代码：这个 https URL。

Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09604
Pdf URL: https://arxiv.org/pdf/2412.09604
Copy Paste: [[2412.09604]] SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding(https://arxiv.org/abs/2412.09604)
Keywords: generation
Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.
摘要：大型语言模型 (LLM) 的显著成功已扩展到多模态领域，在图像理解和生成方面取得了出色的表现。最近，人们致力于开发集成这些功能的统一多模态大型语言模型 (MLLM)，并取得了令人鼓舞的成果。然而，现有的方法通常涉及模型架构或训练流程的复杂设计，增加了模型训练和扩展的难度。在本文中，我们提出了 SynerGen-VL，这是一种简单但功能强大的无编码器 MLLM，能够同时进行图像理解和生成。为了解决现有无编码器统一 MLLM 中发现的挑战，我们引入了 token 折叠机制和基于视觉专家的渐进式对齐预训练策略，它们有效地支持高分辨率图像理解，同时降低了训练复杂度。在使用统一的下一个 token 预测目标对大规模混合图像文本数据进行训练后，SynerGen-VL 达到或超越了具有相当或更小参数大小的现有无编码器统一 MLLM 的性能，并缩小了与特定任务的最先进的模型的差距，为未来统一 MLLM 指明了一条有希望的道路。我们的代码和模型将会发布。

Title: Spectral Image Tokenizer

Authors: Carlos Esteves, Mohammed Suhail, Ameesh Makadia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09607
Pdf URL: https://arxiv.org/pdf/2412.09607
Copy Paste: [[2412.09607]] Spectral Image Tokenizer(https://arxiv.org/abs/2412.09607)
Keywords: generation
Abstract: Image tokenizers map images to sequences of discrete tokens, and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages that natural images are more compressible at high frequencies, 2) it can take and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction -- instead of conditioning on a partial line-by-line reconstruction of the image, it takes a coarse reconstruction of the full image, 4) it enables partial decoding where the first few generated tokens can reconstruct a coarse version of the image, 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer reconstruction metrics as well as multiscale image generation, text-guided image upsampling and editing.
摘要：图像标记器将图像映射到离散标记序列，是基于自回归变换器的图像生成的关键组件。标记通常与输入图像中的空间位置相关联，按光栅扫描顺序排列，这对于自回归建模来说并不理想。在本文中，我们建议对从离散小波变换 (DWT) 获得的图像频谱进行标记，以便标记序列以由粗到细的方式表示图像。我们的标记器带来了几个优点：1) 它利用了自然图像在高频下更易压缩的特点，2) 它可以拍摄和重建不同分辨率的图像而无需重新训练，3) 它改进了下一个标记预测的条件——它不是对图像的部分逐行重建进行条件，而是对整个图像进行粗略重建，4) 它支持部分解码，其中前几个生成的标记可以重建图像的粗略版本，5) 它使自回归模型可用于图像上采样。我们评估了标记器重建指标以及多尺度图像生成、文本引导的图像上采样和编辑。
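
The coarse-to-fine ordering that the abstract obtains from a discrete wavelet transform can be sketched on a 1D signal. This is an illustration only, not the paper's tokenizer: it uses an unnormalized averaging Haar transform chosen here for readability, works on a list of numbers rather than a 2D image, and leaves out the step that turns coefficients into discrete tokens. The function names are hypothetical.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse) and half-differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def coarse_to_fine_coefficients(signal, levels):
    """Return the coefficients ordered from the coarsest band to the finest."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    ordered = list(approx)             # coarsest approximation comes first
    for detail in reversed(details):   # then details, coarse band before fine band
        ordered.extend(detail)
    return ordered


print(coarse_to_fine_coefficients([4, 2, 6, 8], levels=2))
```

With this ordering, a prefix of the sequence already describes a low-resolution version of the whole signal, which is the property the abstract relies on for partial decoding and upsampling.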

Title: FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Authors: Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09611
Pdf URL: https://arxiv.org/pdf/2412.09611
Copy Paste: [[2412.09611]] FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers(https://arxiv.org/abs/2412.09611)
Keywords: generation
Abstract: Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation prevents the ability to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method leveraging a representation space with the ability to control the semantics of images generated by rectified flow transformers, such as Flux. By leveraging the representations learned by the transformer blocks within the rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach, along with its disentanglement capabilities.
摘要：整流流模型已成为图像生成领域的主流方法，在高质量图像合成方面展现出令人印象深刻的能力。然而，尽管整流流模型在视觉生成方面非常有效，但它们在图像解缠编辑方面往往存在困难。这一限制阻碍了执行精确的、特定于属性的修改而不影响图像的不相关方面的能力。在本文中，我们介绍了 FluxSpace，这是一种与领域无关的图像编辑方法，它利用表示空间来控制整流流转换器（如 Flux）生成的图像的语义。通过利用整流流模型中的转换器块学习到的表示，我们提出了一组语义上可解释的表示，可实现从细粒度图像编辑到艺术创作等广泛的图像编辑任务。这项工作提供了一种可扩展且有效的图像编辑方法及其解缠功能。

Title: Olympus: A Universal Task Router for Computer Vision Tasks

Authors: Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09612
Pdf URL: https://arxiv.org/pdf/2412.09612
Copy Paste: [[2412.09612]] Olympus: A Universal Task Router for Computer Vision Tasks(https://arxiv.org/abs/2412.09612)
Keywords: generative
Abstract: We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: this https URL
摘要：我们引入了 Olympus，这是一种将多模态大型语言模型 (MLLM) 转换为能够处理各种计算机视觉任务的统一框架的新方法。利用控制器 MLLM，Olympus 将图像、视频和 3D 对象的 20 多个专门任务委托给专用模块。这种基于指令的路由通过链式操作实现复杂的工作流程，而无需训练繁重的生成模型。Olympus 可轻松与现有的 MLLM 集成，以可比的性能扩展其功能。实验结果表明，Olympus 在 20 个任务中实现了 94.75% 的平均路由准确率，在链式操作场景中实现了 91.82% 的精度，展示了其作为通用任务路由器的有效性，可以解决各种计算机视觉任务。项目页面：此 https URL

Title: Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG

Authors: Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09614
Pdf URL: https://arxiv.org/pdf/2412.09614
Copy Paste: [[2412.09614]] Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge Graph-Based RAG(https://arxiv.org/abs/2412.09614)
Keywords: generation
Abstract: We introduce a novel approach to enhance the capabilities of text-to-image models by incorporating a graph-based RAG. Our system dynamically retrieves detailed character information and relational data from the knowledge graph, enabling the generation of visually accurate and contextually rich images. This capability significantly improves upon the limitations of existing T2I models, which often struggle with the accurate depiction of complex or culturally specific subjects due to dataset constraints. Furthermore, we propose a novel self-correcting mechanism for text-to-image models to ensure consistency and fidelity in visual outputs, leveraging the rich context from the graph to guide corrections. Our qualitative and quantitative experiments demonstrate that Context Canvas significantly enhances the capabilities of popular models such as Flux, Stable Diffusion, and DALL-E, and improves the functionality of ControlNet for fine-grained image editing tasks. To our knowledge, Context Canvas represents the first application of graph-based RAG in enhancing T2I models, representing a significant advancement for producing high-fidelity, context-aware multi-faceted images.
摘要：我们引入了一种新方法，通过结合基于图的 RAG 来增强文本到图像模型的功能。我们的系统从知识图谱中动态检索详细的字符信息和关系数据，从而生成视觉准确且上下文丰富的图像。此功能显著改善了现有 T2I 模型的局限性，这些模型由于数据集限制而经常难以准确描绘复杂或特定文化的主题。此外，我们为文本到图像模型提出了一种新颖的自我校正机制，以确保视觉输出的一致性和保真度，利用图中的丰富上下文来指导校正。我们的定性和定量实验表明，Context Canvas 显著增强了 Flux、Stable Diffusion 和 DALL-E 等流行模型的功能，并改进了 ControlNet 在细粒度图像编辑任务中的功能。据我们所知，Context Canvas 代表了基于图的 RAG 在增强 T2I 模型方面的首次应用，代表了生成高保真、上下文感知的多面图像的重大进步。

Title: EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09618
Pdf URL: https://arxiv.org/pdf/2412.09618
Copy Paste: [[2412.09618]] EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM(https://arxiv.org/abs/2412.09618)
Keywords: generation
Abstract: Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
摘要：扩散模型的个性化已经取得了重大成就。传统的免调优方法大多通过平均图像嵌入作为注入条件来对多个参考图像进行编码，但这种与图像无关的操作无法在图像之间进行交互以捕获多个参考中的一致视觉元素。虽然基于调优的低秩自适应 (LoRA) 可以通过训练过程有效地提取多幅图像中的一致元素，但它需要对每个不同的图像组进行特定的微调。本文介绍了一种新颖的即插即用自适应方法 EasyRef，它使扩散模型能够以多幅参考图像和文本提示为条件。为了有效地利用多幅图像中的一致视觉元素，我们利用多模态大语言模型 (MLLM) 的多图像理解和指令跟踪能力，促使它根据指令捕获一致的视觉元素。此外，通过适配器将 MLLM 的表示注入到扩散过程中可以轻松推广到看不见的领域，挖掘看不见的数据中的一致视觉元素。为了降低计算成本并增强细粒度细节保存，我们引入了一种高效的参考聚合策略和一种渐进式训练方案。最后，我们引入了 MRBench，这是一种新的多参考图像生成基准。实验结果表明，EasyRef 超越了 IP-Adapter 等无需调整的方法和 LoRA 等基于调整的方法，实现了卓越的美学质量和跨不同领域的稳健零样本泛化。
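
Note: the tuning-free baseline that the EasyRef abstract criticizes (averaging the reference image embeddings into one injection condition) can be sketched as below. This is a minimal illustration, not EasyRef itself and not IP-Adapter's actual code; the function name `average_embeddings` and the toy 3-dimensional vectors are assumptions made for the example.

```python
from typing import List, Sequence


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one reference embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    # Each output coordinate only sees the same coordinate of every input,
    # so the references never "interact" beyond a plain mean.
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


# Hypothetical toy embeddings for three reference images (not real encoder outputs).
refs = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [2.0, 3.0, 2.0]]
condition = average_embeddings(refs)
print(condition)
```

The mean is the same for any ordering of the references and treats each image independently of the others, which is the limitation the abstract points to.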

Title: SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Authors: Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09619
Pdf URL: https://arxiv.org/pdf/2412.09619
Copy Paste: [[2412.09619]] SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training(https://arxiv.org/abs/2412.09619)
Keywords: generation
Abstract: Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
摘要：现有的文本到图像 (T2I) 扩散模型面临一些限制，包括模型尺寸大、运行时间慢以及在移动设备上的生成质量低。本文旨在通过开发一种极小且快速的 T2I 模型来解决所有这些挑战，该模型可在移动平台上生成高分辨率和高质量的图像。我们提出了几种技术来实现这一目标。首先，我们系统地检查网络架构的设计选择，以减少模型参数和延迟，同时确保高质量的生成。其次，为了进一步提高生成质量，我们从更大的模型中采用跨架构知识提炼，使用多层次方法从头开始指导我们的模型训练。第三，我们通过将对抗性指导与知识提炼相结合来实现几步生成。我们的模型 SnapGen 首次展示了在移动设备上大约 1.4 秒内生成 1024x1024 像素图像。在 ImageNet-1K 上，我们的模型仅具有 372M 个参数，在 256x256 像素生成中实现了 2.06 的 FID。在 T2I 基准测试（即 GenEval 和 DPG-Bench）上，我们的模型仅具有 379M 个参数，以显著较小的尺寸超越了具有数十亿个参数的大规模模型（例如，比 SDXL 小 7 倍，比 IF-XL 小 14 倍）。

Title: LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

Authors: Enis Simsar, Thomas Hofmann, Federico Tombari, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09622
Pdf URL: https://arxiv.org/pdf/2412.09622
Copy Paste: [[2412.09622]] LoRACLR: Contrastive Adaptation for Customization of Diffusion Models(https://arxiv.org/abs/2412.09622)
Keywords: generation
Abstract: Recent advances in text-to-image customization have enabled high-fidelity, context-rich generation of personalized images, allowing specific concepts to appear in a variety of scenarios. However, current methods struggle with combining multiple personalized models, often leading to attribute entanglement or requiring separate training to preserve concept distinctiveness. We present LoRACLR, a novel approach for multi-concept image generation that merges multiple LoRA models, each fine-tuned for a distinct concept, into a single, unified model without additional individual fine-tuning. LoRACLR uses a contrastive objective to align and merge the weight spaces of these models, ensuring compatibility while minimizing interference. By enforcing distinct yet cohesive representations for each concept, LoRACLR enables efficient, scalable model composition for high-quality, multi-concept image synthesis. Our results highlight the effectiveness of LoRACLR in accurately merging multiple concepts, advancing the capabilities of personalized image generation.
摘要：文本到图像定制领域的最新进展使得高保真、内容丰富的个性化图像生成成为可能，使特定概念能够出现在各种场景中。然而，当前的方法难以结合多个个性化模型，这往往会导致属性纠缠或需要单独训练才能保持概念的独特性。我们提出了一种用于多概念图像生成的新方法 LoRACLR，它将多个 LoRA 模型（每个模型针对不同的概念进行微调）合并为一个统一的模型，而无需进行额外的单独微调。LoRACLR 使用对比目标来对齐和合并这些模型的权重空间，确保兼容性同时最大限度地减少干扰。通过为每个概念强制执行独特但有凝聚力的表示，LoRACLR 可以实现高效、可扩展的模型组合，从而实现高质量的多概念图像合成。我们的结果突出了 LoRACLR 在准确合并多个概念方面的有效性，提高了个性化图像生成的能力。

Title: OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation

Authors: Weiqi Li, Shijie Zhao, Chong Mou, Xuhan Sheng, Zhenyu Zhang, Qian Wang, Junlin Li, Li Zhang, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09623
Pdf URL: https://arxiv.org/pdf/2412.09623
Copy Paste: [[2412.09623]] OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation(https://arxiv.org/abs/2412.09623)
Keywords: generation
Abstract: As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at this https URL.
摘要：随着虚拟现实越来越受欢迎，对可控的沉浸式动态全向视频 (ODV) 创作的需求也日益增加。虽然以前的文本到 ODV 生成方法取得了令人印象深刻的结果，但由于仅依赖文本输入，它们在内容不准确和不一致方面存在困难。尽管最近的运动控制技术为视频生成提供了细粒度的控制，但将这些方法直接应用于 ODV 通常会导致空间失真和性能不理想，尤其是在复杂的球面运动中。为了应对这些挑战，我们提出了 OmniDrag，这是第一种能够实现场景和对象级运动控制的方法，可实现准确、高质量的全向图像到视频生成。在预训练的视频扩散模型的基础上，我们引入了一个全向控制模块，该模块与时间注意层联合微调，以有效处理复杂的球面运动。此外，我们开发了一种新颖的球面运动估计器，可以准确提取运动控制信号，并允许用户通过简单地绘制手柄和目标点来执行拖动式 ODV 生成。我们还提出了一个名为 Move360 的新数据集，以解决具有大量场景和物体运动的 ODV 数据稀缺的问题。实验证明了 OmniDrag 在实现 ODV 生成的整体场景级和细粒度物体级控制方面具有显著优势。项目页面可在此 https URL 上找到。

Title: GenEx: Generating an Explorable World

Authors: Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.09624
Pdf URL: https://arxiv.org/pdf/2412.09624
Copy Paste: [[2412.09624]] GenEx: Generating an Explorable World(https://arxiv.org/abs/2412.09624)
Keywords: generation, generative
Abstract: Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation and robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.
摘要：理解、导航和探索 3D 物理现实世界长期以来一直是人工智能发展的核心挑战。在这项工作中，我们通过引入 GenEx 向这一目标迈进了一步，GenEx 是一个能够规划复杂具象世界探索的系统，由其生成性想象力引导，形成对周围环境的先验（期望）。GenEx 只需一张 RGB 图像即可生成整个 3D 一致的想象环境，并通过全景视频流将其变为现实。利用虚幻引擎提供的可扩展 3D 世界数据，我们的生成模型在物理世界中得到了完善。它毫不费力地捕捉连续的 360 度环境，为 AI 代理提供无边无际的景观供其探索和交互。GenEx 实现了高质量的世界生成、长轨迹上的强大循环一致性，并展示了强大的 3D 功能，例如一致性和主动 3D 映射。借助对世界的生成性想象，GPT 辅助代理能够执行复杂的具身任务，包括目标无关的探索和目标驱动的导航。这些代理利用对物理世界未见部分的预测期望来完善他们的信念，根据潜在决策模拟不同的结果，并做出更明智的选择。总之，我们证明了 GenEx 为在想象空间中推进具身 AI 提供了一个变革性平台，并为将这些功能扩展到现实世界的探索带来了潜力。

Title: Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors

Authors: Yue Feng, Vaibhav Sanjay, Spencer Lutz, Badour AlBahar, Songwei Ge, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09625
Pdf URL: https://arxiv.org/pdf/2412.09625
Copy Paste: [[2412.09625]] Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors(https://arxiv.org/abs/2412.09625)
Keywords: generation
Abstract: Automatically generating multiview illusions is a compelling challenge, where a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (i.e., figure-ground or line drawing), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can generate more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering. When viewed from multiple angles, this produces different interpretations. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.
摘要：自动生成多视图幻觉是一项极具挑战性的挑战，因为单个视觉内容从不同的观看角度提供不同的解释。传统方法，例如阴影艺术和线艺术，可以创建有趣的 3D 幻觉，但仅限于简单的视觉输出（即图形背景或线条画），限制了它们的艺术表现力和实用多功能性。最近基于扩散的幻觉生成方法可以生成更复杂的设计，但仅限于 2D 图像。在这项工作中，我们提出了一种基于用户提供的文本提示或图像创建 3D 多视图幻觉的简单而有效的方法。我们的方法利用预先训练的文本到图像扩散模型，通过可微分渲染来优化神经 3D 表示的纹理和几何形状。从多个角度看，这会产生不同的解释。我们开发了几种技术来提高生成的 3D 多视图幻觉的质量。我们通过大量实验证明了我们方法的有效性，并展示了具有多种 3D 形式的幻觉生成。

Title: FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Authors: Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09626
Pdf URL: https://arxiv.org/pdf/2412.09626
Copy Paste: [[2412.09626]] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion(https://arxiv.org/abs/2412.09626)
Keywords: generation
Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exploit the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.
摘要：视觉扩散模型取得了显著进展，但由于缺乏高分辨率数据和计算资源受限，它们通常在有限的分辨率下进行训练，从而阻碍了它们以更高分辨率生成高保真图像或视频的能力。最近的努力探索了无需调整的策略，以展示预训练模型尚未开发的高分辨率视觉生成潜力。然而，这些方法仍然容易产生具有重复模式的低质量视觉内容。关键障碍在于，当模型生成的视觉内容超过其训练分辨率时，高频信息不可避免地会增加，从而导致由累积错误产生的不良重复模式。为了应对这一挑战，我们提出了 FreeScale，这是一种无需调整的推理范式，可通过尺度融合实现更高分辨率的视觉生成。具体而言，FreeScale 处理来自不同接受尺度的信息，然后通过提取所需的频率分量对其进行融合。大量实验验证了我们的范式在扩展图像和视频模型的高分辨率视觉生成能力方面的优越性。值得注意的是，与之前表现最佳的方法相比，FreeScale首次解锁了8k分辨率图像的生成。
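
Note: the FreeScale abstract describes fusing information from different receptive scales by extracting the desired frequency components. The sketch below shows that general idea on a 1D signal. It is an illustration under stated assumptions, not the authors' algorithm: the box blur as low-pass filter, the names `box_blur` and `fuse_scales`, and the choice to take low frequencies from the global branch and high frequencies from the local branch are all hypothetical.

```python
from typing import List, Sequence


def box_blur(signal: Sequence[float], radius: int = 1) -> List[float]:
    """Simple low-pass filter: mean over a window clipped at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def fuse_scales(global_branch: Sequence[float],
                local_branch: Sequence[float],
                radius: int = 1) -> List[float]:
    """Low frequencies from the global branch plus high frequencies from the local branch."""
    if len(global_branch) != len(local_branch):
        raise ValueError("branches must have the same length")
    global_low = box_blur(global_branch, radius)
    local_low = box_blur(local_branch, radius)
    # High-frequency part of the local branch = signal minus its low-pass version.
    return [g + (x - xl) for g, x, xl in zip(global_low, local_branch, local_low)]


# Hypothetical toy inputs: a flat global structure and a locally detailed signal.
fused = fuse_scales([5.0, 5.0, 5.0, 5.0], [1.0, -1.0, 1.0, -1.0])
print(fused)
```

When both branches are identical, the low-pass terms cancel and the fusion returns the input unchanged (up to floating-point error), which is a quick sanity check on the decomposition.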

Title: Doe-1: Closed-Loop Autonomous Driving with Large World Model

Authors: Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.09627
Pdf URL: https://arxiv.org/pdf/2412.09627
Copy Paste: [[2412.09627]] Doe-1: Closed-Loop Autonomous Driving with Large World Model(https://arxiv.org/abs/2412.09627)
Keywords: generation
Abstract: End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: this https URL.
摘要：端到端自动驾驶因其从大量数据中学习的潜力而受到越来越多的关注。然而，大多数现有方法仍然是开环的，并且存在可扩展性弱、缺乏高阶交互和决策效率低下的问题。在本文中，我们探索了自动驾驶的闭环框架，并提出了一个大型驾驶世界模型 (Doe-1)，用于统一感知、预测和规划。我们将自动驾驶制定为下一个 token 生成问题，并使用多模态 token 来完成不同的任务。具体来说，我们使用自由格式的文本（即场景描述）进行感知，并使用图像 token 直接在 RGB 空间中生成未来预测。对于规划，我们使用位置感知 tokenizer 来有效地将动作编码为离散 token。我们训练一个多模态转换器，以端到端和统一的方式自回归生成感知、预测和规划 token。在广泛使用的 nuScenes 数据集上进行的实验证明了 Doe-1 在各种任务中的有效性，包括视觉问答、动作条件视频生成和运动规划。代码：此 https URL。
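
Note: the Doe-1 abstract says planning actions are encoded into discrete tokens by a position-aware tokenizer, without giving its details. The sketch below shows one common way to discretize a continuous displacement into token ids (uniform binning). It is an assumption for illustration, not the paper's tokenizer; `NUM_BINS`, `MAX_ABS`, and all function names are hypothetical.

```python
from typing import Tuple

NUM_BINS = 128   # hypothetical number of tokens per axis
MAX_ABS = 8.0    # hypothetical displacement range, [-8, 8] metres

BIN_WIDTH = 2 * MAX_ABS / NUM_BINS


def encode(value: float) -> int:
    """Map a continuous displacement to a bin index in [0, NUM_BINS - 1]."""
    clipped = max(-MAX_ABS, min(MAX_ABS, value))
    index = int((clipped + MAX_ABS) / BIN_WIDTH)
    return min(index, NUM_BINS - 1)  # the upper edge falls into the last bin


def decode(token: int) -> float:
    """Map a bin index back to the centre of its bin."""
    return -MAX_ABS + (token + 0.5) * BIN_WIDTH


def encode_action(dx: float, dy: float) -> Tuple[int, int]:
    """Tokenize a 2D displacement as one token per axis."""
    return encode(dx), encode(dy)


tokens = encode_action(1.3, -0.4)
recovered = tuple(decode(t) for t in tokens)
print(tokens, recovered)
```

With uniform bins, the round-trip error for in-range values is at most half a bin width, so the vocabulary size trades sequence-model complexity against planning precision.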