2024-12-02

Title: Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop

Authors: Zhaofang Qian, Abolfazl Sharifi, Tucker Carroll, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18644
Pdf URL: https://arxiv.org/pdf/2411.18644
Copy Paste: [[2411.18644]] Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop(https://arxiv.org/abs/2411.18644)
Keywords: generation
Abstract: Video generation has achieved impressive quality, but it still suffers from artifacts such as temporal inconsistency and violation of physical laws. Leveraging 3D scenes can fundamentally resolve these issues by providing precise control over scene entities. To facilitate the easy generation of diverse photorealistic scenes, we propose Scene Copilot, a framework combining large language models (LLMs) with a procedural 3D scene generator. Specifically, Scene Copilot consists of Scene Codex, BlenderGPT, and Human in the loop. Scene Codex is designed to translate textual user input into commands understandable by the 3D scene generator. BlenderGPT provides users with an intuitive and direct way to precisely control the generated 3D scene and the final output video. Furthermore, users can utilize Blender UI to receive instant visual feedback. Additionally, we have curated a procedural dataset of objects in code format to further enhance our system's capabilities. Each component works seamlessly together to support users in generating desired 3D scenes. Extensive experiments demonstrate the capability of our framework in customizing 3D scenes and video generation.
摘要：视频生成已达到令人印象深刻的质量，但仍然存在诸如时间不一致和违反物理定律等瑕疵。利用 3D 场景可以通过提供对场景实体的精确控制从根本上解决这些问题。为了便于轻松生成各种逼真的场景，我们提出了 Scene Copilot，这是一个将大型语言模型 (LLM) 与程序 3D 场景生成器相结合的框架。具体来说，Scene Copilot 由 Scene Codex、BlenderGPT 和 Human in the loop 组成。Scene Codex 旨在将文本用户输入转换为 3D 场景生成器可以理解的命令。BlenderGPT 为用户提供了一种直观直接的方式来精确控制生成的 3D 场景和最终输出视频。此外，用户可以利用 Blender UI 来接收即时视觉反馈。此外，我们还整理了一个代码格式的对象程序数据集，以进一步增强我们系统的功能。每个组件无缝协作，以支持用户生成所需的 3D 场景。大量实验证明了我们的框架在定制 3D 场景和视频生成方面的能力。

Title: RoMo: Robust Motion Segmentation Improves Structure from Motion

Authors: Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, Andrea Tagliasacchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18650
Pdf URL: https://arxiv.org/pdf/2411.18650
Copy Paste: [[2411.18650]] RoMo: Robust Motion Segmentation Improves Structure from Motion(https://arxiv.org/abs/2411.18650)
Keywords: generation
Abstract: There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure-from-motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera-calibration pipelines. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
摘要：从单目随意拍摄的视频重建和生成 4D 场景方面取得了长足的进步。虽然这些任务严重依赖于已知的相机姿势，但使用运动结构 (SfM) 查找此类姿势的问题通常取决于稳健地将视频的静态部分与动态部分分开。缺乏对这个问题的稳健解决方案限制了 SfM 相机校准管道的性能。我们提出了一种基于视频的运动分割的新方法，以识别相对于固定世界框架移动的场景组件。我们简单但有效的迭代方法 RoMo 将光流和极线线索与预先训练的视频分割模型相结合。它的表现优于运动分割的无监督基线以及从合成数据训练的监督基线。更重要的是，现成的 SfM 管道与我们的分割掩码相结合，为具有动态内容的场景建立了一种新的最先进的相机校准方法，其表现远远优于现有方法。

Title: AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

Authors: Haonan Han, Xiangzuo Wu, Huan Liao, Zunnan Xu, Zhongyuan Hu, Ronghui Li, Yachao Zhang, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18654
Pdf URL: https://arxiv.org/pdf/2411.18654
Copy Paste: [[2411.18654]] AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward(https://arxiv.org/abs/2411.18654)
Keywords: generation
Abstract: Recently, text-to-motion models have opened new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex relationship between textual prompts and desired motion outcomes. To address this, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging reward from GPT-4Vision. AToM comprises three main stages: Firstly, we construct a dataset MotionPrefer that pairs three types of event-level textual prompts with generated motions, which cover the integrity, temporal relationship and frequency of motion. Secondly, we design a paradigm that utilizes GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.
摘要：最近，文本到动作模型为以更高的效率和灵活性创建逼真的人体运动开辟了新的可能性。然而，由于文本提示和期望的运动结果之间存在复杂的关系，将动作生成与事件级文本描述对齐带来了独特的挑战。为了解决这个问题，我们引入了 AToM，这是一个通过利用 GPT-4Vision 的奖励来增强生成的动作和文本提示之间的对齐的框架。AToM 包括三个主要阶段：首先，我们构建一个数据集 MotionPrefer，将三种类型的事件级文本提示与生成的动作配对，涵盖动作的完整性、时间关系和频率。其次，我们设计了一个利用 GPT-4Vision 进行详细运动注释的范式，包括每个子任务的视觉数据格式、任务特定指令和评分规则。最后，我们使用以该范式为指导的强化学习对现有的文本到动作模型进行微调。实验结果表明，AToM 显著提高了文本到动作生成的事件级对齐质量。

Title: OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains

Authors: Yixuan Zhang, Hui Yang, Chuanchen Luo, Junran Peng, Yuxi Wang, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18660
Pdf URL: https://arxiv.org/pdf/2411.18660
Copy Paste: [[2411.18660]] OOD-HOI: Text-Driven 3D Whole-Body Human-Object Interactions Generation Beyond Training Domains(https://arxiv.org/abs/2411.18660)
Keywords: generation
Abstract: Generating realistic 3D human-object interactions (HOIs) from text descriptions is an active research topic with potential applications in virtual and augmented reality, robotics, and animation. However, creating high-quality 3D HOIs remains challenging due to the lack of large-scale interaction data and the difficulty of ensuring physical plausibility, especially in out-of-domain (OOD) scenarios. Current methods tend to focus either on the body or the hands, which limits their ability to produce cohesive and realistic interactions. In this paper, we propose OOD-HOI, a text-driven framework for generating whole-body human-object interactions that generalize well to new objects and actions. Our approach integrates a dual-branch reciprocal diffusion model to synthesize initial interaction poses, a contact-guided interaction refiner to improve physical accuracy based on predicted contact areas, and a dynamic adaptation mechanism, which includes semantic adjustment and geometry deformation, to improve robustness. Experimental results demonstrate that OOD-HOI generates more realistic and physically plausible 3D interaction poses in OOD scenarios than existing methods.
摘要：从文本描述生成逼真的 3D 人机交互 (HOI) 是一个活跃的研究课题，在虚拟和增强现实、机器人和动画领域具有潜在应用。然而，由于缺乏大规模交互数据和难以确保物理合理性，尤其是在域外 (OOD) 场景中，创建高质量的 3D HOI 仍然具有挑战性。当前的方法往往侧重于身体或手部，这限制了它们产生连贯而逼真的交互的能力。在本文中，我们提出了 OOD-HOI，这是一个文本驱动的框架，用于生成全身人机交互，可以很好地推广到新物体和动作。我们的方法集成了双分支互惠扩散模型来合成初始交互姿势、接触引导交互细化器以根据预测的接触面积提高物理准确性，以及动态自适应机制，其中包括语义调整和几何变形以提高鲁棒性。实验结果表明，与现有方法相比，我们的 OOD-HOI 可以在 OOD 场景中生成更逼真、物理上更合理的 3D 交互姿势。

Title: HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior

Authors: Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C.K. Chan, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18662
Pdf URL: https://arxiv.org/pdf/2411.18662
Copy Paste: [[2411.18662]] HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior(https://arxiv.org/abs/2411.18662)
Keywords: super-resolution
Abstract: Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion-based Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation-CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real-ISR scenarios through reduced prompt noise and enhanced spatial control.
摘要：文本到图像的扩散模型已成为现实世界图像超分辨率 (Real-ISR) 的强大先验。然而，现有方法可能会由于文本提示噪声大和缺乏空间信息而产生意想不到的结果。在本文中，我们提出了 HoliSDiP，这是一个利用语义分割为基于扩散的 Real-ISR 提供精确文本和空间指导的框架。我们的方法使用语义标签作为简洁的文本提示，同时通过分割掩码和我们提出的 Segmentation-CLIP Map 引入密集的语义指导。大量实验表明，HoliSDiP 通过降低提示噪声和增强空间控制，在各种 Real-ISR 场景中实现了图像质量的显着改善。

Title: Towards Chunk-Wise Generation for Long Videos

Authors: Siyang Zhang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18668
Pdf URL: https://arxiv.org/pdf/2411.18668
Copy Paste: [[2411.18668]] Towards Chunk-Wise Generation for Long Videos(https://arxiv.org/abs/2411.18668)
Keywords: generation, generative
Abstract: Generating long-duration videos has always been a significant challenge due to the inherent complexity of the spatio-temporal domain and the substantial GPU memory required to compute very large tensors. While diffusion-based generative models achieve state-of-the-art performance in video generation tasks, they are typically trained with predefined video resolutions and lengths. During inference, a noise tensor with a specific resolution and length must be specified first, and the model then denoises the entire video tensor at once, all frames together. Such an approach easily causes an out-of-memory (OOM) problem when the specified resolution and/or length exceeds a certain limit. One solution to this problem is to generate many short video chunks autoregressively with strong inter-chunk spatio-temporal relations and then concatenate them to form a long video. In this approach, a long video generation task is divided into multiple short video generation subtasks, and the cost of each subtask is reduced to a feasible level. In this paper, we conduct a detailed survey on long video generation with the autoregressive chunk-by-chunk strategy. We address common problems caused by applying short image-to-video models to long video tasks and design an efficient $k$-step search solution to mitigate these problems.
摘要：由于时空域的固有复杂性以及计算巨大张量所需的大量 GPU 内存需求，生成长时视频一直是一项重大挑战。虽然基于扩散的生成模型在视频生成任务中实现了最先进的性能，但它们通常是使用预定义的视频分辨率和长度进行训练的。在推理过程中，应首先指定具有特定分辨率和长度的噪声张量，然后模型将同时对整个视频张量（所有帧一起）进行去噪。当指定的分辨率和/或长度超过一定限制时，这种方法很容易引发内存不足 (OOM) 问题。解决该问题的方法之一是自回归地生成许多具有强块间时空关系的短视频块，然后将它们连接在一起形成长视频。在这种方法中，长视频生成任务被分成多个短视频生成子任务，并将每个子任务的成本降低到可行的水平。在本文中，我们对使用自回归逐块策略的长视频生成进行了详细调查。我们解决了将短图像到视频模型应用于长视频任务时引起的常见问题，并设计了一种高效的 $k$ 步搜索解决方案来缓解这些问题。
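The chunk-by-chunk strategy in the abstract above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `generate_chunk` is a hypothetical stand-in for an image-to-video model (here a 1-D random walk), `roughness` is an invented quality score, and picking the best of a few candidate seeds is only a loose analogue of a search step, not the paper's $k$-step search.

```python
import random


def generate_chunk(first_frame, length, seed):
    """Hypothetical stand-in for an image-to-video model.

    Returns `length` scalar "frames" starting at `first_frame`.
    """
    rng = random.Random(seed)
    frames = [first_frame]
    for _ in range(length - 1):
        frames.append(frames[-1] + rng.gauss(0.0, 1.0))
    return frames


def roughness(frames):
    """Toy quality score (an assumption): mean absolute frame-to-frame jump."""
    jumps = [abs(b - a) for a, b in zip(frames, frames[1:])]
    return sum(jumps) / len(jumps)


def generate_long_video(first_frame, num_chunks, chunk_len, num_candidates, seed=0):
    """Build a long sequence from short chunks, each conditioned on the last frame so far."""
    rng = random.Random(seed)
    video = [first_frame]
    for _ in range(num_chunks):
        # Only one short chunk is generated at a time, so the per-step cost
        # depends on chunk_len, not on the total video length.
        candidates = [
            generate_chunk(video[-1], chunk_len, rng.randrange(2**31))
            for _ in range(num_candidates)
        ]
        best = min(candidates, key=roughness)
        video.extend(best[1:])  # skip the duplicated conditioning frame
    return video


video = generate_long_video(0.0, num_chunks=4, chunk_len=8, num_candidates=3)
```

Each chunk contributes `chunk_len - 1` new frames because its first frame repeats the conditioning frame, so the toy call above yields 1 + 4 × 7 = 29 frames.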

Title: FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models

Authors: Alice Heiman, Xiaoman Zhang, Emma Chen, Sung Eun Kim, Pranav Rajpurkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18672
Pdf URL: https://arxiv.org/pdf/2411.18672
Copy Paste: [[2411.18672]] FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models(https://arxiv.org/abs/2411.18672)
Keywords: generation
Abstract: Medical vision-language models often struggle with generating accurate quantitative measurements in radiology reports, leading to hallucinations that undermine clinical reliability. We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. Specifically, FactCheXcker employs specialized modules and the code generation capabilities of large language models to solve measurement queries generated based on the original report. After extracting measurable findings, the results are incorporated into an updated report. We evaluate FactCheXcker on endotracheal tube placement, which accounts for an average of 78% of report measurements, using the MIMIC-CXR dataset and 11 medical report-generation models. Our results show that FactCheXcker significantly reduces hallucinations, improves measurement precision, and maintains the quality of the original reports. Specifically, FactCheXcker improves the performance of all 11 models and achieves an average improvement of 94.0% in reducing measurement hallucinations measured by mean absolute error.
摘要：医学视觉语言模型通常难以在放射学报告中生成准确的定量测量值，从而导致幻觉，损害临床可靠性。我们引入了 FactCheXcker，这是一个模块化框架，它通过利用改进的查询代码更新范式来消除放射学报告测量中的幻觉。具体来说，FactCheXcker 采用专门的模块和大型语言模型的代码生成功能来解决基于原始报告生成的测量查询。在提取可测量的发现后，结果将纳入更新的报告中。我们使用 MIMIC-CXR 数据集和 11 个医学报告生成模型评估了 FactCheXcker 在气管插管方面的效果，气管插管平均占报告测量值的 78%。我们的结果表明，FactCheXcker 显著减少了幻觉，提高了测量精度，并保持了原始报告的质量。具体来说，FactCheXcker 提高了所有 11 个模型的性能，在减少以平均绝对误差衡量的测量幻觉方面平均提高了 94.0%。

Title: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18673
Pdf URL: https://arxiv.org/pdf/2411.18673
Copy Paste: [[2411.18673]] AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers(https://arxiv.org/abs/2411.18673)
Keywords: generation, generative
Abstract: Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This led us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction of training parameters, improved training speed and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
摘要：最近，许多研究将 3D 摄像头控制集成到基础文本转视频模型中，但由此产生的摄像头控制通常不精确，并且视频生成质量会受到影响。在这项工作中，我们从第一原理的角度分析了摄像头运动，发现了一些见解，这些见解可以在不影响合成质量的情况下实现精确的 3D 摄像头操作。首先，我们确定视频中摄像头运动引起的运动本质上是低频的。这促使我们调整训练和测试姿势调节计划，加速训练收敛，同时提高视觉和运动质量。然后，通过探测无条件视频扩散变换器的表示，我们观察到它们在后台隐式执行摄像头姿势估计，并且只有一小部分层包含摄像头信息。这建议我们将摄像头调节的注入限制在架构的一个子集，以防止干扰其他视频功能，从而将训练参数减少 4 倍，提高训练速度，并将视觉质量提高 10%。最后，我们用一个精选的数据集补充了用于摄像头控制学习的典型数据集，该数据集包含 20K 个带有固定摄像头的多样化动态视频。这有助于模型消除摄像机和场景运动之间的差异，并改善生成的姿势调节视频的动态效果。我们将这些发现结合起来，设计了高级 3D 摄像机控制 (AC3D) 架构，这是用于生成具有摄像机控制的视频建模的全新先进模型。

Title: Active Data Curation Effectively Distills Large-Scale Multimodal Models

Authors: Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J. Hénaff
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18674
Pdf URL: https://arxiv.org/pdf/2411.18674
Copy Paste: [[2411.18674]] Active Data Curation Effectively Distills Large-Scale Multimodal Models(https://arxiv.org/abs/2411.18674)
Keywords: generative
Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find that such an active data curation strategy is in fact complementary to standard KD, and that the two can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
摘要：知识蒸馏 (KD) 是将大规模模型压缩为较小模型的事实标准。先前的研究已经探索了涉及不同目标函数、教师集成和权重继承的更复杂的 KD 策略。在这项工作中，我们探索了一种替代但简单的方法——主动数据管理作为对比多模态预训练的有效蒸馏。我们的简单在线批量选择方法 ACID 在各种模型、数据和计算配置中的表现都优于强大的 KD 基线。此外，我们发现这种主动数据管理策略实际上是对标准 KD 的补充，并且可以有效地结合起来训练高性能的推理效率模型。我们简单且可扩展的预训练框架 ACED 在 27 个零样本分类和检索任务中取得了最先进的结果，推理 FLOP 减少了 11%。我们进一步证明，我们的 ACED 模型可以生成强大的视觉编码器，用于在 LiT-Decoder 设置中训练生成多模态模型，其表现优于用于图像字幕和视觉问答任务的大型视觉编码器。

Title: MatchDiffusion: Training-free Generation of Match-cuts

Authors: Alejandro Pardo, Fabio Pizzati, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18677
Pdf URL: https://arxiv.org/pdf/2411.18677
Copy Paste: [[2411.18677]] MatchDiffusion: Training-free Generation of Match-cuts(https://arxiv.org/abs/2411.18677)
Keywords: generation
Abstract: Match-cuts are powerful cinematic tools that create seamless transitions between scenes, delivering strong visual and metaphorical connections. However, crafting match-cuts is a challenging, resource-intensive process requiring deliberate artistic planning. In MatchDiffusion, we present the first training-free method for match-cut generation using text-to-video diffusion models. MatchDiffusion leverages a key property of diffusion models: early denoising steps define the scene's broad structure, while later steps add details. Guided by this insight, MatchDiffusion employs "Joint Diffusion" to initialize generation for two prompts from shared noise, aligning structure and motion. It then applies "Disjoint Diffusion", allowing the videos to diverge and introduce unique details. This approach produces visually coherent videos suited for match-cuts. User studies and metrics demonstrate MatchDiffusion's effectiveness and potential to democratize match-cut creation.
摘要：匹配剪辑是一种强大的电影工具，可在场景之间创建无缝过渡，提供强大的视觉和隐喻联系。然而，制作匹配剪辑是一个具有挑战性、资源密集型的过程，需要深思熟虑的艺术规划。在 MatchDiffusion 中，我们展示了第一种使用文本到视频扩散模型生成匹配剪辑的无需训练的方法。MatchDiffusion 利用了扩散模型的一个关键属性：早期的去噪步骤定义了场景的大致结构，而后续步骤则添加了细节。在这一洞察的指导下，MatchDiffusion 采用“联合扩散”从共享噪声中初始化两个提示的生成，使结构和运动保持一致。然后它应用“不相交扩散”，允许视频发散并引入独特的细节。这种方法可以制作适合匹配剪辑的视觉连贯的视频。用户研究和指标证明了 MatchDiffusion 的有效性和使匹配剪辑创作民主化的潜力。
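The two-phase schedule described in the abstract above (a joint phase from shared noise, then a disjoint phase) can be sketched on toy vectors. This is only an illustration under stated assumptions: `denoise_step` is a hypothetical stand-in for a prompt-conditioned denoiser, the "prompts" are plain target vectors, and averaging the two updates during the joint phase is an assumed way of combining prompts that the abstract does not specify.

```python
import random


def denoise_step(x, target, rate=0.2):
    """Hypothetical prompt-conditioned denoising step: move x part-way toward target."""
    return [xi + rate * (ti - xi) for xi, ti in zip(x, target)]


def match_cut_sketch(target_a, target_b, total_steps, joint_steps, seed=0):
    """Run `joint_steps` shared steps, then let two samples diverge for the rest."""
    rng = random.Random(seed)
    shared = [rng.gauss(0.0, 1.0) for _ in target_a]  # shared initial noise

    # Joint phase: a single sample, updated with the average of both prompts'
    # steps (assumption), so early broad structure is common to both outputs.
    for _ in range(joint_steps):
        step_a = denoise_step(shared, target_a)
        step_b = denoise_step(shared, target_b)
        shared = [(a + b) / 2 for a, b in zip(step_a, step_b)]

    # Disjoint phase: each sample follows only its own prompt and adds its own details.
    xa, xb = list(shared), list(shared)
    for _ in range(total_steps - joint_steps):
        xa = denoise_step(xa, target_a)
        xb = denoise_step(xb, target_b)
    return shared, xa, xb


shared, xa, xb = match_cut_sketch([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], total_steps=10, joint_steps=4)
```

In this toy, `joint_steps` controls the trade-off: a larger value keeps the two outputs closer to each other, while a smaller value lets each one move further toward its own target.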

Title: Random Walks with Tweedie: A Unified Framework for Diffusion Models

Authors: Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.18702
Pdf URL: https://arxiv.org/pdf/2411.18702
Copy Paste: [[2411.18702]] Random Walks with Tweedie: A Unified Framework for Diffusion Models(https://arxiv.org/abs/2411.18702)
Keywords: generative
Abstract: We present a simple template for designing generative diffusion model algorithms based on an interpretation of diffusion sampling as a sequence of random walks. Score-based diffusion models are widely used to generate high-quality images. Diffusion models have also been shown to yield state-of-the-art performance in many inverse problems. While these algorithms are often surprisingly simple, the theory behind them is not, and multiple complex theoretical justifications exist in the literature. Here, we provide a simple and largely self-contained theoretical justification for score-based-diffusion models that avoids using the theory of Markov chains or reverse diffusion, instead centering the theory of random walks and Tweedie's formula. This approach leads to unified algorithmic templates for network training and sampling. In particular, these templates cleanly separate training from sampling, e.g., the noise schedule used during training need not match the one used during sampling. We show that several existing diffusion models correspond to particular choices within this template and demonstrate that other, more straightforward algorithmic choices lead to effective diffusion models. The proposed framework has the added benefit of enabling conditional sampling without any likelihood approximation.
摘要：我们提出了一个简单的模板，用于设计生成扩散模型算法，该模板基于将扩散采样解释为一系列随机游动。基于分数的扩散模型被广泛用于生成高质量图像。扩散模型还已被证明在许多逆问题中产生最先进的性能。虽然这些算法通常非常简单，但其背后的理论却并非如此，文献中存在多种复杂的理论依据。在这里，我们为基于分数的扩散模型提供了一个简单且基本独立的理论依据，避免使用马尔可夫链或反向扩散理论，而是以随机游动理论和 Tweedie 公式为中心。这种方法为网络训练和采样提供了统一的算法模板。特别是，这些模板将训练与采样完全分开，例如，训练期间使用的噪声计划不必与采样期间使用的噪声计划相匹配。我们表明，现有的几种扩散模型对应于此模板中的特定选择，并证明其他更直接的算法选择可以产生有效的扩散模型。所提出的框架还有一个额外的好处，就是能够进行条件采样，而无需任何似然近似。
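For reference, the two ingredients named in the abstract above can be written out. Tweedie's formula for Gaussian noise is standard; the random-walk update shown after it is a generic Langevin-type step given here as an assumption, and the paper's exact template (step sizes, noise scaling) may differ. The symbols $D_\sigma$, $\tau_k$, and $z_k$ are notation introduced for this note.

```latex
% Noisy observation: x_\sigma = x + \sigma n, with n \sim \mathcal{N}(0, I) and x \sim p.
\begin{align}
  % Tweedie's formula: the posterior mean is the noisy point plus a score correction.
  \mathbb{E}[x \mid x_\sigma] &= x_\sigma + \sigma^2 \nabla_{x_\sigma} \log p_\sigma(x_\sigma) \\
  % Hence a denoiser D_\sigma(x_\sigma) \approx \mathbb{E}[x \mid x_\sigma] yields a score estimate.
  \nabla_{x_\sigma} \log p_\sigma(x_\sigma) &\approx \frac{D_\sigma(x_\sigma) - x_\sigma}{\sigma^2} \\
  % Generic noisy random-walk step (assumed form) with step size \tau_k > 0.
  x_{k+1} &= x_k + \tau_k \nabla \log p_{\sigma_k}(x_k) + \sqrt{2 \tau_k}\, z_k,
  \qquad z_k \sim \mathcal{N}(0, I)
\end{align}
```

Under this reading, training only has to produce the denoiser $D_\sigma$, while the sequences $\sigma_k$ and $\tau_k$ are chosen at sampling time, which is consistent with the abstract's point that the training noise schedule need not match the sampling one.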

Title: Generative Visual Communication in the Era of Vision-Language Models

Authors: Yael Vinker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.18727
Pdf URL: https://arxiv.org/pdf/2411.18727
Copy Paste: [[2411.18727]] Generative Visual Communication in the Era of Vision-Language Models(https://arxiv.org/abs/2411.18727)
Keywords: generative
Abstract: Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.
摘要：视觉传达可以追溯到史前的洞穴壁画，它使用视觉元素来传达思想和信息。在当今这个视觉饱和的世界中，有效的设计需要了解平面设计原则、视觉叙事、人类心理学，并具备将复杂信息提炼为清晰视觉效果的能力。本论文探讨了如何利用视觉语言模型 (VLM) 的最新进展来自动创建有效的视觉传达设计。尽管生成模型在从文本生成图像方面取得了巨大进步，但它们仍然难以将复杂的想法简化为清晰、抽象的视觉效果，并且受到基于像素的输出的限制，这在许多设计任务中缺乏灵活性。为了应对这些挑战，我们限制了模型的操作空间并引入了特定于任务的正则化。我们探索视觉传达的各个方面，即草图和视觉抽象、排版、动画和视觉灵感。

Title: DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration

Authors: Zheyan Zhang, Diego Klabjan, Renee CB Manworren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18745
Pdf URL: https://arxiv.org/pdf/2411.18745
Copy Paste: [[2411.18745]] DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration(https://arxiv.org/abs/2411.18745)
Keywords: restoration
Abstract: In this work, we address a challenge in video inpainting: reconstructing occluded regions in dynamic, real-world scenarios. Motivated by the need for continuous human motion monitoring in healthcare settings, where facial features are frequently obscured, we propose a diffusion-based video-level inpainting model, DiffMVR. Our approach introduces a dynamic dual-guided image prompting system, leveraging adaptive reference frames to guide the inpainting process. This enables the model to capture both fine-grained details and smooth transitions between video frames, offering precise control over inpainting direction and significantly improving restoration accuracy in challenging, dynamic environments. DiffMVR represents a significant advancement in the field of diffusion-based inpainting, with practical implications for real-time applications in various dynamic settings.
摘要：在这项工作中，我们解决了视频修复中的一项挑战：在动态的现实场景中重建被遮挡的区域。由于医疗环境中需要持续监测人体运动，而面部特征经常被遮挡，因此我们提出了一种基于扩散的视频级修复模型 DiffMVR。我们的方法引入了一种动态双引导图像提示系统，利用自适应参考帧来指导修复过程。这使模型能够捕捉精细的细节和视频帧之间的平滑过渡，提供对修复方向的精确控制，并显著提高具有挑战性的动态环境中的恢复精度。DiffMVR 代表了基于扩散的修复领域的重大进步，对各种动态设置中的实时应用具有实际意义。

Title: Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18810
Pdf URL: https://arxiv.org/pdf/2411.18810
Copy Paste: [[2411.18810]] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds(https://arxiv.org/abs/2411.18810)
Keywords: generation
Abstract: Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
摘要：文本到图像的扩散模型已经展示了从任意文本提示生成逼真图像的卓越能力。然而，它们通常会对诸如“两只狗”或“碗右边的企鹅”之类的构图提示产生不一致的结果。了解这些不一致对于可靠的图像生成至关重要。在本文中，我们强调了初始噪声在这些不一致中的重要作用，其中某些噪声模式对于构图提示比其他模式更可靠。我们的分析表明，不同的初始随机种子倾向于引导模型将对象放置在不同的图像区域中，可能遵循与种子相关的特定相机角度和图像构图模式。为了提高模型的构图能力，我们提出了一种挖掘这些可靠案例的方法，从而生成一组精心挑选的生成图像训练集，而无需任何手动注释。通过对这些生成的图像微调文本到图像模型，我们显著增强了它们的构图能力。对于数值构图，我们观察到稳定扩散和 PixArt-{\alpha} 的相对增长分别为 29.3% 和 19.5%。空间构图的增益更大，Stable Diffusion 增益为 60.7%，PixArt-{\alpha} 增益为 21.1%。

Title: FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

Authors: Junyang Chen, Jinshan Pan, Jiangxin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18824
Pdf URL: https://arxiv.org/pdf/2411.18824
Copy Paste: [[2411.18824]] FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution(https://arxiv.org/abs/2411.18824)
Keywords: super-resolution, generation
Abstract: Faithful image super-resolution (SR) not only needs to recover images that appear realistic, similar to image generation tasks, but also requires that the restored images maintain fidelity and structural consistency with the input. To this end, we propose a simple and effective method, named FaithDiff, to fully harness the impressive power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures. As there exists a significant gap between the features of degraded inputs and the noisy latent from the diffusion model, we then develop an effective alignment module to explore useful features from degraded inputs that align well with the diffusion process. Considering the indispensable roles and interplay of the encoder and diffusion model in LDMs, we jointly fine-tune them in a unified optimization framework, helping the encoder extract useful features that coincide with the diffusion process. Extensive experimental results demonstrate that FaithDiff outperforms state-of-the-art methods, providing high-quality and faithful SR results.
摘要：忠实图像超分辨率 (SR) 不仅需要恢复看起来逼真的图像，类似于图像生成任务，还要求恢复的图像保持与输入的保真度和结构一致性。为此，我们提出了一种简单有效的方法，称为 FaithDiff，以充分利用潜在扩散模型 (LDM) 的强大功能来实现忠实图像 SR。与现有的基于扩散的 SR 方法（冻结在高质量图像上预先训练的扩散模型）不同，我们建议先释放扩散以识别有用信息并恢复忠实结构。由于退化输入的特征与扩散模型中的噪声潜在特征之间存在显着差距，因此我们开发了一个有效的对齐模块来探索退化输入中的有用特征，以便与扩散过程很好地对齐。考虑到编码器和扩散模型在 LDM 中不可或缺的作用和相互作用，我们在统一的优化框架中共同对它们进行微调，以方便编码器提取与扩散过程相吻合的有用特征。大量实验结果表明，FaithDiff 优于最先进的方法，可提供高质量且忠实的 SR 结果。

Title: CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction

Authors: Lipeng Gu, Xuefeng Yan, Weiming Wang, Honghua Chen, Dingkun Zhu, Liangliang Nan, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18850
Pdf URL: https://arxiv.org/pdf/2411.18850
Copy Paste: [[2411.18850]] CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction(https://arxiv.org/abs/2411.18850)
Keywords: generation
Abstract: The fusion of camera- and LiDAR-based detections offers a promising solution to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly exploit camera detections to correct tracking failures caused by potential LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections using LiDAR data. This limitation is rooted in their single-stage architecture, akin to single-stage object detectors, lacking a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, initially generating coarse trajectories and subsequently refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling (M^3) module that, by fusing multi-modal information (images, point clouds, and even plane geometry extracted from images), provides a robust metric for subsequent trajectory generation; ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories; and iii) a trajectory refinement (TR) module that refines coarse trajectories through cross correction between camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of our CrossTracker over its eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.
摘要：基于摄像头和激光雷达的检测融合为缓解 3D 多目标跟踪 (MOT) 中的跟踪失败提供了一种有前途的解决方案。然而，现有方法主要利用摄像头检测来纠正由潜在激光雷达检测问题导致的跟踪失败，而忽略了使用激光雷达数据细化摄像头检测的相互好处。这种限制源于它们的单阶段架构，类似于单阶段物体检测器，缺乏专用的轨迹细化模块来充分利用互补的多模态信息。为此，我们引入了 CrossTracker，一种用于在线多模态 3D MOT 的新型两阶段范式。CrossTracker 以由粗到精的方式运行，最初生成粗轨迹，然后通过独立的细化过程对其进行细化。具体来说，CrossTracker 包含三个基本模块：i）多模态建模 (M^3) 模块，通过融合多模态信息（图像、点云甚至从图像中提取的平面几何），为后续轨迹生成提供稳健的度量。ii）粗轨迹生成 (C-TG) 模块，用于生成初始粗双流轨迹；iii）轨迹细化 (TR) 模块，通过摄像头和激光雷达流之间的交叉校正来细化粗轨迹。全面的实验证明了我们的 CrossTracker 优于其 18 个竞争对手的性能，强调了它在利用摄像头和激光雷达传感器的协同优势实现稳健的多模态 3D MOT 方面的有效性。

Title: RIGI: Rectifying Image-to-3D Generation Inconsistency via Uncertainty-aware Learning

Authors: Jiacheng Wang, Zhedong Zheng, Wei Xu, Ping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18866
Pdf URL: https://arxiv.org/pdf/2411.18866
Copy Paste: [[2411.18866]] RIGI: Rectifying Image-to-3D Generation Inconsistency via Uncertainty-aware Learning(https://arxiv.org/abs/2411.18866)
Keywords: generation
Abstract: Given a single image of a target object, image-to-3D generation aims to reconstruct its texture and geometric shape. Recent methods often utilize intermediate media, such as multi-view images or videos, to bridge the gap between the input image and the 3D target, thereby guiding the generation of both shape and texture. However, inconsistencies in the generated multi-view snapshots frequently introduce noise and artifacts along object boundaries, undermining the 3D reconstruction process. To address this challenge, we leverage 3D Gaussian Splatting (3DGS) for 3D reconstruction, and explicitly integrate uncertainty-aware learning into the reconstruction process. By capturing the stochasticity between two Gaussian models, we estimate an uncertainty map, which is subsequently used for uncertainty-aware regularization to rectify the impact of inconsistencies. Specifically, we optimize both Gaussian models simultaneously, calculating the uncertainty map by evaluating the discrepancies between rendered images from identical viewpoints. Based on the uncertainty map, we apply adaptive pixel-wise loss weighting to regularize the models, reducing reconstruction intensity in high-uncertainty regions. This approach dynamically detects and mitigates conflicts in multi-view labels, leading to smoother results and effectively reducing artifacts. Extensive experiments show the effectiveness of our method in improving 3D generation quality by reducing inconsistencies and artifacts.
摘要：给定目标对象的单张图像，图像到 3D 生成旨在重建其纹理和几何形状。最近的方法通常利用中间媒体（例如多视图图像或视频）来弥合输入图像和 3D 目标之间的差距，从而指导形状和纹理的生成。然而，生成的多视图快照中的不一致性经常会在对象边界上引入噪声和伪影，从而破坏 3D 重建过程。为了应对这一挑战，我们利用 3D 高斯分层 (3DGS) 进行 3D 重建，并将不确定性感知学习明确地集成到重建过程中。通过捕获两个高斯模型之间的随机性，我们估计一个不确定性图，随后将其用于不确定性感知正则化以纠正不一致的影响。具体来说，我们同时优化两个高斯模型，通过评估从相同视点渲染的图像之间的差异来计算不确定性图。基于不确定性图，我们应用自适应像素级损失权重来规范模型，从而降低高不确定性区域的重建强度。这种方法可以动态检测和缓解多视图标签中的冲突，从而获得更平滑的结果并有效减少伪影。大量实验表明，我们的方法通过减少不一致和伪影来提高 3D 生成质量是有效的。
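
The uncertainty-aware weighting described in the RIGI abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that the uncertainty map comes from discrepancies between two renders of the same viewpoint and that it drives pixel-wise loss weighting. The absolute-difference discrepancy, the exp(-u / temperature) weight, and the names `uncertainty_map`, `weighted_l1_loss`, and `temperature` are all assumptions made here for illustration, with images represented as nested lists of grayscale floats.

```python
import math


def uncertainty_map(render_a, render_b):
    """Per-pixel absolute discrepancy between two renders of the same viewpoint.

    Assumption: the abstract does not specify the discrepancy measure.
    """
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(render_a, render_b)]


def weighted_l1_loss(render, target, uncertainty, temperature=1.0):
    """Weighted mean L1 error; high-uncertainty pixels get smaller weights.

    Assumption: the weight exp(-u / temperature) is a hypothetical choice.
    """
    total, weight_sum = 0.0, 0.0
    for r_row, t_row, u_row in zip(render, target, uncertainty):
        for r, t, u in zip(r_row, t_row, u_row):
            w = math.exp(-u / temperature)
            total += w * abs(r - t)
            weight_sum += w
    return total / weight_sum


# Two toy 1x2 "renders" that disagree only on the second pixel.
render_a = [[0.0, 1.0]]
render_b = [[0.0, 0.0]]
u = uncertainty_map(render_a, render_b)
loss = weighted_l1_loss(render_a, [[1.0, 1.0]], u)
```

In this toy case the second pixel, where the two models disagree, receives weight exp(-1) while the first keeps weight 1, so the disputed pixel contributes less to the regularized loss.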

Title: T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving

Authors: Changsheng Lv, Mengshi Qi, Liang Liu, Huadong Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18894
Pdf URL: https://arxiv.org/pdf/2411.18894
Copy Paste: [[2411.18894]] T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving(https://arxiv.org/abs/2411.18894)
Keywords: generation
Abstract: Understanding traffic scenes and then generating high-definition (HD) maps present significant challenges in autonomous driving. In this paper, we define a novel Traffic Topology Scene Graph (T2SG), a unified scene graph explicitly modeling the lanes, controlled and guided by different road signals (e.g., right turn), and the topology relationships among them, which are always ignored by previous HD mapping methods. For the generation of T2SG, we propose TopoFormer, a novel one-stage Topology Scene Graph TransFormer with two newly designed layers. Specifically, TopoFormer incorporates a Lane Aggregation Layer (LAL) that leverages the geometric distance among the centerlines of lanes to guide the aggregation of global information. Furthermore, we propose a Counterfactual Intervention Layer (CIL) to model the reasonable road structure (e.g., intersection, straight) among lanes under counterfactual intervention. The generated T2SG can then provide a more accurate and explainable description of the topological structure in traffic scenes. Experimental results demonstrate that TopoFormer outperforms existing methods on the T2SG generation task, and the generated T2SG significantly enhances traffic topology reasoning in downstream tasks, achieving a state-of-the-art performance of 46.3 OLS on the OpenLane-V2 benchmark. We will release our source code and model.
摘要：理解交通场景并生成高清地图是自动驾驶面临的重大挑战。在本文中，我们定义了一种新颖的交通拓扑场景图，这是一个统一的场景图，明确模拟车道、由不同的道路信号（例如右转）控制和引导以及它们之间的拓扑关系，而以前的高清地图绘制方法始终忽略了这些关系。为了生成 T2SG，我们提出了 TopoFormer，这是一种新颖的单阶段拓扑场景图转换器，具有两个新设计的层。具体而言，TopoFormer 结合了车道聚合层 (LAL)，该层利用车道中心线之间的几何距离来指导全局信息的聚合。此外，我们提出了一个反事实干预层 (CIL) 来在反事实干预下模拟车道间的合理道路结构（例如交叉路口、直道）。然后，生成的 T2SG 可以提供交通场景中拓扑结构的更准确和可解释的描述。实验结果表明，TopoFormer 在 T2SG 生成任务上的表现优于现有方法，生成的 T2SG 显著增强了下游任务中的交通拓扑推理能力，在 OpenLane-V2 基准上取得了 46.3 OLS 的最佳性能。我们将发布源代码和模型。

Title: Textured As-Is BIM via GIS-informed Point Cloud Segmentation

Authors: Mohamed S. H. Alabassy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18898
Pdf URL: https://arxiv.org/pdf/2411.18898
Copy Paste: [[2411.18898]] Textured As-Is BIM via GIS-informed Point Cloud Segmentation(https://arxiv.org/abs/2411.18898)
Keywords: generation
Abstract: Creating as-is models from scratch is to this day still a time- and money-consuming task due to the high manual effort involved. Therefore, projects, especially those with a large spatial extent, could profit from automating the process of creating semantically rich 3D geometries from surveying data such as Point Cloud Data (PCD). Automation can be achieved by using Machine and Deep Learning Models for object recognition and semantic segmentation of PCD. As PCDs do not usually include more than the mere position and RGB colour values of points, tapping into semantically enriched Geoinformation System (GIS) data can enhance the process of creating meaningful as-is models. This paper presents a methodology, an implementation framework and a proof of concept for the automated generation of GIS-informed and BIM-ready as-is Building Information Models (BIM) for railway projects. The results show a high potential for cost savings and reveal the untapped resources of freely accessible GIS data.
摘要：时至今日，从头开始创建原样模型仍然是一项耗时耗力的任务，因为需要大量的人工。因此，项目，尤其是那些具有大空间范围的项目，可以从自动化从测量数据（如点云数据 (PCD)）创建语义丰富的 3D 几何图形的过程中获益。可以通过使用机器和深度学习模型对 PCD 进行对象识别和语义分割来实现自动化。由于 PCD 通常只包含点的位置和 RGB 颜色值，因此可以利用语义丰富的地理信息系统 (GIS) 数据来增强创建有意义的原样模型的过程。本文介绍了一种自动生成铁路项目 GIS 信息和 BIM 就绪的原样建筑信息模型 (BIM) 的方法、实施框架和概念证明。结果显示成本节约潜力很大，并揭示了其中可免费访问的 GIS 数据的未利用资源。

Title: Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?

Authors: Adrian Tormos, Blanca Llauradó, Fernando Núñez, Axel Romero, Dario Garcia-Gasulla, Javier Béjar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.18926
Pdf URL: https://arxiv.org/pdf/2411.18926
Copy Paste: [[2411.18926]] Data Augmentation with Diffusion Models for Colon Polyp Localization on the Low Data Regime: How much real data is enough?(https://arxiv.org/abs/2411.18926)
Keywords: generative
Abstract: The scarcity of data in medical domains hinders the performance of Deep Learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that are not guaranteed to preserve the original tasks. Approximating the distribution of the data using generative models is a way of reducing that problem and also of obtaining new samples that resemble the original data. Denoising Diffusion models are a promising Deep Learning technique that can learn good approximations of different kinds of data like images, time series or tabular data. Automatic colonoscopy analysis, and specifically polyp localization in colonoscopy videos, is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is a time-consuming task, and usually only small datasets can be obtained. Fine-tuning application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can jointly generate colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used in various transfer learning experiments on the task of polyp localization with a model based on YOLO v9 in the low data regime.
摘要：医学领域的数据稀缺阻碍了深度学习模型的性能。数据增强技术可以缓解这一问题，但它们通常依赖于数据的功能转换，而这些转换不能保证保留原始任务。使用生成模型来近似数据分布是一种减少该问题的方法，也可以获得与原始数据相似的新样本。去噪扩散模型是一种很有前途的深度学习技术，可以学习不同类型数据（如图像、时间序列或表格数据）的良好近似值。自动结肠镜检查分析，特别是结肠镜检查视频中的息肉定位是一项可以协助临床诊断和治疗的任务。用于训练深度学习模型的视频帧注释是一项耗时的任务，通常只能获得较小的数据集。使用大量生成的数据集对应用模型进行微调可以成为提高其性能的替代方法。我们进行了一组实验，训练不同的扩散模型，这些模型可以使用现有开放数据集的组合来联合生成带有定位注释的结肠镜检查图像。生成的数据用于息肉定位任务中的各种迁移学习实验，其中模型基于低数据条件下的 YOLO v9。

Title: VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

Authors: Sakshi Agarwal, Gabe Hoope, Erik B. Sudderth
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18929
Pdf URL: https://arxiv.org/pdf/2411.18929
Copy Paste: [[2411.18929]] VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference(https://arxiv.org/abs/2411.18929)
Keywords: generative
Abstract: Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not simple to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and superresolution.
摘要：扩散概率模型学习去除在训练期间人为添加到数据的噪声。然后可以通过一系列去噪操作从高斯噪声中生成新数据（如图像）。虽然此马尔可夫过程隐式定义了无噪声数据的联合分布，但要对屏蔽或部分图像进行生成过程的条件调整并不简单。已经提出了许多启发式采样程序来解决具有扩散先验的逆问题，但这些方法并不直接近似推理查询所施加的真实条件分布，并且通常对较大的屏蔽区域无效。此外，许多这些基线不能应用于使用图像编码来提高效率的潜在扩散模型。我们改为开发一种分层变分推理算法，该算法可以分析边缘化缺失特征，并使用严格的变分界限来优化真实扩散后验的非高斯马尔可夫近似。通过对基于像素和潜在扩散图像模型进行大量实验，我们表明我们的 VIPaint 方法在插补的合理性和多样性方面明显优于以前的方法，并且很容易推广到去模糊和超分辨率等其他逆问题。

Title: ICLERB: In-Context Learning Embedding and Reranker Benchmark

Authors: Marie Al Ghossein, Emile Contal, Alexandre Robicquet
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2411.18947
Pdf URL: https://arxiv.org/pdf/2411.18947
Copy Paste: [[2411.18947]] ICLERB: In-Context Learning Embedding and Reranker Benchmark(https://arxiv.org/abs/2411.18947)
Keywords: generation
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform new tasks by conditioning on prompts with relevant information. Retrieval-Augmented Generation (RAG) enhances ICL by incorporating retrieved documents into the LLM's context at query time. However, traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. In this paper, we propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks. We introduce the In-Context Learning Embedding and Reranker Benchmark (ICLERB), a novel evaluation framework that compares retrievers based on their ability to enhance LLM accuracy in ICL settings. Additionally, we propose a novel Reinforcement Learning-to-Rank from AI Feedback (RLRAIF) algorithm, designed to fine-tune retrieval models using minimal feedback from the LLM. Our experimental results reveal notable differences between ICLERB and existing benchmarks, and demonstrate that small models fine-tuned with our RLRAIF algorithm outperform large state-of-the-art retrieval models. These findings highlight the limitations of existing evaluation methods and the need for specialized benchmarks and training strategies adapted to ICL.
摘要：上下文学习 (ICL) 使大型语言模型 (LLM) 能够通过调节具有相关信息的提示来执行新任务。检索增强生成 (RAG) 通过在查询时将检索到的文档合并到 LLM 的上下文中来增强 ICL。但是，传统的检索方法侧重于语义相关性，将检索视为搜索问题。在本文中，我们建议将 ICL 的检索重新定义为推荐问题，旨在选择在 ICL 任务中效用最大化的文档。我们引入了上下文学习嵌入和重新排序基准 (ICLERB)，这是一种新颖的评估框架，它根据检索器在 ICL 设置中增强 LLM 准确性的能力对其进行比较。此外，我们提出了一种新颖的强化学习从 AI 反馈进行排名 (RLRAIF) 算法，旨在使用来自 LLM 的最少反馈来微调检索模型。我们的实验结果揭示了 ICLERB 与现有基准之间的显著差异，并表明使用我们的 RLRAIF 算法进行微调的小型模型优于大型最先进的检索模型。这些发现凸显了现有评估方法的局限性以及对适用于 ICL 的专门基准和训练策略的需求。

Title: Random Sampling for Diffusion-based Adversarial Purification

Authors: Jiancheng Zhang, Peiran Dong, Yongyong Chen, Yin-Ping Zhao, Song Guo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.18956
Pdf URL: https://arxiv.org/pdf/2411.18956
Copy Paste: [[2411.18956]] Random Sampling for Diffusion-based Adversarial Purification(https://arxiv.org/abs/2411.18956)
Keywords: generation
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have gained great attention in adversarial purification. Current diffusion-based works focus on designing effective condition-guided mechanisms while ignoring a fundamental problem, i.e., the original DDPM sampling is intended for stable generation, which may not be the optimal solution for adversarial purification. Inspired by the stability of the Denoising Diffusion Implicit Model (DDIM), we propose an opposite sampling scheme called random sampling. In brief, random sampling will sample from a random noisy space during each diffusion process, while DDPM and DDIM sampling will continuously sample from the adjacent or original noisy space. Thus, random sampling obtains more randomness and achieves stronger robustness against adversarial attacks. Correspondingly, we also introduce a novel mediator conditional guidance to guarantee the consistency of the prediction under the purified image and clean image input. To expand awareness of guided diffusion purification, we conduct a detailed evaluation with different sampling methods and our random sampling achieves an impressive improvement in multiple settings. Leveraging mediator-guided random sampling, we also establish a baseline method named DiffAP, which significantly outperforms state-of-the-art (SOTA) approaches in performance and defensive stability. Remarkably, under strong attack, our DiffAP even achieves a more than 20% robustness advantage with 10× sampling acceleration.
摘要：去噪扩散概率模型 (DDPM) 在对抗性净化中获得了极大的关注。当前基于扩散的工作侧重于设计有效的条件引导机制，而忽略了一个基本问题，即原始的 DDPM 采样旨在实现稳定生成，这可能不是对抗性净化的最佳解决方案。受去噪扩散隐式模型 (DDIM) 稳定性的启发，我们提出了一种称为随机采样的相反采样方案。简而言之，随机采样将在每个扩散过程中从随机噪声空间中采样，而 DDPM 和 DDIM 采样将连续从相邻或原始噪声空间中采样。因此，随机采样获得了更多的随机性，并实现了更强的对抗性攻击鲁棒性。相应地，我们还引入了一种新颖的中介条件指导，以保证在净化图像和干净图像输入下的预测的一致性。为了扩大对引导扩散净化的认识，我们对不同的采样方法进行了详细的评估，我们的随机采样在多种设置中取得了令人瞩目的改进。利用中介引导的随机采样，我们还建立了一种名为 DiffAP 的基线方法，其性能和防御稳定性明显优于最先进的 (SOTA) 方法。值得注意的是，在强攻击下，我们的 DiffAP 甚至在 10$\times$ 采样加速下实现了超过 20% 的稳健性优势。

Title: SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

Authors: Rong-Cheng Tu, Wenhao Sun, Zhao Jin, Jingyi Liao, Jiaxing Huang, Dacheng Tao
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2411.18983
Pdf URL: https://arxiv.org/pdf/2411.18983
Copy Paste: [[2411.18983]] SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing(https://arxiv.org/abs/2411.18983)
Keywords: generation, generative
Abstract: While open-source video generation and editing models have made significant progress, individual models are typically limited to specific tasks, failing to meet the diverse needs of users. Effectively coordinating these models can unlock a wide range of video generation and editing capabilities. However, manual coordination is complex and time-consuming, requiring users to deeply understand task requirements and possess comprehensive knowledge of each model's performance, applicability, and limitations, thereby increasing the barrier to entry. To address these challenges, we propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models, enhancing the adaptability, efficiency, and overall quality of video generation and editing. Specifically, the SPAgent assembles a tool library integrating state-of-the-art open-source image and video generation and editing models as tools. After fine-tuning on our manually annotated dataset, SPAgent can automatically coordinate the tools for video generation and editing through our newly designed three-step framework: (1) decoupled intent recognition, (2) principle-guided route planning, and (3) capability-based execution model selection. Additionally, we enhance the SPAgent's video quality evaluation capability, enabling it to autonomously assess and incorporate new video generation and editing models into its tool library without human intervention. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos, highlighting its versatility and adaptability across various video tasks.
摘要：虽然开源视频生成和编辑模型取得了重大进展，但单个模型通常局限于特定任务，无法满足用户的多样化需求。有效地协调这些模型可以释放广泛的视频生成和编辑功能。然而，手动协调既复杂又耗时，需要用户深入了解任务要求，并全面了解每个模型的性能、适用性和局限性，从而增加了进入门槛。为了应对这些挑战，我们提出了一种由语义规划代理 (SPAgent) 驱动的新型视频生成和编辑系统。SPAgent 弥合了不同用户意图与现有生成模型的有效利用之间的差距，提高了视频生成和编辑的适应性、效率和整体质量。具体来说，SPAgent 组装了一个工具库，将最先进的开源图像和视频生成和编辑模型集成为工具。在我们手动注释的数据集上进行微调后，SPAgent 可以通过我们新设计的三步框架自动协调视频生成和编辑工具：（1）解耦意图识别、（2）原则指导的路线规划和（3）基于能力的执行模型选择。此外，我们还增强了 SPAgent 的视频质量评估能力，使其能够自主评估并将新的视频生成和编辑模型纳入其工具库，而无需人工干预。实验结果表明，SPAgent 可以有效协调模型来生成或编辑视频，凸显了其在各种视频任务中的多功能性和适应性。

Title: Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement

Authors: Muhammad Umer Ramzan, Ali Zia, Abdelwahed Khamis, Ayman Elgharabawy, Ahmad Liaqat, Usman Ali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19005
Pdf URL: https://arxiv.org/pdf/2411.19005
Copy Paste: [[2411.19005]] Locally-Focused Face Representation for Sketch-to-Image Generation Using Noise-Induced Refinement(https://arxiv.org/abs/2411.19005)
Keywords: generation, generative
Abstract: This paper presents a novel deep-learning framework that significantly enhances the transformation of rudimentary face sketches into high-fidelity colour images. Employing a Convolutional Block Attention-based Auto-encoder Network (CA2N), our approach effectively captures and enhances critical facial features through a block attention mechanism within an encoder-decoder architecture. Subsequently, the framework utilises a noise-induced conditional Generative Adversarial Network (cGAN) process that allows the system to maintain high performance even on domains unseen during training. These enhancements lead to considerable improvements in image realism and fidelity, with our model outperforming the best prior method by FID margins of 17, 23, and 38 on the CelebAMask-HQ, CUHK, and CUFSF datasets, respectively. The model sets a new state of the art in sketch-to-image generation, can generalize across sketch types, and offers a robust solution for applications such as criminal identification in law enforcement.
摘要：本文介绍了一种新颖的深度学习框架，可显著增强将基本人脸素描转换为高保真彩色图像的能力。我们的方法采用基于卷积块注意力的自动编码器网络 (CA2N)，通过编码器-解码器架构中的块注意力机制有效捕捉和增强关键面部特征。随后，该框架利用噪声诱导的条件生成对抗网络 (cGAN) 过程，使系统即使在训练期间未见过的域上也能保持高性能。这些增强功能大大提高了图像的真实感和保真度，我们的模型实现了卓越的性能指标，在 CelebAMask-HQ、CUHK 和 CUFSF 数据集上分别以 17、23 和 38 的 FID 幅度超越最佳方法。该模型在素描到图像生成方面树立了新的领先地位，可以跨素描类型进行推广，并为执法中的犯罪分子识别等应用提供了强大的解决方案。

Title: 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Authors: Tejaswini Medi, Arianna Rampini, Pradyumna Reddy, Pradeep Kumar Jayaraman, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19037
Pdf URL: https://arxiv.org/pdf/2411.19037
Copy Paste: [[2411.19037]] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes(https://arxiv.org/abs/2411.19037)
Keywords: generation, generative
Abstract: Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on "next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional, class-conditioned, and text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the "next higher-resolution token map" in an autoregressive manner. By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation compared to traditional "next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.
摘要：自回归 (AR) 模型在自然语言和图像生成方面取得了显著的成功，但它们在 3D 形状建模中的应用仍未得到广泛探索。与扩散模型不同，AR 模型能够实现更高效、更可控的生成，推理时间更快，因此特别适合数据密集型领域。使用 AR 方法的传统 3D 生成模型通常依赖于体素或点级别的“下一个标记”预测。虽然这些方法对某些应用有效，但在处理大规模 3D 数据时，它们可能会受到限制且计算成本高昂。为了应对这些挑战，我们引入了 3D-WAG，这是一种用于 3D 隐式距离场的 AR 模型，可以执行无条件形状生成、类条件和文本条件形状生成。我们的主要思想是将形状编码为多尺度小波标记图，并使用 Transformer 以自回归方式预测“下一个更高分辨率的标记图”。通过将 3D AR 生成任务重新定义为“下一个规模”预测，与传统的“下一个标记”预测模型相比，我们降低了生成的计算成本，同时以更结构化和分层的方式保留了 3D 形状的基本几何细节。我们通过与广泛使用的基准上最先进的方法进行定量和定性比较来评估 3D-WAG，以展示其优势。我们的结果表明，3D-WAG 在覆盖率和 MMD 等关键指标方面取得了卓越的表现，生成了与真实数据分布紧密匹配的高保真 3D 形状。

Title: I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Authors: Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19050
Pdf URL: https://arxiv.org/pdf/2411.19050
Copy Paste: [[2411.19050]] I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting(https://arxiv.org/abs/2411.19050)
Keywords: generation
Abstract: Inpainting focuses on filling missing or corrupted regions of an image to blend seamlessly with its surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at this https URL.
摘要：修复侧重于填充图像中缺失或损坏的区域，使其与周围的内容和样式无缝融合。虽然条件扩散模型已被证明对文本引导的修复有效，但我们引入了多掩码修复的新任务，其中使用不同的提示同时修复多个区域。此外，我们为多模态 LLM（如 LLaVA）设计了一个微调程序，以使用损坏的图像作为输入自动生成多掩码提示。这些模型可以生成有用且详细的提示建议来填充蒙版区域。然后将生成的提示输入到稳定扩散，该扩散使用校正交叉注意针对多掩码修复问题进行微调，将提示强制执行到其指定区域进行填充。对来自 WikiArt 和密集标题图像数据集的数字化绘画进行的实验表明，我们的管道提供了富有创意且准确的修复结果。我们的代码、数据和训练有素的模型可在此 https URL 上获得。

Title: VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Authors: Jeongho Ju, Daeyoung Kim, SunYoung Park, Youngjune Kim
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.19103
Pdf URL: https://arxiv.org/pdf/2411.19103
Copy Paste: [[2411.19103]] VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models(https://arxiv.org/abs/2411.19103)
Keywords: generation
Abstract: In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities compared to models of similar size. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at this https URL.
摘要：在本文中，我们介绍了一种开源韩语-英语视觉语言模型 (VLM)，VARCO-VISION。我们采用了一种循序渐进的训练策略，使模型能够学习语言和视觉信息，同时保留骨干模型的知识。与类似大小的模型相比，我们的模型在需要双语图像文本理解和生成能力的各种环境中表现出色。VARCO-VISION 还能够进行基础、引用和 OCR，从而扩大了其在现实世界场景中的使用范围和潜在应用。除了该模型之外，我们还发布了五个韩语评估数据集，包括四个封闭集和一个开放集基准。我们预计，我们的里程碑将为旨在训练 VLM 的 AI 研究人员拓宽机会。VARCO-VISION 可在此 https URL 上找到。

Title: Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Authors: Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19108
Pdf URL: https://arxiv.org/pdf/2411.19108
Copy Paste: [[2411.19108]] Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model(https://arxiv.org/abs/2411.19108)
Keywords: generation
Abstract: As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings to ensure that their differences better approximate those of model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to indicate output caching. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% Vbench score) degradation of visual quality.
摘要：作为视频生成的基本支柱，由于去噪的顺序性，扩散模型面临着推理速度低的挑战。以前的方法通过在统一选择的时间步长上缓存和重用模型输出来加速模型。然而，这种策略忽略了模型输出之间的差异在时间步长上并不统一的事实，这阻碍了选择合适的模型输出进行缓存，导致推理效率和视觉质量之间的平衡不佳。在本研究中，我们引入了时间步长嵌入感知缓存 (TeaCache)，这是一种无需训练的缓存方法，可以估计和利用模型输出在时间步长之间的波动差异。TeaCache 不是直接使用耗时的模型输出，而是专注于模型输入，这些输入与模型输出具有很强的相关性，同时产生的计算成本可以忽略不计。TeaCache 首先使用时间步长嵌入来调节噪声输入，以确保它们的差异更好地接近模型输出的差异。然后，TeaCache 引入了一种重新缩放策略来改进估计的差异，并利用它们来指示输出缓存。实验表明，TeaCache 比 Open-Sora-Plan 实现了高达 4.41 倍的加速，而视觉质量的下降可以忽略不计（-0.07% Vbench 得分）。
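
The caching idea in the TeaCache abstract (estimate output change from cheap input differences, then decide whether to reuse a cached output) can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: the relative-L1 distance, the accumulate-until-threshold rule, the identity default for `rescale`, and the names `relative_l1` and `run_with_cache` are assumptions introduced here. Inputs are assumed to be already modulated by the timestep embedding and are represented as flat lists of floats.

```python
def relative_l1(curr, prev):
    """Relative L1 change between two consecutive (modulated) inputs."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-8  # guard against division by zero
    return num / den


def run_with_cache(modulated_inputs, model, threshold=0.1, rescale=lambda d: d):
    """Call `model` only when the accumulated estimated change is large.

    Assumption: `rescale` stands in for the rescaling strategy mentioned in
    the abstract; its real form is not given there.
    """
    outputs, computed_steps = [], []
    cached, prev, accumulated = None, None, 0.0
    for step, x in enumerate(modulated_inputs):
        if prev is not None:
            accumulated += rescale(relative_l1(x, prev))
        if cached is None or accumulated >= threshold:
            cached = model(x)          # expensive call
            accumulated = 0.0          # reset after refreshing the cache
            computed_steps.append(step)
        outputs.append(cached)         # otherwise reuse the cached output
        prev = x
    return outputs, computed_steps


# Toy "model" that doubles its input; only steps 0 and 3 change enough to recompute.
toy_inputs = [[1.0], [1.01], [1.02], [2.0], [2.0]]
toy_outputs, toy_steps = run_with_cache(toy_inputs, lambda x: [2 * v for v in x])
```

Raising `threshold` trades visual fidelity for fewer model calls, which mirrors the efficiency-versus-quality balance the abstract describes.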

Title: Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models

Authors: Chung-Ting Tsai, Ching-Yun Ko, I-Hsin Chung, Yu-Chiang Frank Wang, Pin-Yu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19117
Pdf URL: https://arxiv.org/pdf/2411.19117
Copy Paste: [[2411.19117]] Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models(https://arxiv.org/abs/2411.19117)
Keywords: generative
Abstract: The rapid advancement of generative models has introduced serious risks, including deepfake techniques for facial synthesis and editing. Traditional approaches rely on training classifiers and enhancing generalizability through various feature extraction techniques. Meanwhile, training-free detection methods address issues like limited data and overfitting by directly leveraging statistical properties from vision foundation models to distinguish between real and fake images. The current leading training-free approach, RIGID, utilizes DINOv2 sensitivity to perturbations in image space for detecting fake images, with fake image embeddings exhibiting greater sensitivity than those of real images. This observation prompts us to investigate how detection performance varies across model backbones, perturbation types, and datasets. Our experiments reveal that detection performance is closely linked to model robustness, with self-supervised (SSL) models providing more reliable representations. While Gaussian noise effectively detects general objects, it performs worse on facial images, whereas Gaussian blur is more effective due to potential frequency artifacts. To further improve detection, we introduce Contrastive Blur, which enhances performance on facial images, and MINDER (MINimum distance DetEctoR), which addresses noise type bias, balancing performance across domains. Beyond performance gains, our work offers valuable insights for both the generative and detection communities, contributing to a deeper understanding of model robustness property utilized for deepfake detection.
摘要：生成模型的快速发展带来了严重的风险，包括用于面部合成和编辑的深度伪造技术。传统方法依赖于训练分类器并通过各种特征提取技术增强通用性。同时，无需训练的检测方法通过直接利用视觉基础模型的统计特性来区分真假图像，解决了数据有限和过度拟合等问题。当前领先的无需训练方法 RIGID 利用 DINOv2 对图像空间扰动的敏感性来检测假图像，其中假图像嵌入比真实图像表现出更高的敏感性。这一观察促使我们研究检测性能在模型主干、扰动类型和数据集之间的差异。我们的实验表明，检测性能与模型鲁棒性密切相关，自监督 (SSL) 模型提供更可靠的表示。虽然高斯噪声可以有效地检测一般物体，但它在面部图像上的表现较差，而高斯模糊由于潜在的频率伪影而更有效。为了进一步提高检测能力，我们引入了对比模糊，以提高面部图像的性能，以及 MINDER（最小距离检测器），它可以解决噪声类型偏差，平衡各个域的性能。除了性能提升之外，我们的工作还为生成和检测社区提供了宝贵的见解，有助于更深入地了解用于深度伪造检测的模型鲁棒性属性。

Title: MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Authors: Daewon Yoon, Hyungsuk Lee, Wonsik Shin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19121
Pdf URL: https://arxiv.org/pdf/2411.19121
Copy Paste: [[2411.19121]] MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation(https://arxiv.org/abs/2411.19121)
Keywords: generation
Abstract: This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, in video generation, unlike single images, the movement of characters across frames introduces potential issues like distortion or unintended changes, which must be effectively evaluated and corrected. In the context of probabilistic models like diffusion, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.
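Note: the "select the best take by automated score" step described in this abstract might look like the sketch below. The abstract does not say how the individual criteria are combined, so the weighted mean, the criterion names, and the weights are all assumptions made for illustration.

```python
def aggregate_score(sub_scores, weights):
    """Weighted mean of per-criterion scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in sub_scores)
    weighted_sum = sum(sub_scores[name] * weights[name] for name in sub_scores)
    return weighted_sum / total_weight

def select_best_take(takes, weights):
    """Return the index of the candidate with the highest aggregate score."""
    return max(range(len(takes)), key=lambda i: aggregate_score(takes[i], weights))

# Hypothetical weights and per-sample scores for two sampled candidates.
weights = {"character_consistency": 1.0, "aesthetic_quality": 1.0, "prompt_alignment": 2.0}
takes = [
    {"character_consistency": 0.9, "aesthetic_quality": 0.8, "prompt_alignment": 0.4},
    {"character_consistency": 0.7, "aesthetic_quality": 0.7, "prompt_alignment": 0.9},
]
best = select_best_take(takes, weights)
```

Here the second candidate wins because prompt alignment is weighted twice as heavily as the other criteria; a different weighting would change the choice.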

Title: Deep Learning for GWP Prediction: A Framework Using PCA, Quantile Transformation, and Ensemble Modeling

Authors: Navin Rajapriya, Kotaro Kawajiri
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2411.19124
Pdf URL: https://arxiv.org/pdf/2411.19124
Copy Paste: [[2411.19124]] Deep Learning for GWP Prediction: A Framework Using PCA, Quantile Transformation, and Ensemble Modeling(https://arxiv.org/abs/2411.19124)
Keywords: generation
Abstract: Developing environmentally sustainable refrigerants is critical for mitigating the impact of anthropogenic greenhouse gases on global warming. This study presents a predictive modeling framework to estimate the 100-year global warming potential (GWP 100) of single-component refrigerants using a fully connected neural network implemented on the Multi-Sigma platform. Molecular descriptors from RDKit, Mordred, and alvaDesc were utilized to capture various chemical features. The RDKit-based model achieved the best performance, with a Root Mean Square Error (RMSE) of 481.9 and an R2 score of 0.918, demonstrating superior predictive accuracy and generalizability. Dimensionality reduction through Principal Component Analysis (PCA) and quantile transformation were applied to address the high-dimensional and skewed nature of the dataset, enhancing model stability and performance. Factor analysis identified vital molecular features, including molecular weight, lipophilicity, and functional groups, such as nitriles and allylic oxides, as significant contributors to GWP values. These insights provide actionable guidance for designing environmentally sustainable refrigerants. Integrating RDKit descriptors with Multi-Sigma's framework, which includes PCA, quantile transformation, and neural networks, provides a scalable solution for the rapid virtual screening of low-GWP refrigerants. This approach can potentially accelerate the identification of eco-friendly alternatives, directly contributing to climate mitigation by enabling the design of next-generation refrigerants aligned with global sustainability objectives.
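Note: the two metrics quoted in this abstract (RMSE and R2) follow the standard definitions sketched below. The sample values are invented for illustration and are unrelated to the paper's data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up GWP-like targets and predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
error = rmse(y_true, y_pred)
fit = r2_score(y_true, y_pred)
```

RMSE is expressed in the units of the target (GWP 100 here), so it should be read against the range of GWP values in the dataset, whereas R2 is unitless.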

Title: LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair

Authors: Xue Song, Jiequan Cui, Hanwang Zhang, Jiaxin Shi, Jingjing Chen, Chi Zhang, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19156
Pdf URL: https://arxiv.org/pdf/2411.19156
Copy Paste: [[2411.19156]] LoRA of Change: Learning to Generate LoRA for the Editing Instruction from A Single Before-After Image Pair(https://arxiv.org/abs/2411.19156)
Keywords: generation
Abstract: In this paper, we propose the LoRA of Change (LoC) framework for image editing with visual instructions, i.e., before-after image pairs. Compared to the ambiguities, insufficient specificity, and diverse interpretations of natural language, visual instructions can accurately reflect users' intent. Building on the success of LoRA in text-based image editing and generation, we dynamically learn an instruction-specific LoRA to encode the "change" in a before-after image pair, enhancing the interpretability and reusability of our model. Furthermore, generalizable models for image editing with visual instructions typically require quad data, i.e., a before-after image pair, along with query and target images. Due to the scarcity of such quad data, existing models are limited to a narrow range of visual instructions. To overcome this limitation, we introduce the LoRA Reverse optimization technique, enabling large-scale training with paired data alone. Extensive qualitative and quantitative experiments demonstrate that our model produces high-quality images that align with user intent and support a broad spectrum of real-world visual instructions.

Title: SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Authors: Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19182
Pdf URL: https://arxiv.org/pdf/2411.19182
Copy Paste: [[2411.19182]] SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation(https://arxiv.org/abs/2411.19182)
Keywords: generation, generative
Abstract: Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

Title: Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection

Authors: Tsun-Hin Cheung, Ka-Chun Fung, Songjiang Lai, Kwan-Ho Lin, Vincent Ng, Kin-Man Lam
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2411.19220
Pdf URL: https://arxiv.org/pdf/2411.19220
Copy Paste: [[2411.19220]] Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection(https://arxiv.org/abs/2411.19220)
Keywords: generation
Abstract: Identifying defects and anomalies in industrial products is a critical quality control task. Traditional manual inspection methods are slow, subjective, and error-prone. In this work, we propose a novel zero-shot training-free approach for automated industrial image anomaly detection using a multimodal machine learning pipeline, consisting of three foundation models. Our method first uses a large language model, i.e., GPT-3, to generate text prompts describing the expected appearances of normal and abnormal products. We then use a grounding object detection model, called Grounding DINO, to locate the product in the image. Finally, we compare the cropped product image patches to the generated prompts using a zero-shot image-text matching model, called CLIP, to identify any anomalies. Our experiments on two datasets of industrial product images, namely MVTec-AD and VisA, demonstrate the effectiveness of this method, achieving high accuracy in detecting various types of defects and anomalies without the need for model training. Our proposed model enables efficient, scalable, and objective quality control in industrial manufacturing settings.

Title: Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Authors: Xinxu Wei, Kanhao Zhao, Yong Jiao, Nancy B. Carlisle, Hua Xie, Yu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.19230
Pdf URL: https://arxiv.org/pdf/2411.19230
Copy Paste: [[2411.19230]] Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG(https://arxiv.org/abs/2411.19230)
Keywords: generative
Abstract: Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this by framing it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled/labeled and high/low-density EEG data. To fully leverage the abundant unlabeled EEG data, we introduce a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates Graph Contrastive Pre-training and Graph Masked Autoencoder Pre-training. This approach synergistically combines contrastive and generative pre-training techniques by reconstructing contrastive samples and contrasting the reconstructions. For knowledge distillation from high-density to low-density EEG data, we propose a Graph Topology Distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data, effectively handling missing electrodes through contrastive distillation. To integrate transfer learning and distillation, we jointly pre-train the teacher and student models by contrasting their queries and keys during pre-training, enabling robust distillers for downstream tasks. We demonstrate the effectiveness of our method on four classification tasks across two clinical EEG datasets with abundant unlabeled data and limited labeled data. The experimental results show that our approach significantly outperforms contemporary methods in both efficiency and accuracy.

Title: Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19231
Pdf URL: https://arxiv.org/pdf/2411.19231
Copy Paste: [[2411.19231]] Z-STAR+: A Zero-shot Style Transfer Method via Adjusting Style Distribution(https://arxiv.org/abs/2411.19231)
Keywords: generative
Abstract: Style transfer presents a significant challenge, primarily centered on identifying an appropriate style representation. Conventional methods employ style loss, derived from second-order statistics or contrastive learning, to constrain style representation in the stylized result. However, these pre-defined style representations often limit stylistic expression, leading to artifacts. In contrast to existing approaches, we have discovered that latent features in vanilla diffusion models inherently contain natural style and content distributions. This allows for direct extraction of style information and seamless integration of generative priors into the content image without necessitating retraining. Our method adopts dual denoising paths to represent content and style references in latent space, subsequently guiding the content image denoising process with style latent codes. We introduce a Cross-attention Reweighting module that utilizes local content features to query style image information best suited to the input patch, thereby aligning the style distribution of the stylized results with that of the style image. Furthermore, we design a scaled adaptive instance normalization to mitigate inconsistencies in color distribution between style and stylized images on a global scale. Through theoretical analysis and extensive experimentation, we demonstrate the effectiveness and superiority of our diffusion-based zero-shot style transfer via adjusting style distribution, termed Z-STAR+.
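Note: this abstract mentions a "scaled adaptive instance normalization" without defining the scaling. As background only, the sketch below shows plain adaptive instance normalization (AdaIN) on a single channel: the content features are re-centered and re-scaled to match the style statistics. It is the standard operation, not the paper's scaled variant, and the feature values are invented.

```python
import statistics

def adain(content, style, eps=1e-5):
    """Standard AdaIN for one channel: match content mean/std to the style's.

    content, style: lists of feature values from a single channel.
    eps avoids division by zero when the content channel is constant.
    """
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]

# Toy single-channel features.
content = [0.0, 1.0, 2.0, 3.0]
style = [10.0, 20.0, 30.0, 40.0]
stylized = adain(content, style)
```

After the transform, the channel carries the style's mean and (up to eps) its standard deviation while keeping the relative ordering of the content features.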

Title: Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

Authors: Thomas Wimmer, Michael Oechsle, Michael Niemeyer, Federico Tombari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19233
Pdf URL: https://arxiv.org/pdf/2411.19233
Copy Paste: [[2411.19233]] Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes(https://arxiv.org/abs/2411.19233)
Keywords: generative
Abstract: State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recently, novel video diffusion models generate realistic videos with complex motion and enable animations of 2D images; however, they cannot naively be used to animate 3D scenes, as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation, or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.

Title: Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation

Authors: Xuehao Cui, Guangyang Wu, Zhenghao Gan, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19246
Pdf URL: https://arxiv.org/pdf/2411.19246
Copy Paste: [[2411.19246]] Face2QR: A Unified Framework for Aesthetic, Face-Preserving, and Scannable QR Code Generation(https://arxiv.org/abs/2411.19246)
Keywords: generation
Abstract: Existing methods to generate aesthetic QR codes, such as image and style transfer techniques, tend to compromise either the visual appeal or the scannability of QR codes when they incorporate human face identity. Addressing these imperfections, we present Face2QR, a novel pipeline specifically designed for generating personalized QR codes that harmoniously blend aesthetics, face identity, and scannability. Our pipeline introduces three innovative components. First, the ID-refined QR integration (IDQR) seamlessly intertwines the background styling with face ID, utilizing a unified Stable Diffusion (SD)-based framework with control networks. Second, the ID-aware QR ReShuffle (IDRS) effectively rectifies the conflicts between face IDs and QR patterns, rearranging QR modules to maintain the integrity of facial features without compromising scannability. Lastly, the ID-preserved Scannability Enhancement (IDSE) markedly boosts scanning robustness through latent code optimization, striking a delicate balance between face ID, aesthetic quality and QR functionality. In comprehensive experiments, Face2QR demonstrates remarkable performance, outperforming existing approaches, particularly in preserving facial recognition features within custom QR code designs. Codes are available at this https URL.

Title: Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention

Authors: Huiguo He, Qiuyue Wang, Yuan Zhou, Yuxuan Cai, Hongyang Chao, Jian Yin, Huan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19261
Pdf URL: https://arxiv.org/pdf/2411.19261
Copy Paste: [[2411.19261]] Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention(https://arxiv.org/abs/2411.19261)
Keywords: generation
Abstract: Training-free diffusion models have achieved remarkable progress in generating multi-subject consistent images within open-domain scenarios. The key idea of these methods is to incorporate reference subject information within the attention layer. However, existing methods still obtain suboptimal performance when handling numerous subjects. This paper reveals the two primary issues contributing to this deficiency. Firstly, there is undesired interference among different subjects within the target image. Secondly, tokens tend to reference nearby tokens, which reduces the effectiveness of the attention mechanism when there is a significant positional difference between subjects in reference and target images. To address these challenges, we propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Specifically, Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating the subject fusion. On the other hand, Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images. This ensures that subjects in the target image can better reference those in the reference image, thereby maintaining better consistency. Extensive experiments demonstrate that the proposed methods significantly enhance multi-subject consistency, outperforming all existing methods in open-domain scenarios.
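Note: the Isolation Attention idea in this abstract (subjects in the target image do not reference each other) can be illustrated with a simple attention mask. This is a hedged sketch: the abstract does not describe how tokens are assigned to subjects or how background tokens are handled, so the per-token subject labels and the unrestricted-background rule are assumptions, and `isolation_mask` is a hypothetical name.

```python
def isolation_mask(subject_ids):
    """Build a boolean attention mask from per-token subject labels.

    mask[i][j] is True when query token i may attend to key token j.
    Tokens labelled None are treated as background and stay unrestricted
    (an assumption). Tokens of different subjects are blocked from each other.
    """
    n = len(subject_ids)
    mask = [[True] * n for _ in range(n)]
    for i, query_subject in enumerate(subject_ids):
        for j, key_subject in enumerate(subject_ids):
            if (query_subject is not None and key_subject is not None
                    and query_subject != key_subject):
                mask[i][j] = False
    return mask

# Four tokens: two from subject "A", one background, one from subject "B".
mask = isolation_mask(["A", "A", None, "B"])
```

In an attention layer, blocked positions would typically be set to negative infinity before the softmax so that they receive zero weight.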

Title: Trajectory Attention for Fine-grained Video Motion Control

Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19324
Pdf URL: https://arxiv.org/pdf/2411.19324
Copy Paste: [[2411.19324]] Trajectory Attention for Fine-grained Video Motion Control(https://arxiv.org/abs/2411.19324)
Keywords: generation
Abstract: Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.

Title: Libra: Leveraging Temporal Images for Biomedical Radiology Analysis

Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19378
Pdf URL: https://arxiv.org/pdf/2411.19378
Copy Paste: [[2411.19378]] Libra: Leveraging Temporal Images for Biomedical Radiology Analysis(https://arxiv.org/abs/2411.19378)
Keywords: generation
Abstract: Radiology report generation (RRG) is a challenging task, as it requires a thorough understanding of medical images, integration of multiple temporal inputs, and accurate report generation. Effective interpretation of medical images, such as chest X-rays (CXRs), demands sophisticated visual-language reasoning to map visual findings to structured reports. Recent studies have shown that multimodal large language models (MLLMs) can acquire multimodal capabilities by aligning with pre-trained vision encoders. However, current approaches predominantly focus on single-image analysis or utilise rule-based symbolic processing to handle multiple images, thereby overlooking the essential temporal information derived from comparing current images with prior ones. To overcome this critical limitation, we introduce Libra, a temporal-aware MLLM tailored for CXR report generation using temporal images. Libra integrates a radiology-specific image encoder with a MLLM and utilises a novel Temporal Alignment Connector to capture and synthesise temporal information of images across different time points with unprecedented precision. Extensive experiments show that Libra achieves new state-of-the-art performance among the same parameter scale MLLMs for RRG tasks on the MIMIC-CXR. Specifically, Libra improves the RadCliQ metric by 12.9% and makes substantial gains across all lexical metrics compared to previous models.

Title: DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Authors: Shwetha Ram, Tal Neiman, Qianli Feng, Andrew Stuart, Son Tran, Trishul Chilimbi
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19390
Pdf URL: https://arxiv.org/pdf/2411.19390
Copy Paste: [[2411.19390]] DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models(https://arxiv.org/abs/2411.19390)
Keywords: generation
Abstract: Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off is made between prompt fidelity, subject fidelity and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with low prompt fidelity and diversity but high subject fidelity. This inherent trade-off limits the prompt fidelity, subject fidelity and diversity of generated images. In this work, we propose DreamBlend to combine the prompt fidelity from earlier checkpoints and the subject fidelity from later checkpoints during inference. We perform a cross attention guided image synthesis from a later checkpoint, guided by an image generated by an earlier checkpoint, for the same prompt. This enables generation of images with better subject fidelity, prompt fidelity and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.

Title: AMO Sampler: Enhancing Text Rendering with Overshooting

Authors: Xixi Hu, Keyang Xu, Bo Liu, Qiang Liu, Hongliang Fei
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19415
Pdf URL: https://arxiv.org/pdf/2411.19415
Copy Paste: [[2411.19415]] AMO Sampler: Enhancing Text Rendering with Overshooting(https://arxiv.org/abs/2411.19415)
Keywords: generation
Abstract: Achieving precise alignment between textual instructions and generated images in text-to-image generation is a significant challenge, particularly in rendering written text within images. State-of-the-art models like Stable Diffusion 3 (SD3), Flux, and AuraFlow still struggle with accurate text depiction, resulting in misspelled or inconsistent text. We introduce a training-free method with minimal computational overhead that significantly enhances text rendering quality. Specifically, we introduce an overshooting sampler for pretrained rectified flow (RF) models, by alternating between over-simulating the learned ordinary differential equation (ODE) and reintroducing noise. Compared to the Euler sampler, the overshooting sampler effectively introduces an extra Langevin dynamics term that can help correct the compounding error from successive Euler steps and therefore improve the text rendering. However, when the overshooting strength is high, we observe over-smoothing artifacts on the generated images. To address this issue, we propose an Attention Modulated Overshooting sampler (AMO), which adaptively controls the strength of overshooting for each image patch according to their attention score with the text content. AMO demonstrates a 32.3% and 35.9% improvement in text rendering accuracy on SD3 and Flux without compromising overall image quality or increasing inference cost.
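Note: the "over-simulate the ODE, then reintroduce noise" loop in this abstract can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the interpolation convention x_t = t * data + (1 - t) * noise, the closed-form velocity for a single data point `mu` (standing in for a learned model), and the noise coefficients, which are derived from that convention rather than taken from the paper. The attention-modulated, per-patch overshooting strength is not modeled; `c` is a single global strength.

```python
import math
import random

def velocity(x, t, mu):
    """Exact rectified-flow velocity when the data is the single point mu
    and x_t = t * mu + (1 - t) * noise (toy stand-in for a learned model)."""
    return (mu - x) / (1.0 - t)

def overshoot_sample(mu, n_steps=10, c=1.0, rng=None):
    """Sample with overshooting; c = 0 reduces to the plain Euler sampler."""
    rng = rng or random.Random(0)
    x = rng.gauss(0.0, 1.0)  # start from pure noise at t = 0
    for i in range(n_steps):
        t, s = i / n_steps, (i + 1) / n_steps
        o = min(1.0, s + c * (s - t))           # overshoot time, clamped at 1
        x_o = x + (o - t) * velocity(x, t, mu)  # over-simulate the ODE from t to o
        a = s / o                               # rescale the signal from time o to s
        # Extra noise so the total noise level matches time s: (1 - s)
        b = math.sqrt(max(0.0, (1.0 - s) ** 2 - (a * (1.0 - o)) ** 2))
        x = a * x_o + b * rng.gauss(0.0, 1.0)   # reintroduce noise, landing at time s
    return x

sample = overshoot_sample(mu=3.0)
```

Because the toy velocity field is exact, both the Euler and overshooting variants land on `mu`; the benefit claimed in the abstract concerns learned, imperfect velocity fields, where the injected noise can help correct accumulated error.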

Title: Any-Resolution AI-Generated Image Detection by Spectral Learning

Authors: Dimitrios Karageorgiou, Symeon Papadopoulos, Ioannis Kompatsiaris, Efstratios Gavves
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.19417
Pdf URL: https://arxiv.org/pdf/2411.19417
Copy Paste: [[2411.19417]] Any-Resolution AI-Generated Image Detection by Spectral Learning(https://arxiv.org/abs/2411.19417)
Keywords: generative
Abstract: Recent works have established that AI models introduce spectral artifacts into generated images and propose approaches for learning to capture them using labeled data. However, the significant differences in such artifacts among different generative models hinder these approaches from generalizing to generators not seen during training. In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. To model this under a self-supervised setup, we employ masked spectral learning using the pretext task of frequency reconstruction. Since generated images constitute out-of-distribution samples for this model, we propose spectral reconstruction similarity to capture this divergence. Moreover, we introduce spectral context attention, which enables our approach to efficiently capture subtle spectral inconsistencies in images of any resolution. Our spectral AI-generated image detection approach (SPAI) achieves a 5.5% absolute improvement in AUC over the previous state-of-the-art across 13 recent generative approaches, while exhibiting robustness against common online perturbations.

Title: Fleximo: Towards Flexible Text-to-Human Motion Video Generation

Authors: Yuhang Zhang, Yuan Zhou, Zeyu Liu, Yuxuan Cai, Qiuyue Wang, Aidong Men, Huan Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19459
Pdf URL: https://arxiv.org/pdf/2411.19459
Copy Paste: [[2411.19459]] Fleximo: Towards Flexible Text-to-Human Motion Video Generation(https://arxiv.org/abs/2411.19459)
Keywords: generation
Abstract: Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.
摘要：目前生成人体运动视频的方法依赖于从参考视频中提取姿势序列，这限制了灵活性和控制力。此外，由于姿势检测技术的局限性，提取的姿势序列有时可能不准确，导致视频输出质量低下。我们引入了一项新任务，旨在仅从参考图像和自然语言生成人体运动视频。这种方法提供了更大的灵活性和易用性，因为文本比所需的指导视频更容易理解。然而，为这项任务训练一个端到端模型需要数百万个高质量的文本和人体运动视频对，而这些对很难获得。为了解决这个问题，我们提出了一个名为 Fleximo 的新框架，它利用了大规模预训练的文本到 3D 运动模型。这种方法并不简单，因为文本生成的骨架可能与参考图像的比例不一致，可能缺乏详细信息。为了克服这些挑战，我们引入了一种基于锚点的重新缩放方法，并设计了一个骨架适配器来填补缺失的细节，弥合文本到运动和运动到视频生成之间的差距。我们还提出了一种视频细化过程，以进一步提高视频质量。采用大型语言模型 (LLM) 将自然语言分解为离散的运动序列，从而能够生成任意长度的运动视频。为了评估 Fleximo 的性能，我们引入了一个名为 MotionBench 的新基准，其中包括 20 个身份和 20 个动作的 400 个视频。我们还提出了一个新的指标 MotionScore，以评估运动跟踪的准确性。定性和定量结果都表明，我们的方法优于现有的文本条件图像到视频生成方法。所有代码和模型权重都将公开。

Title: V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Authors: Jeongsoo Choi, Ji-Hoon Kim, Jinyu Li, Joon Son Chung, Shujie Liu
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19486
Pdf URL: https://arxiv.org/pdf/2411.19486
Copy Paste: [[2411.19486]] V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow(https://arxiv.org/abs/2411.19486)
Keywords: generation
Abstract: In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a Transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground truth utterances.
摘要：在本文中，我们介绍了 V2SFlow，这是一种新颖的视频转语音 (V2S) 框架，旨在直接从无声说话面部视频生成自然且清晰的语音。虽然最近的 V2S 系统在说话者和词汇量有限的受限数据集上显示出了良好的效果，但由于语音信号固有的可变性和复杂性，它们在现实世界中不受约束的数据集上的性能通常会下降。为了应对这些挑战，我们将语音信号分解为可管理的子空间（内容、音调和说话者信息），每个子空间代表不同的语音属性，并直接从视觉输入中预测它们。为了从这些预测属性生成连贯而逼真的语音，我们采用了基于 Transformer 架构构建的整流匹配解码器，该解码器对从随机噪声到目标语音分布的有效概率路径进行建模。大量实验表明，V2SFlow 的表现明显优于最先进的方法，甚至超过了真实话语的自然性。
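
As a rough illustration of the rectified flow idea mentioned in the abstract, the sketch below shows the standard straight-line interpolation between a noise sample and a data sample, the constant velocity a network would be trained to regress, and Euler integration at sampling time. It is a toy on plain lists, not the V2SFlow decoder; the `oracle` velocity function stands in for a trained Transformer and is an assumption.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0, 2.0]   # stand-in for a Gaussian noise sample
data = [1.5, 0.0, -2.0]    # stand-in for a target feature vector


def oracle(x, t):
    """Hypothetical perfect velocity field; a trained network would go here."""
    return target_velocity(noise, data)


sample = euler_sample(noise, oracle, steps=4)
```

Because the ideal path is straight, a few Euler steps already land on the target in this toy case, which is the usual argument for why rectified flow decoders can sample with few steps.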

Title: Interleaved-Modal Chain-of-Thought

Authors: Jun Gao, Yongqi Li, Ziqiang Cao, Wenjie Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19488
Pdf URL: https://arxiv.org/pdf/2411.19488
Copy Paste: [[2411.19488]] Interleaved-Modal Chain-of-Thought(https://arxiv.org/abs/2411.19488)
Keywords: generation
Abstract: Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express the fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention maps of VLMs without the need for parameterization, and therefore it is a plug-and-play strategy that can be generalized to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial improvements in performance (up to 14%) and interpretability compared to existing multimodal CoT prompting methods.
摘要：思路链 (CoT) 提示会引发大型语言模型 (LLM) 产生一系列中间推理步骤，然后才能得出最终答案。然而，当转换为视觉语言模型 (VLM) 时，它们的纯文本原理难以表达与原始图像的细粒度关联。在本文中，我们提出了一种结合图像的多模态思路链，称为交错模态思路链 (ICoT)，它生成由成对的视觉和文本原理组成的顺序推理步骤来推断最终答案。直观地说，新颖的 ICoT 需要 VLM 能够生成细粒度的交错模态内容，而这对于当前的 VLM 来说很难实现。考虑到所需的视觉信息通常是输入图像的一部分，我们提出注意力驱动选择 (ADS) 来实现现有 VLM 上的 ICoT。ADS 智能地插入输入图像的区域，以生成交错模态推理步骤，且额外延迟可忽略不计。ADS 完全依赖于 VLM 的注意力图，无需参数化，因此它是一种即插即用策略，可以推广到一系列 VLM。我们应用 ADS 在两种不同架构的流行 VLM 上实现 ICoT。对三个基准的广泛评估表明，与现有的多模态 CoT 提示方法相比，ICoT 提示实现了显着的性能（高达 14%）和可解释性改进。
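
The abstract says ADS picks image regions using only the VLM's attention map, without stating the selection rule. One plausible reading, sketched below under that assumption, is a top-k selection over per-patch attention scores. The grid values and the function name `select_regions` are hypothetical.

```python
def select_regions(attention, k):
    """Return (row, col) indices of the k patches with the highest attention.

    `attention` is a 2D grid of per-patch scores, e.g. the attention that the
    current text token pays to each image patch (an assumed input format).
    """
    scored = [
        (score, (r, c))
        for r, row in enumerate(attention)
        for c, score in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Return positions in raster order so the patches can be re-inserted in
    # a stable layout.
    return sorted(pos for _, pos in scored[:k])


attention_map = [
    [0.01, 0.02, 0.05],
    [0.10, 0.40, 0.20],
    [0.02, 0.15, 0.05],
]
picked = select_regions(attention_map, k=3)
# picked == [(1, 1), (1, 2), (2, 1)]
```

Since the scores are read off an existing attention map, a selection step like this adds no trainable parameters, which matches the plug-and-play claim in the abstract.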

Title: Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

Authors: Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19509
Pdf URL: https://arxiv.org/pdf/2411.19509
Copy Paste: [[2411.19509]] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis(https://arxiv.org/abs/2411.19509)
Keywords: generation
Abstract: Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.
摘要：扩散模型的最新进展彻底改变了音频驱动的说话头合成。除了精确的唇部同步之外，基于扩散的方法在生成与音频信号高度一致的微妙表情和自然头部运动方面表现出色。然而，这些方法面临着推理速度慢、对面部运动的细粒度控制不足以及偶尔出现的视觉伪影（这主要是由于变分自编码器 (VAE) 派生的隐式潜在空间）的问题，这阻碍了它们在实时交互应用中的采用。为了解决这些问题，我们引入了 Ditto，这是一个基于扩散的框架，可以实现可控的实时说话头合成。我们的关键创新在于通过显式的身份无关运动空间连接运动生成和逼真的神经渲染，取代传统的 VAE 表示。这种设计大大降低了扩散学习的复杂性，同时实现了对合成说话头的精确控制。我们进一步提出了一种推理策略，可以联合优化三个关键组件：音频特征提取、运动生成和视频合成。这项优化实现了流式处理、实时推理和低首帧延迟，这些功能对于 AI 助手等交互式应用至关重要。大量实验结果表明，Ditto 可以生成引人注目的头部说话视频，并且在运动控制和实时性能方面远远优于现有方法。

Title: DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Authors: Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19527
Pdf URL: https://arxiv.org/pdf/2411.19527
Copy Paste: [[2411.19527]] DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding(https://arxiv.org/abs/2411.19527)
Keywords: generative
Abstract: Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: this https URL.
摘要：人体运动本质上是连续和动态的，这对生成模型提出了重大挑战。尽管离散量化方法（例如 VQ-VAE）占据主导地位，但它们也存在固有的局限性，包括表达能力受限和逐帧噪声伪影。连续方法虽然可以产生更平滑、更自然的运动，但由于高维复杂性和有限的训练数据，往往会失败。为了解决离散和连续表示之间的这种“不一致”，我们引入了 DisCoRD：通过整流解码将离散标记转换为连续运动，这是一种通过整流将离散运动标记解码为连续运动的新方法。通过在连续空间中采用迭代细化过程，DisCoRD 可以捕捉细粒度的动态并确保更平滑、更自然的运动。我们的方法与任何基于离散的框架兼容，可以在不影响对调节信号的忠实度的情况下增强自然度。大量评估表明，DisCoRD 实现了最先进的性能，在 HumanML3D 上的 FID 为 0.032，在 KIT-ML 上的 FID 为 0.169。这些结果巩固了 DisCoRD 作为弥合离散效率和连续现实之间鸿沟的强大解决方案的地位。我们的项目页面位于：此 https URL。

Title: RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation

Authors: Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19528
Pdf URL: https://arxiv.org/pdf/2411.19528
Copy Paste: [[2411.19528]] RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation(https://arxiv.org/abs/2411.19528)
Keywords: generation, generative
Abstract: Standard clothing asset generation involves creating forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized sampling distributions and precise structural requirements in the generated images. Existing models have limited spatial perception and often exhibit structural hallucinations in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating external knowledge from LLMs and databases. RAGDiffusion consists of two core processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a three-level alignment that ensures fidelity in structural, pattern, and decoding components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and detail-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
摘要：标准服装资产生成涉及通过从各种现实世界环境中提取服装信息来创建在清晰背景上显示的正面平铺服装图像，这带来了重大挑战，因为生成的图像中采样分布高度标准化并且结构要求精确。现有模型的空间感知有限，并且经常在这种高规格生成任务中表现出结构幻觉。为了解决这个问题，我们提出了一种新颖的检索增强生成 (RAG) 框架，称为 RAGDiffusion，通过吸收来自 LLM 和数据库的外部知识来增强结构确定性并减轻幻觉。RAGDiffusion 包括两个核心过程：(1) 基于检索的结构聚合，它采用对比学习和结构局部线性嵌入 (SLLE) 来得出全局结构和空间标志，提供软指导和硬指导以抵消结构模糊性；(2) 全方位忠实服装生成，它引入了三级对齐，以确保扩散内结构、模式和解码组件的保真度。在具有挑战性的真实世界数据集上进行的大量实验表明，RAGDiffusion 合成了结构和细节忠实的服装资产，并显著提高了性能，代表了使用 RAG 进行高规格忠实生成的开创性努力，以对抗内在幻觉并提高保真度。

Title: QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain

Authors: Wenfang Sun, Yingjun Du, Gaowen Liu, Cees G. M. Snoek
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19534
Pdf URL: https://arxiv.org/pdf/2411.19534
Copy Paste: [[2411.19534]] QUOTA: Quantifying Objects with Text-to-Image Models for Any Domain(https://arxiv.org/abs/2411.19534)
Keywords: generation, generative
Abstract: We tackle the problem of quantifying the number of objects by a generative text-to-image model. Rather than retraining such a model for each new image domain of interest, which leads to high computational costs and limited scalability, we are the first to consider this problem from a domain-agnostic perspective. We propose QUOTA, an optimization framework for text-to-image models that enables effective object quantification across unseen domains without retraining. It leverages a dual-loop meta-learning strategy to optimize a domain-invariant prompt. Further, by integrating prompt learning with learnable counting and domain tokens, our method captures stylistic variations and maintains accuracy, even for object classes not encountered during training. For evaluation, we adopt a new benchmark specifically designed for object quantification in domain generalization, enabling rigorous assessment of object quantification accuracy and adaptability across unseen domains in text-to-image generation. Extensive experiments demonstrate that QUOTA outperforms conventional models in both object quantification accuracy and semantic consistency, setting a new benchmark for efficient and scalable text-to-image generation for any domain.
摘要：我们通过生成文本到图像模型来解决量化对象数量的问题。我们不是为每个感兴趣的新图像域重新训练这样的模型，因为这会导致高昂的计算成本和有限的可扩展性，而是首先从领域无关的角度考虑这个问题。我们提出了 QUOTA，这是一个文本到图像模型的优化框架，无需重新训练即可在看不见的域中实现有效的对象量化。它利用双环元学习策略来优化领域不变提示。此外，通过将提示学习与可学习的计数和域标记相结合，我们的方法可以捕捉风格变化并保持准确性，即使对于训练期间未遇到的对象类别也是如此。为了进行评估，我们采用了专门为领域泛化中的对象量化设计的新基准，从而能够严格评估文本到图像生成中看不见的域中的对象量化准确性和适应性。大量实验表明，QUOTA 在对象量化准确性和语义一致性方面均优于传统模型，为任何领域的高效、可扩展的文本到图像生成树立了新的基准。

Title: Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook

Authors: Florinel-Alin Croitoru, Andrei-Iulian Hiji, Vlad Hondru, Nicolae Catalin Ristea, Paul Irofti, Marius Popescu, Cristian Rusu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Mubarak Shah
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2411.19537
Pdf URL: https://arxiv.org/pdf/2411.19537
Copy Paste: [[2411.19537]] Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook(https://arxiv.org/abs/2411.19537)
Keywords: generation, generative
Abstract: With the recent advancements in generative modeling, the realism of deepfake content has been increasing at a steady pace, even reaching the point where people often fail to detect manipulated media content online, thus being deceived into various kinds of scams. In this paper, we survey deepfake generation and detection techniques, including the most recent developments in the field, such as diffusion models and Neural Radiance Fields. Our literature review covers all deepfake media types, comprising image, video, audio and multimodal (audio-visual) content. We identify various kinds of deepfakes, according to the procedure used to alter or generate the fake content. We further construct a taxonomy of deepfake generation and detection methods, illustrating the important groups of methods and the domains where these methods are applied. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing deepfake detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfake content generated by unseen deepfake generators. Finally, we propose future directions to obtain robust and powerful deepfake detectors. Our project page and new benchmark are available at this https URL.
摘要：随着生成模型的最新进展，深度伪造内容的真实性一直在稳步提升，甚至达到了人们经常无法检测到在线操纵的媒体内容的地步，从而被各种骗局欺骗。在本文中，我们调查了深度伪造的生成和检测技术，包括该领域的最新发展，例如扩散模型和神经辐射场。我们的文献综述涵盖了所有深度伪造媒体类型，包括图像、视频、音频和多模态（视听）内容。我们根据用于修改或生成虚假内容的程序来识别各种类型的深度伪造。我们进一步构建了深度伪造生成和检测方法的分类法，说明了重要的方法组以及应用这些方法的领域。接下来，我们收集用于深度伪造检测的数据集，并提供在最流行的数据集上表现最佳的深度伪造检测器的最新排名。此外，我们还开发了一种新颖的多模态基准来评估深度伪造检测器在分发范围外的内容上的表现。结果表明，最先进的检测器无法推广到由未见过的深度伪造生成器生成的深度伪造内容。最后，我们提出了获得强大而强大的深度伪造检测器的未来方向。我们的项目页面和新基准可在此 https URL 上找到。

Title: ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

Authors: Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Guan Huang, Chen Liu, Yuyin Chen, Yida Wang, Xueyang Zhang, Yifei Zhan, Kun Zhan, Peng Jia, Xianpeng Lang, Xingang Wang, Wenjun Mei
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2411.19548
Pdf URL: https://arxiv.org/pdf/2411.19548
Copy Paste: [[2411.19548]] ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration(https://arxiv.org/abs/2411.19548)
Keywords: restoration
Abstract: Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent works have demonstrated that integrating world model knowledge alleviates these issues. Despite their efficiency, these approaches still encounter difficulties in the accurate representation of more complex maneuvers, with multi-lane shifts being a notable example. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, DriveRestorer is proposed to mitigate artifacts via online restoration. This is complemented by a progressive data update strategy designed to ensure high-quality rendering for more complex maneuvers. To the best of our knowledge, ReconDreamer is the first method to render large maneuvers effectively. Experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU, NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%, respectively. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.
摘要：闭环模拟对于端到端自动驾驶至关重要。现有的传感器模拟方法（例如 NeRF 和 3DGS）基于与训练数据分布密切相关的条件重建驾驶场景。然而，这些方法在渲染新轨迹（例如车道变换）时会遇到困难。最近的研究表明，整合世界模型知识可以缓解这些问题。尽管这些方法效率很高，但它们在准确表示更复杂的动作方面仍然遇到困难，多车道变换就是一个显著的例子。因此，我们引入了 ReconDreamer，它通过逐步整合世界模型知识来增强驾驶场景重建。具体来说，我们提出了 DriveRestorer 来通过在线恢复来减轻伪影。此外，我们还采用了渐进式数据更新策略，旨在确保更复杂动作的高质量渲染。据我们所知，ReconDreamer 是第一种有效渲染大型动作的方法。实验结果表明，ReconDreamer 在 NTA-IoU、NTL-IoU 和 FID 上的表现均优于 Street Gaussians，相对提升幅度分别为 24.87%、6.72% 和 29.97%。此外，ReconDreamer 在大动作渲染中超越了使用 PVG 的 DriveDreamer4D，这已通过 NTA-IoU 指标相对提升 195.87% 和全面的用户研究得到证实。
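
The abstract reports relative rather than absolute improvements, and it mixes higher-is-better metrics (NTA-IoU, NTL-IoU) with a lower-is-better one (FID). The helper below shows the usual way such percentages are computed. The input scores are made-up numbers for illustration only and are not taken from the paper.

```python
def relative_improvement(baseline, ours, higher_is_better=True):
    """Relative improvement over a baseline score, in percent."""
    gain = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * gain / baseline


# Hypothetical scores, chosen only to show the arithmetic.
iou_gain = relative_improvement(0.40, 0.50)                            # about 25 %
fid_gain = relative_improvement(100.0, 70.0, higher_is_better=False)   # about 30 %
large_gain = relative_improvement(0.10, 0.30)                          # about 200 %
```

A relative improvement above 100% is therefore not an error: a gain of 195.87% means the new score is roughly 2.96 times the baseline score.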

Title: Tortho-Gaussian: Splatting True Digital Orthophoto Maps

Authors: Xin Wang, Wendi Zhang, Hong Xie, Haibin Ai, Qiangqiang Yuan, Zongqian Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.19594
Pdf URL: https://arxiv.org/pdf/2411.19594
Copy Paste: [[2411.19594]] Tortho-Gaussian: Splatting True Digital Orthophoto Maps(https://arxiv.org/abs/2411.19594)
Keywords: generation
Abstract: True Digital Orthophoto Maps (TDOMs) are essential products for digital twins and Geographic Information Systems (GIS). Traditionally, TDOM generation involves a complex set of photogrammetric processes, whose results may deteriorate due to various challenges, including an inaccurate Digital Surface Model (DSM), degenerate occlusion detection, and visual artifacts in weakly textured regions and on reflective surfaces. To address these challenges, we introduce TOrtho-Gaussian, a novel method inspired by 3D Gaussian Splatting (3DGS) that generates TDOMs through orthogonal splatting of optimized anisotropic Gaussian kernels. More specifically, we first simplify orthophoto generation by orthographically splatting the Gaussian kernels onto 2D image planes, formulating a geometrically elegant solution that avoids the need for an explicit DSM and occlusion detection. Second, to produce TDOMs of large-scale areas, a divide-and-conquer strategy is adopted to optimize the memory usage and time efficiency of training and rendering for 3DGS. Lastly, we design a fully anisotropic Gaussian kernel that adapts to the varying characteristics of different regions, particularly improving the rendering quality of reflective surfaces and slender structures. Extensive experimental evaluations demonstrate that our method outperforms existing commercial software in several aspects, including the accuracy of building boundaries and the visual quality of low-texture regions and building facades. These results underscore the potential of our approach for large-scale urban scene reconstruction, offering a robust alternative for enhancing TDOM quality and scalability.
摘要：真正的数字正射影像地图 (TDOM) 是数字孪生和地理信息系统 (GIS) 必不可少的产品。传统上，TDOM 生成涉及一组复杂的传统摄影测量过程，该过程可能会因各种挑战而恶化，包括不准确的数字表面模型 (DSM)、退化的遮挡检测以及弱纹理区域和反射表面中的视觉伪影等。为了应对这些挑战，我们引入了 TOrtho-Gaussian，这是一种受 3D 高斯溅射 (3DGS) 启发的新方法，它通过优化的各向异性高斯核的正交溅射来生成 TDOM。更具体地说，我们首先通过将高斯核正交溅射到 2D 图像平面上来简化正射影像生成，制定一个几何优雅的解决方案，避免了显式 DSM 和遮挡检测的需要。其次，为了生成大规模区域的 TDOM，我们采用分而治之的策略来优化 3DGS 的内存使用和训练与渲染的时间效率。最后，我们设计了一个完全各向异性的高斯核，以适应不同区域的不同特征，特别是提高反射表面和细长结构的渲染质量。大量的实验评估表明，我们的方法在多个方面优于现有的商业软件，包括建筑物边界的准确性、低纹理区域的视觉质量和建筑物立面。这些结果强调了我们的方法在大规模城市场景重建中的潜力，为提高 TDOM 质量和可扩展性提供了一种强大的替代方案。
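
To make the "orthographic splatting" step concrete, here is a small sketch of how a 3D Gaussian maps to a 2D Gaussian under an orthographic projection along the z axis. It assumes a nadir view aligned with the world z axis and ignores ground sampling distance and any per-pixel blending. These are simplifications chosen for illustration, not details taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


# Orthographic projection that drops the z coordinate (assumed view direction).
P = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]


def orthographic_splat(mean, cov):
    """Project a 3D Gaussian (mean, 3x3 covariance) onto the x-y image plane.

    For a linear map P, the projected Gaussian has mean P @ mean and
    covariance P @ cov @ P^T.
    """
    mean_2d = [sum(p * m for p, m in zip(row, mean)) for row in P]
    cov_2d = matmul(matmul(P, cov), transpose(P))
    return mean_2d, cov_2d


mean_3d = [10.0, 20.0, 30.0]
cov_3d = [[4.0, 1.0, 0.5],
          [1.0, 3.0, 0.2],
          [0.5, 0.2, 2.0]]
mean_2d, cov_2d = orthographic_splat(mean_3d, cov_3d)
# mean_2d == [10.0, 20.0]; cov_2d == [[4.0, 1.0], [1.0, 3.0]]
```

Because the projection is linear, no local (Jacobian-based) approximation is needed, unlike perspective splatting. The footprint of a kernel also does not depend on its height, which is the property an orthophoto requires.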

Title: FairDD: Fair Dataset Distillation via Synchronized Matching

Authors: Qihang Zhou, Shenhao Fang, Shibo He, Wenchao Meng, Jiming Chen
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19623
Pdf URL: https://arxiv.org/pdf/2411.19623
Copy Paste: [[2411.19623]] FairDD: Fair Dataset Distillation via Synchronized Matching(https://arxiv.org/abs/2411.19623)
Keywords: generation
Abstract: Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches, requiring no modifications to their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This synchronized matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DD methods, without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach.
摘要：将大型数据集压缩为较小的合成数据集已显示出其在图像分类方面的前景。然而，先前的研究忽略了图像识别中的一个关键问题：确保在压缩数据集上训练的模型对受保护属性 (PA)（例如性别和种族）不产生偏见。我们的调查显示，数据集蒸馏 (DD) 无法减轻原始数据集中对少数群体的不公平现象。此外，由于压缩数据集的规模较小，这种偏见通常会在压缩数据集中恶化。为了弥补研究差距，我们提出了一种新颖的公平数据集蒸馏 (FDD) 框架，即 FairDD，它可以无缝应用于各种基于匹配的 DD 方法，而无需修改其原始架构。FairDD 的关键创新在于将合成数据集同步匹配到原始数据集的 PA 组，而不是不加区分地与由多数群体主导的 vanilla DD 中的整个分布对齐。这种同步匹配允许合成数据集避免陷入多数群体，并将其平衡生成引导到所有 PA 组。因此，FairDD 可以有效地规范原始 DD，使其偏向于少数群体的生成，同时保持目标属性的准确性。理论分析和大量实验评估表明，与原始 DD 方法相比，FairDD 显著提高了公平性，同时不牺牲分类准确性。它在分布和梯度匹配等各种 DD 中始终如一的优势使其成为一种多功能的 FDD 方法。
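
The difference between matching the whole distribution and matching each protected-attribute group can be seen in a one-dimensional toy. The sketch below is an assumption-laden illustration, not FairDD's actual loss: it uses scalar features and squared distance to group means, and compares the minimizer of a pooled matching objective with the minimizer of a per-group objective.

```python
def mean(values):
    return sum(values) / len(values)


def pooled_target(groups):
    """Minimizer of (s - mean of all samples)^2: dominated by large groups."""
    pooled = [v for samples in groups.values() for v in samples]
    return mean(pooled)


def synchronized_target(groups):
    """Minimizer of sum over groups g of (s - mean_g)^2.

    Setting the derivative to zero gives the plain average of the group
    means, so every group pulls on the synthetic sample with equal weight.
    """
    return mean([mean(samples) for samples in groups.values()])


# Hypothetical 1D features: nine majority samples at 0, one minority sample at 10.
groups = {"majority": [0.0] * 9, "minority": [10.0]}
pooled = pooled_target(groups)              # 1.0
synchronized = synchronized_target(groups)  # 5.0
```

In this toy the pooled target sits almost on top of the majority group, while the per-group target lies midway between the two groups, regardless of how many samples each group contains.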

Title: Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing

Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19652
Pdf URL: https://arxiv.org/pdf/2411.19652
Copy Paste: [[2411.19652]] Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing(https://arxiv.org/abs/2411.19652)
Keywords: generation
Abstract: Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at this https URL.
摘要：使用扩散模型进行文本引导的图像生成和编辑取得了显著的进步。其中，无需调整的方法因其无需大量模型调整即可进行编辑的能力而受到关注，提供了简单性和效率。然而，现有的无需调整的方法往往难以平衡保真度和编辑精度。DDIM 反演中的重建误差部分归因于 U-Net 中的交叉注意机制，该机制在反演和重建过程中引入了错位。为了解决这个问题，我们从结构角度分析了重建，并提出了一种新方法，用统一的注意力图取代传统的交叉注意，显著提高了图像重建的保真度。我们的方法有效地最大限度地减少了噪声预测过程中文本条件变化造成的扭曲。为了补充这一改进，我们引入了一种自适应掩码引导编辑技术，该技术与我们的重建方法无缝集成，确保编辑任务的一致性和准确性。实验结果表明，我们的方法不仅在实现高保真图像重建方面表现出色，而且在真实的图像合成和编辑场景中也表现出色。这项研究强调了统一注意力图在增强基于扩散的图像处理方法的保真度和多功能性方面的潜力。代码可在此 https URL 上获取。
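
The core substitution described in the abstract, replacing text-dependent cross-attention weights with uniform ones, can be sketched in a few lines. This is a single-query, single-head toy on plain lists; where and when the paper applies the substitution inside the U-Net is not stated in the abstract and is not modeled here.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cross_attention(query, keys, values):
    """Standard scaled dot-product attention for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return attend(softmax(scores), values)


def uniform_attention(values):
    """Every token gets weight 1/N, so the output ignores queries and keys."""
    n = len(values)
    return attend([1.0 / n] * n, values)


values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
query = [1.0, 0.0]
keys_a = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
keys_b = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]

out_a = cross_attention(query, keys_a, values)
out_b = cross_attention(query, keys_b, values)
out_uniform = uniform_attention(values)   # [1.0, 1.0], the mean of the values
```

With softmax weights the output shifts when the keys change, whereas the uniform variant always returns the mean of the value vectors. That independence from the query-key match is what removes one source of mismatch between inversion and reconstruction in this simplified picture.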

Title: TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting

Authors: Bojun Xiong, Jialun Liu, Jiakui Hu, Chenming Wu, Jinbo Wu, Xing Liu, Chen Zhao, Errui Ding, Zhouhui Lian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19654
Pdf URL: https://arxiv.org/pdf/2411.19654
Copy Paste: [[2411.19654]] TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting(https://arxiv.org/abs/2411.19654)
Keywords: generation
Abstract: Physically Based Rendering (PBR) materials play a crucial role in modern graphics, enabling photorealistic rendering across diverse environment maps. Developing an effective and efficient algorithm that is capable of automatically generating high-quality PBR materials rather than RGB texture for 3D meshes can significantly streamline the 3D content creation. Most existing methods leverage pre-trained 2D diffusion models for multi-view image synthesis, which often leads to severe inconsistency between the generated textures and input 3D meshes. This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. Specifically, we place each 3D Gaussian on the finest leaf node of the octree built from the input 3D mesh to render the multiview images not only for the albedo map but also for roughness and metallic. Moreover, our model is trained in a regression manner instead of diffusion denoising, capable of generating the PBR material for a 3D mesh in a single feed-forward process. Extensive experiments on publicly available benchmarks demonstrate that our method synthesizes more visually pleasing PBR materials and runs faster than previous methods in both unconditional and text-conditional scenarios, which exhibit better consistency with the given geometry. Our code and trained models are available at this https URL.
摘要：基于物理的渲染 (PBR) 材料在现代图形中起着至关重要的作用，可实现跨各种环境地图的逼真渲染。开发一种能够自动生成高质量 PBR 材料（而不是 RGB 纹理）的有效且高效的算法可以显著简化 3D 内容创建。大多数现有方法利用预先训练的 2D 扩散模型进行多视图图像合成，这通常会导致生成的纹理与输入的 3D 网格之间严重不一致。本文介绍了 TexGaussian，这是一种使用八分圆对齐的 3D 高斯 Splatting 快速生成 PBR 材料的新方法。具体而言，我们将每个 3D 高斯放置在从输入 3D 网格构建的八叉树的最细叶节点上，以渲染多视图图像，不仅用于反照率图，还用于粗糙度和金属度。此外，我们的模型以回归方式而不是扩散去噪进行训练，能够在单个前馈过程中为 3D 网格生成 PBR 材料。在公开可用的基准上进行的大量实验表明，我们的方法可以合成更美观的 PBR 材料，并且在无条件和文本条件场景中都比以前的方法运行得更快，并且与给定的几何图形具有更好的一致性。我们的代码和经过训练的模型可在此 https URL 上找到。

Title: JetFormer: An Autoregressive Generative Model of Raw Images and Text

Authors: Michael Tschannen, André Susano Pinto, Alexander Kolesnikov
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.19722
Pdf URL: https://arxiv.org/pdf/2411.19722
Copy Paste: [[2411.19722]] JetFormer: An Autoregressive Generative Model of Raw Images and Text(https://arxiv.org/abs/2411.19722)
Keywords: generation, generative
Abstract: Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer - JetFormer - which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds.
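Summary: Removing modeling constraints and unifying architectures across domains has been a key driver of recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose JetFormer, an autoregressive decoder-only transformer that is trained to directly maximize the likelihood of raw data without relying on any separately pretrained components, and that can both understand and generate text and images. Specifically, we use a normalizing flow model to obtain a soft-token image representation that is trained jointly with an autoregressive multimodal transformer. At inference time, the normalizing flow serves both as an image encoder for perception tasks and as an image decoder for image generation tasks. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model capable of generating high-fidelity images while also producing strong log-likelihood bounds.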
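
The abstract says JetFormer uses a normalizing flow so that the likelihood of raw data can be maximized directly. The identity that makes this possible is the change-of-variables formula, log p(x) = log p_z(f(x)) + log|det df/dx|. The sketch below illustrates it on a toy problem only: a scalar affine flow with a fixed standard-normal base density. The function name, the affine form, and the Gaussian base are assumptions for illustration; in the paper the flow is a learned image model and the latents are modeled by the autoregressive transformer.

```python
import math


def affine_flow_log_likelihood(x, log_scale, shift):
    """Exact log-density of scalar x under a toy affine normalizing flow.

    Hypothetical illustration, not JetFormer's architecture.
    """
    # Forward map of the flow: z = f(x) = x * exp(log_scale) + shift.
    z = x * math.exp(log_scale) + shift
    # Log-density of z under the standard-normal base distribution.
    base_log_prob = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    # For a scalar affine map, log|df/dx| is simply log_scale.
    return base_log_prob + log_scale


# With log_scale = log 2 and shift = 0, x is distributed as N(0, 0.5**2).
print(affine_flow_log_likelihood(0.0, math.log(2.0), 0.0))
```

Because the flow is invertible, the same map can be run backwards to turn sampled latents into data, which matches the encoder/decoder dual role the abstract describes.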

Title: A Note on Small Percolating Sets on Hypercubes via Generative AI

Authors: Gergely Bérczi, Adam Zsolt Wagner
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2411.19734
Pdf URL: https://arxiv.org/pdf/2411.19734
Copy Paste: [[2411.19734]] A Note on Small Percolating Sets on Hypercubes via Generative AI(https://arxiv.org/abs/2411.19734)
Keywords: generative
Abstract: We apply a generative AI pattern-recognition technique called PatternBoost to study bootstrap percolation on hypercubes. With this, we slightly improve the best existing upper bound for the size of percolating subsets of the hypercube.
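Summary: We apply a generative AI pattern-recognition technique called PatternBoost to study bootstrap percolation on hypercubes. With it, we slightly improve the best existing upper bound on the size of percolating subsets of the hypercube.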
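
For readers unfamiliar with the underlying problem, here is a minimal sketch of how one can check whether a seed set percolates on a hypercube. It assumes the standard r-neighbor bootstrap rule (an uninfected vertex becomes infected once at least r of its neighbors are infected); the abstract does not state the threshold, so `r` is a parameter here, and the function name is hypothetical. This is a brute-force checker, not the PatternBoost method.

```python
def percolates(n, seed, r=2):
    """Return True if `seed` eventually infects all of the n-dimensional hypercube.

    Vertices are integers 0 .. 2**n - 1; two vertices are adjacent when
    their binary representations differ in exactly one bit.
    """
    infected = set(seed)
    changed = True
    while changed:
        changed = False
        for v in range(1 << n):
            if v in infected:
                continue
            # Count infected neighbors by flipping each of the n bits in turn.
            count = sum((v ^ (1 << i)) in infected for i in range(n))
            if count >= r:
                infected.add(v)
                changed = True
    return len(infected) == 1 << n


# On the 3-cube with r = 2, two seeds get stuck on a face; three can suffice.
print(percolates(3, {0b000, 0b011}))
print(percolates(3, {0b000, 0b011, 0b101}))
```

Infection is monotone, so the order in which vertices are scanned does not change the final infected set. The check costs time exponential in n, which is why searching for small percolating sets in high dimensions calls for smarter techniques.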

Title: MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks

Authors: Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19786
Pdf URL: https://arxiv.org/pdf/2411.19786
Copy Paste: [[2411.19786]] MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks(https://arxiv.org/abs/2411.19786)
Keywords: generation, generative
Abstract: Recently, human motion analysis has experienced great improvement due to inspiring generative models such as the denoising diffusion model and large language model. However, existing approaches mainly focus on generating motions from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that can handle diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe enables us to handle paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained to extract latent embeddings and subsequently reconstruct the motion sequences and textual descriptions from the extracted embeddings, respectively. MTDM, on the other hand, performs an iterative denoising process on the input context to handle diverse tasks. Experimental results on the benchmark datasets demonstrate the superior performance of our proposed method on text-to-motion generation and competitive performance on motion captioning.
Summary: Human motion analysis has recently made great strides thanks to inspiring generative models such as denoising diffusion models and large language models. Existing approaches, however, mainly focus on generating motion from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that handles diverse tasks by simultaneously learning the marginal, conditional, and joint distributions of motion and text. By simply modifying the input context, MoTe can handle paired text-motion generation, motion captioning, and text-driven motion generation. Specifically, MoTe consists of three components: the Motion Encoder-Decoder (MED), the Text Encoder-Decoder (TED), and the Motion-Text Diffusion Model (MTDM). MED and TED are trained to extract latent embeddings and then to reconstruct the motion sequences and textual descriptions, respectively, from those embeddings. MTDM, in turn, performs an iterative denoising process on the input context to handle the different tasks. Experimental results on benchmark datasets demonstrate the superior performance of the proposed method on text-to-motion generation and competitive performance on motion captioning.

Title: $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields

Authors: Prajwal Singh, Ashish Tiwari, Gautam Vashishtha, Shanmuganathan Raman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19903
Pdf URL: https://arxiv.org/pdf/2411.19903
Copy Paste: [[2411.19903]] $C^{3}$-NeRF: Modeling Multiple Scenes via Conditional-cum-Continual Neural Radiance Fields(https://arxiv.org/abs/2411.19903)
Keywords: generative
Abstract: Neural radiance fields (NeRF) have exhibited highly photorealistic rendering of novel views through per-scene optimization over a single 3D scene. With the growing popularity of NeRF and its variants, they have become ubiquitous and have been identified as efficient 3D resources. However, they are still far from being scalable since a separate model needs to be stored for each scene, and the training time increases linearly with every newly added scene. Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model is heavily under-explored. In this work, we propose a novel conditional-cum-continual framework, called $C^{3}$-NeRF, to accommodate multiple scenes into the parameters of a single neural radiance field. Unlike conventional approaches that leverage feature extractors and pre-trained priors for scene conditioning, we use simple pseudo-scene labels to model multiple scenes in NeRF. Interestingly, we observe the framework is also inherently continual (via generative replay) with minimal, if not no, forgetting of the previously learned scenes. Consequently, the proposed framework adapts to multiple new scenes without necessarily accessing the old data. Through extensive qualitative and quantitative evaluation using synthetic and real datasets, we demonstrate the inherent capacity of the NeRF model to accommodate multiple scenes with high-quality novel-view renderings without adding additional parameters. We provide implementation details and dynamic visualizations of our results in the supplementary file.
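Summary: Neural radiance fields (NeRF) deliver highly photorealistic novel-view rendering through per-scene optimization over a single 3D scene. With the growing popularity of NeRF and its variants, they have become ubiquitous and are regarded as efficient 3D resources. However, they are still far from scalable, since a separate model must be stored for each scene and training time grows linearly with every newly added scene. Surprisingly, the idea of encoding multiple 3D scenes into a single NeRF model remains heavily under-explored. In this work, we propose a novel conditional-cum-continual framework, called $C^{3}$-NeRF, that accommodates multiple scenes in the parameters of a single neural radiance field. Unlike conventional approaches that rely on feature extractors and pretrained priors for scene conditioning, we use simple pseudo-scene labels to model multiple scenes in NeRF. Interestingly, we observe that the framework is also inherently continual (via generative replay), with almost no forgetting of previously learned scenes. The proposed framework can therefore adapt to multiple new scenes without necessarily accessing the old data. Through extensive qualitative and quantitative evaluation on synthetic and real datasets, we demonstrate the inherent capacity of the NeRF model to accommodate multiple scenes with high-quality novel-view renderings without adding parameters. We provide implementation details and dynamic visualizations of our results in the supplementary file.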

Title: SIMS: Simulating Human-Scene Interactions with Real World Script Planning

Authors: Wenjia Wang, Liang Pan, Zhiyang Dou, Zhouyingcheng Liao, Yuke Lou, Lei Yang, Jingbo Wang, Taku Komura
Subjects: cs.CV, cs.AI, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2411.19921
Pdf URL: https://arxiv.org/pdf/2411.19921
Copy Paste: [[2411.19921]] SIMS: Simulating Human-Scene Interactions with Real World Script Planning(https://arxiv.org/abs/2411.19921)
Keywords: generation
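Abstract: Simulating long-term human-scene interaction is a challenging yet fascinating task. Previous works have not effectively addressed the generation of long-term human-scene interactions with detailed narratives for physics-based animation. This paper introduces a novel framework for the planning and controlling of long-horizon, physically plausible human-scene interaction. On the one hand, films and shows with stylish human locomotion or interactions with scenes are abundantly available on the internet, providing a rich source of data for script planning. On the other hand, Large Language Models (LLMs) can understand and generate logical storylines. This motivates us to marry the two by using an LLM-based pipeline to extract scripts from videos, and then employing LLMs to imitate and create new scripts, capturing complex, time-series human behaviors and interactions with environments. By leveraging this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding to guide character motions within contextual and spatial constraints. To facilitate training and evaluation, we contribute a comprehensive planning dataset containing diverse motion sequences extracted from real-world videos and expand it with large language models. We also collect and re-annotate motion clips from existing kinematic datasets to enable our policy to learn diverse skills. Extensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalization ability to various scenarios, showing remarkably enhanced performance compared with existing methods. Our code and data will be publicly available soon.
Summary: Simulating long-term human-scene interaction is a challenging yet fascinating task. Previous work has not effectively addressed the generation of long-term human-scene interactions with detailed narratives for physics-based animation. This paper introduces a novel framework for planning and controlling long-horizon, physically plausible human-scene interaction. On the one hand, films and shows featuring stylish human locomotion or interactions with scenes are abundant on the internet and provide a rich data source for script planning. On the other hand, large language models (LLMs) can understand and generate logical storylines. This motivates us to combine the two: we use an LLM-based pipeline to extract scripts from videos, and then use LLMs to imitate them and create new scripts that capture complex, time-series human behaviors and interactions with the environment. Building on this, we use a dual-aware policy that achieves both language comprehension and scene understanding to guide character motion within contextual and spatial constraints. To facilitate training and evaluation, we contribute a comprehensive planning dataset containing diverse motion sequences extracted from real-world videos and expand it with large language models. We also collect and re-annotate motion clips from existing kinematic datasets so that our policy can learn diverse skills. Extensive experiments demonstrate the effectiveness of our framework in versatile task execution and its ability to generalize to various scenarios, with markedly better performance than existing methods. Our code and data will be made publicly available soon.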

Title: Free-form Generation Enhances Challenging Clothed Human Modeling

Authors: Hang Ye, Xiaoxuan Ma, Hai Ci, Wentao Zhu, Yizhou Wang
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.19942
Pdf URL: https://arxiv.org/pdf/2411.19942
Copy Paste: [[2411.19942]] Free-form Generation Enhances Challenging Clothed Human Modeling(https://arxiv.org/abs/2411.19942)
Keywords: generation
Abstract: Achieving realistic animated human avatars requires accurate modeling of pose-dependent clothing deformations. Existing learning-based methods heavily rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like SMPL to model deformation. However, these methods struggle to handle loose clothing, such as long dresses, where the canonicalization process becomes ill-defined when the clothing is far from the body, leading to disjointed and fragmented results. To overcome this limitation, we propose a novel hybrid framework to model challenging clothed humans. Our core idea is to use dedicated strategies to model different regions, depending on whether they are close to or distant from the body. Specifically, we segment the human body into three categories: unclothed, deformed, and generated. We simply replicate unclothed regions that require no deformation. For deformed regions close to the body, we leverage LBS to handle the deformation. As for the generated regions, which correspond to loose clothing areas, we introduce a novel free-form, part-aware generator to model them, as they are less affected by movements. This free-form generation paradigm brings enhanced flexibility and expressiveness to our hybrid framework, enabling it to capture the intricate geometric details of challenging loose clothing, such as skirts and dresses. Experimental results on the benchmark dataset featuring loose clothing demonstrate that our method achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases.
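Summary: Creating realistic animated human avatars requires accurate modeling of pose-dependent clothing deformation. Existing learning-based methods rely heavily on the Linear Blend Skinning (LBS) of minimally clothed human models such as SMPL to model deformation. However, these methods struggle with loose clothing such as long dresses: when the clothing is far from the body, the canonicalization process becomes ill-defined, leading to disjointed and fragmented results. To overcome this limitation, we propose a novel hybrid framework for modeling challenging clothed humans. Our core idea is to use dedicated strategies for different regions, depending on whether they are close to or far from the body. Specifically, we segment the human body into three categories: unclothed, deformed, and generated. We simply replicate unclothed regions, which require no deformation. For deformed regions close to the body, we use LBS to handle the deformation. For the generated regions, which correspond to loose clothing areas, we introduce a novel free-form, part-aware generator to model them, since they are less affected by movement. This free-form generation paradigm gives our hybrid framework greater flexibility and expressiveness, enabling it to capture the intricate geometric details of challenging loose clothing such as skirts and dresses. Experimental results on a benchmark dataset featuring loose clothing show that our method achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases.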
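
Linear Blend Skinning, which the abstract uses for the near-body "deformed" regions, poses each vertex as a weighted sum of per-joint rigid transforms: v' = Σ_j w_j (R_j v + t_j). The following is a minimal pure-Python sketch of that textbook formula, assuming each transform is given as a 3x4 matrix [R | t] and that each vertex's weights sum to 1. The function name and data layout are illustrative assumptions, not the paper's or SMPL's implementation.

```python
def linear_blend_skinning(vertices, weights, transforms):
    """Pose vertices with the standard LBS formula.

    vertices:   list of [x, y, z] rest-pose positions
    weights:    list of per-vertex skinning weights, one weight per joint
    transforms: list of per-joint 3x4 matrices [R | t] (rows of 4 floats)
    """
    posed = []
    for (x, y, z), w in zip(vertices, weights):
        out = [0.0, 0.0, 0.0]
        for w_j, T in zip(w, transforms):
            # Apply joint j's rigid transform and accumulate its weighted share.
            for a in range(3):
                out[a] += w_j * (T[a][0] * x + T[a][1] * y + T[a][2] * z + T[a][3])
        posed.append(out)
    return posed


# Two joints: identity, and a translation by 2 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]
print(linear_blend_skinning([[1.0, 0.0, 0.0]], [[0.5, 0.5]], [identity, shift_x]))
```

The formula presumes every vertex can be tied to nearby joints with meaningful weights. The abstract locates the failure for loose clothing in exactly this assumption, which is why the far-from-body regions are handled by a generator rather than by LBS.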