2024-12-20

Title: Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing

Authors: Le-Anh Tran, Dong-Chul Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14220
Pdf URL: https://arxiv.org/pdf/2412.14220
Copy Paste: [[2412.14220]] Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing(https://arxiv.org/abs/2412.14220)
Keywords: restoration, generative
Abstract: This paper proposes a lightweight neural network designed for realistic image dehazing, utilizing a Distilled Pooling Transformer Encoder, named DPTE-Net. Recently, while vision transformers (ViTs) have achieved great success in various vision tasks, their self-attention (SA) module's complexity scales quadratically with image resolution, hindering their applicability on resource-constrained devices. To overcome this, the proposed DPTE-Net substitutes traditional SA modules with efficient pooling mechanisms, significantly reducing computational demands while preserving ViTs' learning capabilities. To further enhance semantic feature learning, a distillation-based training process is implemented which transfers rich knowledge from a larger teacher network to DPTE-Net. Additionally, DPTE-Net is trained within a generative adversarial network (GAN) framework, leveraging the strong generalization of GAN in image restoration, and employs a transmission-aware loss function to dynamically adapt to varying haze densities. Experimental results on various benchmark datasets have shown that the proposed DPTE-Net can achieve competitive dehazing performance when compared to state-of-the-art methods while maintaining low computational complexity, making it a promising solution for resource-limited applications. The code of this work is available at this https URL.
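The pooling idea in this abstract can be illustrated with a minimal sketch. It assumes the token mixer is a plain average pool over a local window, in the spirit of PoolFormer-style mixers; the abstract does not state DPTE-Net's exact operator, and `pool_mix` and `window` are hypothetical names chosen for illustration.

```python
def pool_mix(tokens, window=3):
    """Mix each token with its neighbours by average pooling.

    tokens: list of equal-length feature vectors (lists of floats).
    For a fixed window the cost grows linearly with the number of tokens,
    whereas self-attention compares every token with every other token.
    """
    n = len(tokens)
    half = window // 2
    mixed = []
    for i in range(n):
        # Clamp the window at the sequence borders.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neighbours = tokens[lo:hi]
        dim = len(tokens[i])
        mixed.append(
            [sum(t[d] for t in neighbours) / len(neighbours) for d in range(dim)]
        )
    return mixed
```

Calling `pool_mix([[0.0], [3.0], [6.0]])` averages each one-dimensional token with its immediate neighbours.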

Title: PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation

Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, Di Niu
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14283
Pdf URL: https://arxiv.org/pdf/2412.14283
Copy Paste: [[2412.14283]] PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation(https://arxiv.org/abs/2412.14283)
Keywords: generation
Abstract: Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify object position, size, and composition, etc., while preserving the consistency of objects and background without changing their texture and attributes. Current inference-time methods often rely on DDIM inversion, which inherently compromises efficiency and the achievable consistency of edited images. Recent methods also utilize energy guidance which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for achieving consistent object editing via Pixel Manipulation and generation, where we directly create a duplicate copy of the source object at target location in the pixel space, and introduce an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location, while ensuring image consistency by anchoring the edited image to be generated to the pixel-manipulated image as well as by introducing various consistency-preserving optimization techniques during inference. Experimental evaluations based on benchmark datasets as well as extensive visual comparisons show that in as few as 16 inference steps, PixelMan outperforms a range of state-of-the-art training-based and training-free methods (usually requiring 50 steps) on multiple consistent object editing tasks.

Title: Personalized Generative Low-light Image Denoising and Enhancement

Authors: Xijun Wang, Prateek Chennuri, Yu Yuan, Bole Ma, Xingguang Zhang, Stanley Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14327
Pdf URL: https://arxiv.org/pdf/2412.14327
Copy Paste: [[2412.14327]] Personalized Generative Low-light Image Denoising and Enhancement(https://arxiv.org/abs/2412.14327)
Keywords: restoration, generation, generative
Abstract: While smartphone cameras today can produce astonishingly good photos, their performance in low light is still not completely satisfactory because of the fundamental limits in photon shot noise and sensor read noise. Generative image restoration methods have demonstrated promising results compared to traditional methods, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Recognizing the availability of personalized photo galleries on users' smartphones, we propose Personalized Generative Denoising (PGD) by building a diffusion model customized for different users. Our core innovation is an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer provides a strong prior that can be integrated with the diffusion model to restore the degraded images, without the need of fine-tuning. Over a wide range of low-light testing scenarios, we show that PGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches.

Title: Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Authors: Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14333
Pdf URL: https://arxiv.org/pdf/2412.14333
Copy Paste: [[2412.14333]] Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters(https://arxiv.org/abs/2412.14333)
Keywords: generation
Abstract: Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address the challenges, in this paper, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

Title: A Unifying Information-theoretic Perspective on Evaluating Generative Models

Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.14340
Pdf URL: https://arxiv.org/pdf/2412.14340
Copy Paste: [[2412.14340]] A Unifying Information-theoretic Perspective on Evaluating Generative Models(https://arxiv.org/abs/2412.14340)
Keywords: generative
Abstract: Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.
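As a rough illustration of the kNN-based "precision" that this abstract refers to, the sketch below counts a generated sample as realistic when it falls inside the k-th-nearest-neighbour ball of at least one real sample. This is a commonly used estimate, not the paper's PCE, RCE, or RE; whether this exact variant is among the metrics the paper unifies is an assumption, and the function names are hypothetical.

```python
import math


def knn_radius(point, others, k):
    """Distance from `point` to its k-th nearest neighbour among `others`."""
    dists = sorted(math.dist(point, o) for o in others)
    return dists[k - 1]


def knn_precision(real, generated, k=1):
    """Fraction of generated points covered by some real point's kNN ball."""
    # Radius of each real point, measured against the other real points.
    radii = [
        knn_radius(r, real[:i] + real[i + 1:], k) for i, r in enumerate(real)
    ]
    hits = 0
    for g in generated:
        if any(math.dist(g, r) <= rad for r, rad in zip(real, radii)):
            hits += 1
    return hits / len(generated)
```

Swapping the roles of `real` and `generated` gives the matching recall-style estimate under the same assumptions.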

Title: Surrealistic-like Image Generation with Vision-Language Models

Authors: Elif Ayten, Shuai Wang, Hjalmar Snoep
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14366
Pdf URL: https://arxiv.org/pdf/2412.14366
Copy Paste: [[2412.14366]] Surrealistic-like Image Generation with Vision-Language Models(https://arxiv.org/abs/2412.14366)
Keywords: generation, generative
Abstract: Recent advances in generative AI make it convenient to create different types of content, including text, images, and code. In this paper, we explore the generation of images in the style of paintings in the surrealism movement using vision-language generative models, including DALL-E, Deep Dream Generator, and DreamStudio. Our investigation starts with the generation of images under various image generation settings and different models. The primary objective is to identify the most suitable model and settings for producing such images. Additionally, we aim to understand the impact of using edited base images on the generated images. Through these experiments, we evaluate the performance of selected models and gain valuable insights into their capabilities in generating such images. Our analysis shows that DALL-E 2 performs best when using the prompt generated by ChatGPT.

Title: Enhancing Diffusion Models for High-Quality Image Generation

Authors: Jaineet Shah, Michael Gromis, Rickston Pinto
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14422
Pdf URL: https://arxiv.org/pdf/2412.14422
Copy Paste: [[2412.14422]] Enhancing Diffusion Models for High-Quality Image Generation(https://arxiv.org/abs/2412.14422)
Keywords: generation, generative
Abstract: This report presents the comprehensive implementation, evaluation, and optimization of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs), which are state-of-the-art generative models. During inference, these models take random noise as input and iteratively generate high-quality images as output. The study focuses on enhancing their generative capabilities by incorporating advanced techniques such as Classifier-Free Guidance (CFG), Latent Diffusion Models with Variational Autoencoders (VAE), and alternative noise scheduling strategies. The motivation behind this work is the growing demand for efficient and scalable generative AI models that can produce realistic images across diverse datasets, addressing challenges in applications such as art creation, image synthesis, and data augmentation. Evaluations were conducted on datasets including CIFAR-10 and ImageNet-100, with a focus on improving inference speed, computational efficiency, and image quality metrics like Frechet Inception Distance (FID). Results demonstrate that DDIM + CFG achieves faster inference and superior image quality. Challenges with VAE and noise scheduling are also highlighted, suggesting opportunities for future optimization. This work lays the groundwork for developing scalable, efficient, and high-quality generative AI systems to benefit industries ranging from entertainment to robotics.
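The Classifier-Free Guidance (CFG) step mentioned in this abstract is often implemented as a linear extrapolation from the unconditional noise prediction toward the conditional one. The sketch below shows that commonly used form; the report's exact parameterization is not given in the abstract, so `cfg_combine` and `scale` are illustrative names and the formula is an assumption about the implementation.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Blend two noise predictions with classifier-free guidance.

    eps_uncond, eps_cond: equal-length lists of floats (flattened noise).
    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In a sampler, the combined prediction would replace the single model output at each denoising step, at the cost of two forward passes per step.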

Title: IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

Authors: Anand Kumar, Jiteng Mu, Nuno Vasconcelos
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14432
Pdf URL: https://arxiv.org/pdf/2412.14432
Copy Paste: [[2412.14432]] IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features(https://arxiv.org/abs/2412.14432)
Keywords: generation
Abstract: Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns regarding data privacy and copyright infringement among artists. Consequently, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as introspective style attribution (IntroStyle) and demonstrates superior performance to state-of-the-art models for style retrieval. We also introduce a synthetic dataset of Style Hacks (SHacks) to isolate artistic style and evaluate fine-grained style attribution performance.

Title: GenHMR: Generative Human Mesh Recovery

Authors: Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14444
Pdf URL: https://arxiv.org/pdf/2412.14444
Copy Paste: [[2412.14444]] GenHMR: Generative Human Mesh Recovery(https://arxiv.org/abs/2412.14444)
Keywords: generative
Abstract: Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. Project website can be found at this https URL.
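The iterative decoding of high-confidence tokens described in this abstract can be sketched as a confidence-ranked unmasking loop. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for the image-conditional masked transformer, the per-step schedule (an even split of the remaining masked positions) is invented for the example, and proposals are taken as given rather than sampled.

```python
import math


def iterative_decode(predict, length, steps):
    """Fill a fully masked token sequence over a fixed number of steps.

    predict(tokens) must return one (token, confidence) pair per position;
    masked positions are represented by None.
    """
    tokens = [None] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        proposals = predict(tokens)
        # Commit an even share of the still-masked positions this step.
        keep = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens
```

Committing only the most confident positions first lets later steps condition on tokens that are already fixed.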

Title: Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

Authors: Shengqi Liu, Yuhao Cheng, Zhuo Chen, Xingyu Ren, Wenhan Zhu, Lincheng Li, Mengxiao Bi, Xiaokang Yang, Yichao Yan
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14453
Pdf URL: https://arxiv.org/pdf/2412.14453
Copy Paste: [[2412.14453]] Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation(https://arxiv.org/abs/2412.14453)
Keywords: generation, generative
Abstract: Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability. Our project page: this https URL.

Title: LEDiff: Latent Exposure Diffusion for HDR Generation

Authors: Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14456
Pdf URL: https://arxiv.org/pdf/2412.14456
Copy Paste: [[2412.14456]] LEDiff: Latent Exposure Diffusion for HDR Generation(https://arxiv.org/abs/2412.14456)
Keywords: generation, generative
Abstract: While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that equips a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.
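For readers unfamiliar with the image-space exposure fusion that this abstract cites as inspiration, here is a minimal per-pixel sketch. It assumes intensities in [0, 1] and uses only a Gaussian "well-exposedness" weight centred on mid-gray; classical fusion methods combine further cues, and LEDiff itself fuses in latent space, which this example does not attempt. All names and the default `sigma` are illustrative.

```python
import math


def well_exposedness(value, sigma=0.2):
    """Weight that peaks at mid-gray (0.5) and decays toward 0 and 1."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))


def fuse_exposures(exposures, sigma=0.2):
    """Blend several exposures of the same scene, pixel by pixel.

    exposures: list of equal-length lists of intensities in [0, 1].
    Each output pixel is the weighted mean of the corresponding input
    pixels, so clipped values contribute less than well-exposed ones.
    """
    fused = []
    for pixel_values in zip(*exposures):
        weights = [well_exposedness(v, sigma) for v in pixel_values]
        total = sum(weights)
        fused.append(sum(w * v for w, v in zip(weights, pixel_values)) / total)
    return fused
```

Given a dark and a bright exposure, the fused pixel leans toward whichever exposure is closer to mid-gray at that location.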

Title: LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations

Authors: Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.14464
Pdf URL: https://arxiv.org/pdf/2412.14464
Copy Paste: [[2412.14464]] LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations(https://arxiv.org/abs/2412.14464)
Keywords: generation
Abstract: We propose a new view synthesis method via synthesizing a 3D neural field from single- or few-view input images. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that involves a reconstruction model and a diffusion model for view synthesis. Our reconstruction model first lifts one or more input images to 3D space, using a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the rendered images from tri-planes. We then introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion model to gradually synthesize novel views, boosting the overall quality of the 3D representations and their rendering. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and the large-scale Objaverse dataset while achieving both sampling efficacy and multi-view consistency.

Title: DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

Authors: Wengyi Zhan, Mingbao Lin, Shuicheng Yan, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14465
Pdf URL: https://arxiv.org/pdf/2412.14465
Copy Paste: [[2412.14465]] DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On(https://arxiv.org/abs/2412.14465)
Keywords: generation
Abstract: We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model. This initial foray into the application of untrained diffusion models in virtual try-on technology potentially paves the way for further exploration and refinement in this industrially and academically valuable field.

Title: DirectorLLM for Human-Centric Video Generation

Authors: Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14484
Pdf URL: https://arxiv.org/pdf/2412.14484
Copy Paste: [[2412.14484]] DirectorLLM for Human-Centric Video Generation(https://arxiv.org/abs/2412.14484)
Keywords: generation
Abstract: In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.

Title: Content-style disentangled representation for controllable artistic image stylization and generation

Authors: Ma Zhuoqi, Zhang Yixuan, You Zejun, Tian Long, Liu Xiyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14496
Pdf URL: https://arxiv.org/pdf/2412.14496
Copy Paste: [[2412.14496]] Content-style disentangled representation for controllable artistic image stylization and generation(https://arxiv.org/abs/2412.14496)
Keywords: generation
Abstract: Controllable artistic image stylization and generation aims to render the content provided by text or image with the learned artistic style, where content and style decoupling is the key to achieving satisfactory results. However, current methods for content and style disentanglement primarily rely on image information for supervision, which leads to two problems: 1) models can only support one modality for style or content input; 2) incomplete disentanglement results in semantic interference from the reference image. To address the above issues, this paper proposes a content-style representation disentangling method for controllable artistic image stylization and generation. We construct a WikiStyle+ dataset consisting of artworks with corresponding textual descriptions for style and content. Based on the multimodal dataset, we propose a diffusion model guided by disentangled content and style representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers for better controllable stylization. This approach allows the model to accommodate inputs from different modalities. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling a harmonious integration of content and style in the generated outputs, successfully producing style-consistent and expressive stylized images.

Title: Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Authors: Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14531
Pdf URL: https://arxiv.org/pdf/2412.14531
Copy Paste: [[2412.14531]] Consistent Human Image and Video Generation with Spatially Conditioned Diffusion(https://arxiv.org/abs/2412.14531)
Keywords: generation
Abstract: Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
摘要：以人为中心的一致图像和视频合成旨在生成具有新姿势的图像或视频，同时保持与给定参考图像的外观一致性，这对于低成本视觉内容创建至关重要。基于扩散模型的最新进展通常依赖于单独的网络进行参考外观特征提取和目标视觉生成，导致参考和目标之间的域间隙不一致。在本文中，我们将任务定义为空间条件修复问题，其中修复目标图像以保持与参考的外观一致性。这种方法使参考特征能够在统一的去噪网络中指导符合姿势的目标的生成，从而减轻域间隙。此外，为了更好地维护参考外观信息，我们实施了一个因果特征交互框架，其中参考特征只能从自身查询，而目标特征可以从参考和目标查询外观信息。为了进一步提高计算效率和灵活性，在实际实施中，我们将空间条件生成过程分解为两个阶段：参考外观提取和条件目标生成。两个阶段共享一个去噪网络，交互仅限于自注意力层。所提出的方法确保灵活控制生成的人体图像和视频的外观。通过对人体视频数据上的现有基础扩散模型进行微调，我们的方法对未见过的人体身份和姿势表现出很强的泛化能力，而无需对每个实例进行额外的微调。实验结果验证了我们方法的有效性，与现有的人体图像和视频合成方法相比，我们的方法表现出了竞争力。
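
The causal feature interaction described in this abstract can be pictured as an attention mask: reference tokens query only reference tokens, while target tokens query both. The sketch below is a minimal illustration under assumed conventions (reference tokens placed first, a boolean mask where True means "may attend"); the function name and token layout are hypothetical and not taken from the paper.

```python
def causal_reference_mask(num_ref, num_tgt):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout (hypothetical): tokens 0..num_ref-1 are reference tokens,
    the remaining num_tgt tokens are target tokens.
    """
    total = num_ref + num_tgt
    mask = []
    for q in range(total):
        q_is_ref = q < num_ref
        row = []
        for k in range(total):
            k_is_ref = k < num_ref
            # Reference queries see only reference keys;
            # target queries see both reference and target keys.
            row.append(k_is_ref if q_is_ref else True)
        mask.append(row)
    return mask


mask = causal_reference_mask(num_ref=2, num_tgt=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an actual self-attention layer, a mask of this shape would typically be applied to the attention logits before the softmax, so reference features stay unaffected by the target being generated.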

Title: DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching

Authors: Xiaofei Huang, Wenting Chen, Jie Liu, Qisheng Lu, Xiaoling Luo, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14535
Pdf URL: https://arxiv.org/pdf/2412.14535
Copy Paste: [[2412.14535]] DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching(https://arxiv.org/abs/2412.14535)
Keywords: generation
Abstract: Medical report generation is crucial for clinical diagnosis and patient management, summarizing diagnoses and recommendations based on medical imaging. However, existing work often overlooks the clinical pipeline involved in report writing, where physicians typically conduct an initial quick review followed by a detailed examination. Moreover, current alignment methods may lead to misaligned relationships. To address these issues, we propose DAMPER, a dual-stage framework for medical report generation that mimics the clinical pipeline of report writing. The first stage, MeSH-Guided Coarse-Grained Alignment (MCG), aligns chest X-ray (CXR) image features with medical subject headings (MeSH) features to generate a rough keyphrase representation of the overall impression. The second stage, Hypergraph-Enhanced Fine-Grained Alignment (HFG), constructs hypergraphs for image patches and report annotations, modeling high-order relationships within each modality and performing hypergraph matching to capture semantic correlations between image regions and textual phrases. Finally, the coarse-grained visual features, generated MeSH representations, and visual hypergraph features are fed into a report decoder to produce the final medical report. Extensive experiments on public datasets demonstrate the effectiveness of DAMPER in generating comprehensive and accurate medical reports, outperforming state-of-the-art methods across various evaluation metrics.
摘要：医疗报告生成对于临床诊断和患者管理至关重要，它根据医学影像总结诊断和建议。然而，现有的工作往往忽视了报告撰写所涉及的临床流程，医生通常会进行初步的快速审查，然后进行详细检查。此外，当前的对齐方法可能会导致关系不一致。为了解决这些问题，我们提出了 DAMPER，这是一个用于医疗报告生成的双阶段框架，它分两个阶段模拟报告撰写的临床流程。在第一阶段，MeSH 引导的粗粒度对齐 (MCG) 阶段将胸部 X 光 (CXR) 图像特征与医学主题词 (MeSH) 特征对齐，以生成整体印象的粗略关键短语表示。在第二阶段，超图增强细粒度对齐 (HFG) 阶段为图像块和报告注释构建超图，对每个模态内的高阶关系进行建模，并执行超图匹配以捕获图像区域和文本短语之间的语义相关性。最后，将粗粒度的视觉特征、生成的 MeSH 表示和视觉超图特征输入报告解码器以生成最终的医疗报告。在公共数据集上进行的大量实验证明了 DAMPER 在生成全面而准确的医疗报告方面的有效性，并且在各种评估指标上均优于最先进的方法。

Title: Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images

Authors: Min Wang, Xin Huang, Guoqing Zhou, Qifeng Guo, Qing Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.14547
Pdf URL: https://arxiv.org/pdf/2412.14547
Copy Paste: [[2412.14547]] Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images(https://arxiv.org/abs/2412.14547)
Keywords: restoration
Abstract: Neural Radiance Fields (NeRFs) have demonstrated prominent performance in novel view synthesis. However, their input heavily relies on image acquisition under normal light conditions, making it challenging to learn accurate scene representation in low-light environments where images typically exhibit significant noise and severe color distortion. To address these challenges, we propose a novel approach, Bright-NeRF, which learns enhanced and high-quality radiance fields from multi-view low-light raw images in an unsupervised manner. Our method simultaneously achieves color restoration, denoising, and enhanced novel view synthesis. Specifically, we leverage a physically-inspired model of the sensor's response to illumination and introduce a chromatic adaptation loss to constrain the learning of response, enabling consistent color perception of objects regardless of lighting conditions. We further utilize the raw data's properties to expose the scene's intensity automatically. Additionally, we have collected a multi-view low-light raw image dataset to advance research in this field. Experimental results demonstrate that our proposed method significantly outperforms existing 2D and 3D approaches. Our code and dataset will be made publicly available.
摘要：神经辐射场 (NeRF) 在新视图合成中表现出色。然而，它们的输入严重依赖于正常光照条件下的图像采集，这使得在低光环境中学习准确的场景表示具有挑战性，因为图像通常会出现明显的噪声和严重的颜色失真。为了应对这些挑战，我们提出了一种新方法 Bright-NeRF，它以无监督的方式从多视图低光原始图像中学习增强的高质量辐射场。我们的方法同时实现了色彩恢复、去噪和增强的新视图合成。具体来说，我们利用物理启发的传感器对照明的响应模型，并引入色度适应损失来限制响应的学习，从而实现无论光照条件如何都能对物体产生一致的颜色感知。我们进一步利用原始数据的属性来自动显示场景的强度。此外，我们还收集了一个多视图低光原始图像数据集，以推进该领域的研究。实验结果表明，我们提出的方法明显优于现有的 2D 和 3D 方法。我们的代码和数据集将公开。

Title: ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

Authors: Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14559
Pdf URL: https://arxiv.org/pdf/2412.14559
Copy Paste: [[2412.14559]] ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model(https://arxiv.org/abs/2412.14559)
Keywords: generation
Abstract: The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm the power law between Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens with respect to compute budgets respectively. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss, thereby validating the scaling law.
摘要：缩放定律已在自然语言处理 (NLP) 和大规模计算机视觉任务等各个领域得到验证；然而，它在运动生成中的应用仍未得到广泛探索。在本文中，我们介绍了一个可扩展的运动生成框架，其中包括运动标记器 Motion FSQ-VAE 和文本前缀自回归变换器。通过全面的实验，我们观察了该系统的缩放行为。我们首次在运动生成的背景下确认了缩放定律的存在。具体而言，我们的结果表明，我们的前缀自回归模型的归一化测试损失遵循与计算预算相关的对数定律。此外，我们还分别确认了非词汇参数、词汇参数和数据标记相对于计算预算的幂律。利用缩放定律，我们预测了计算预算为 $1e18$ 时的最佳变换器大小、词汇大小和数据要求。当使用最佳模型大小、词汇量和所需数据进行训练时，系统的测试损失与预测的测试损失完全一致，从而验证了缩放规律。
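
A power law between a quantity and the compute budget, as mentioned in this abstract, is commonly fitted by linear regression in log-log space and then extrapolated to a larger budget. The sketch below shows that generic procedure only. The data points are synthetic (generated from y = 2 * x**0.5) and the function name is hypothetical; none of the numbers come from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    # Intercept in log space gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, made-up measurements for illustration only.
budgets = [1e15, 1e16, 1e17]
sizes = [2.0 * c ** 0.5 for c in budgets]

a, b = fit_power_law(budgets, sizes)
predicted = a * 1e18 ** b  # extrapolate the fitted law to a budget of 1e18
print(f"a={a:.3f}, b={b:.3f}, predicted={predicted:.3e}")
```

Fitting in log space keeps the regression linear; with real, noisy measurements the fitted exponent would come with uncertainty that this sketch does not estimate.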

Title: DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Authors: Yiren Song, Xiaokang Liu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14580
Pdf URL: https://arxiv.org/pdf/2412.14580
Copy Paste: [[2412.14580]] DiffSim: Taming Diffusion Models for Evaluating Visual Similarity(https://arxiv.org/abs/2412.14580)
Keywords: generation, generative
Abstract: Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.
摘要：扩散模型从根本上改变了生成模型领域，使得评估定制模型输出与参考输入之间的相似性变得至关重要。然而，传统的感知相似性度量主要在像素和块级别运行，比较低级颜色和纹理，但无法捕捉图像布局、对象姿势和语义内容的中级相似性和差异。基于对比学习的 CLIP 和基于自监督学习的 DINO 通常用于测量语义相似性，但它们高度压缩了图像特征，无法充分评估外观细节。本文首次发现预训练扩散模型可用于测量视觉相似性，并引入了 DiffSim 方法，解决了传统度量在捕获自定义生成任务中的感知一致性方面的局限性。通过在去噪 U-Net 的注意层中对齐特征，DiffSim 评估外观和风格相似性，显示出与人类视觉偏好的出色一致性。此外，我们引入了 Sref 和 IP 基准，分别在风格和实例级别评估视觉相似性。多个基准的综合评估表明，DiffSim 实现了最先进的性能，为测量生成模型中的视觉连贯性提供了强大的工具。

Title: Successive optimization of optics and post-processing with differentiable coherent PSF operator and field information

Authors: Zheng Ren, Jingwen Zhou, Wenguan Zhang, Jiapu Yan, Bingkun Chen, Huajun Feng, Shiqi Chen
Subjects: cs.CV, physics.optics
Abstract URL: https://arxiv.org/abs/2412.14603
Pdf URL: https://arxiv.org/pdf/2412.14603
Copy Paste: [[2412.14603]] Successive optimization of optics and post-processing with differentiable coherent PSF operator and field information(https://arxiv.org/abs/2412.14603)
Keywords: restoration
Abstract: Recently, the joint design of optical systems and downstream algorithms has shown significant potential. However, existing ray-based methods are limited to optimizing geometric degradation, making it difficult to fully represent the optical characteristics of complex, miniaturized lenses constrained by wavefront aberration or diffraction effects. In this work, we introduce a precise optical simulation model in which every operation in the pipeline is differentiable. This model employs a novel initial value strategy to enhance the reliability of intersection calculation on highly aspheric surfaces. Moreover, it utilizes a differential operator to reduce memory consumption during coherent point spread function calculations. To efficiently address various degradations, we design a joint optimization procedure that leverages field information. Guided by a general restoration network, the proposed method not only enhances the image quality, but also successively improves the optical performance across multiple lenses that are already at a professional level. This joint optimization pipeline offers innovative insights into the practical design of sophisticated optical systems and post-processing algorithms. The source code will be made publicly available at this https URL.
摘要：最近，光学系统和下游算法的联合设计显示出巨大的潜力。然而，现有的射线描述方法仅限于优化几何退化，难以完全表示受波前像差或衍射效应限制的复杂微型镜头的光学特性。在这项工作中，我们引入了一个精确的光学仿真模型，管道中的每个操作都是可微分的。该模型采用了一种新颖的初始值策略来增强高非球面相交计算的可靠性。此外，它利用微分算子来减少相干点扩展函数计算期间的内存消耗。为了有效解决各种退化问题，我们设计了一个利用场信息的联合优化程序。在通用恢复网络的指导下，所提出的方法不仅提高了图像质量，而且还逐步提高了已经达到专业水平的多个镜头的光学性能。这种联合优化管道为复杂光学系统和后处理算法的实际设计提供了创新的见解。源代码将在此 https URL 上公开提供

Title: Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models

Authors: Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.14628
Pdf URL: https://arxiv.org/pdf/2412.14628
Copy Paste: [[2412.14628]] Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models(https://arxiv.org/abs/2412.14628)
Keywords: generation, generative
Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua$^2$SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$, PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
摘要：扩散模型 (DM) 通过迭代去噪过程使 AI 图像生成变得民主化。量化是减轻推理成本和减小 DM 去噪器网络大小的主要技术。然而，随着去噪器从卷积 U-Nets 的变体发展到更新的 Transformer 架构，了解不同权重层、操作和架构类型对性能的量化敏感性变得越来越重要。在这项工作中，我们使用 Qua$^2$SeDiMo 来应对这一挑战，这是一个混合精度的训练后量化框架，它针对不同去噪器操作类型和块结构生成有关各种模型权重量化方法的成本效益的可解释见解。我们利用这些见解为从基础 U-Nets 到最先进的 Transformer 等各种扩散模型做出高质量的混合精度量化决策。结果显示，Qua$^2$SeDiMo 可以在 PixArt-${\alpha}$、PixArt-${\Sigma}$、Hunyuan-DiT 和 SDXL 上分别构建 3.4 位、3.9 位、3.65 位和 3.7 位权重量化。我们进一步将权重量化配置与 6 位激活量化配对，并在定量指标和生成图像质量方面优于现有方法。
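
Fractional bit-widths such as 3.4 bits can arise in mixed-precision quantization when different layers use different integer precisions and the figure is averaged over all weights. The sketch below shows that arithmetic only. The layer sizes and bit assignments are hypothetical, and the abstract does not state how the paper computes its averages.

```python
def average_bits(layers):
    """Weight-averaged bits per parameter.

    layers: list of (num_weights, bit_width) pairs, one per layer.
    """
    total_weights = sum(n for n, _ in layers)
    total_bits = sum(n * bits for n, bits in layers)
    return total_bits / total_weights


# Hypothetical model: 600 weights stored at 3 bits, 400 weights at 4 bits.
layers = [(600, 3), (400, 4)]
avg = average_bits(layers)
print(f"average precision: {avg:.2f} bits per weight")
```

Because the average is weighted by parameter count, assigning higher precision to a small but sensitive layer raises the overall figure only slightly.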

Title: Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model

Authors: Minglong Xue, Jinhong He, Shivakumara Palaiahnakote, Mingliang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14630
Pdf URL: https://arxiv.org/pdf/2412.14630
Copy Paste: [[2412.14630]] Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model(https://arxiv.org/abs/2412.14630)
Keywords: restoration
Abstract: Image restoration and enhancement are pivotal for numerous computer vision applications, yet unifying these tasks efficiently remains a significant challenge. Inspired by the iterative refinement capabilities of diffusion models, we propose CycleRDM, a novel framework designed to unify restoration and enhancement tasks while achieving high-quality mapping. Specifically, CycleRDM first learns the mapping relationships among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. Subsequently, we transfer the final calibration process to the wavelet low-frequency domain using discrete wavelet transform, performing fine-grained calibration from a frequency domain perspective by leveraging task-specific frequency spaces. To improve restoration quality, we design a feature gain module for the decomposed wavelet high-frequency domain to eliminate redundant features. Additionally, we employ multimodal textual prompts and Fourier transform to drive stable denoising and reduce randomness during the inference process. Extensive validation shows that CycleRDM generalizes effectively to a wide range of image restoration and enhancement tasks and, with only a small number of training samples, achieves significantly better results on various benchmarks of reconstruction quality and perceptual quality. The source code will be available at this https URL.
摘要：图像恢复和增强对于众多计算机视觉应用至关重要，但有效地统一这些任务仍然是一项重大挑战。受扩散模型的迭代细化能力的启发，我们提出了 CycleRDM，这是一个新颖的框架，旨在统一恢复和增强任务，同时实现高质量的映射。具体而言，CycleRDM 首先通过两阶段扩散推理过程学习退化域、粗糙法线域和法线域之间的映射关系。随后，我们使用离散小波变换将最终校准过程转移到小波低频域，利用特定于任务的频率空间从频域角度执行细粒度校准。为了提高恢复质量，我们为分解的小波高频域设计了一个特征增益模块，以消除冗余特征。此外，我们采用多模态文本提示和傅里叶变换来驱动稳定的去噪并减少推理过程中的随机性。经过大量验证，CycleRDM 可以有效地推广到广泛的图像恢复和增强任务，同时只需要少量的训练样本就可以在各种重建质量和感知质量基准上取得显著优势。源代码将在此 https URL 上提供。
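
The split into wavelet low-frequency and high-frequency domains mentioned in this abstract can be illustrated with a single level of the discrete wavelet transform. The sketch below uses a one-dimensional Haar wavelet for brevity; this is an assumption, since the abstract names neither the wavelet family nor the implementation, and images would need a two-dimensional transform.

```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (low, high): pairwise scaled sums (low-frequency band) and
    pairwise scaled differences (high-frequency band).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt_1d(low, high):
    """Invert haar_dwt_1d, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out


signal = [4.0, 4.0, 2.0, 0.0]
low, high = haar_dwt_1d(signal)
restored = haar_idwt_1d(low, high)
print(low, high, restored)
```

The flat pair (4.0, 4.0) produces a zero high-frequency coefficient, while the pair (2.0, 0.0) does not, which is why edits to the low band change coarse structure and edits to the high band change detail.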

Title: LoLaFL: Low-Latency Federated Learning via Forward-only Propagation

Authors: Jierui Zhang, Jianhao Huang, Kaibin Huang
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2412.14668
Pdf URL: https://arxiv.org/pdf/2412.14668
Copy Paste: [[2412.14668]] LoLaFL: Low-Latency Federated Learning via Forward-only Propagation(https://arxiv.org/abs/2412.14668)
Keywords: generation
Abstract: Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two nonlinear aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretical analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 91% and 98%, respectively, while maintaining comparable accuracies.
摘要：联邦学习 (FL) 已成为一种广泛采用的范例，可在确保数据隐私的同时实现分布式数据的边缘学习。然而，传统的通过反向传播训练深度神经网络的 FL 很难满足第六代 (6G) 移动网络中的低延迟学习要求。这一挑战主要源于要传输的高维模型参数以及由于训练过程固有的随机性而需要收敛的大量通信轮次。为了解决这个问题，我们采用了最先进的最大编码率降低原理来学习线性判别特征，并将得到的白盒神经网络扩展到 FL，通过前向传播产生了低延迟联邦学习 (LoLaFL) 的新框架。LoLaFL 实现了分层传输和聚合，通信轮次明显减少，从而大大降低了延迟。此外，我们为 LoLaFL 提出了两种 \emph{非线性} 聚合方案。第一个方案基于 LoLaFL 中最佳 NN 参数聚合应该是谐波均值的证明。第二种方案进一步利用特征的低秩结构，传输特征的低秩近似协方差矩阵，以实现额外的延迟减少。进行了理论分析和实验以评估 LoLaFL 的性能。与传统 FL 相比，LoLaFL 的两种非线性聚合方案分别可以实现超过 91% 和 98% 的延迟减少，同时保持相当的准确率。
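
To give a feel for what "harmonic-mean-like" aggregation means compared with the usual arithmetic averaging in FL, the sketch below computes a weighted harmonic mean of positive scalars. This is a scalar stand-in only: the paper aggregates neural network parameters, and its exact aggregation formula is not given in the abstract. The function name and the sample values are hypothetical.

```python
def harmonic_mean_aggregate(values, weights=None):
    """Weighted harmonic mean of positive scalars.

    Computes sum(w_i) / sum(w_i / v_i); with equal weights this is the
    classic harmonic mean n / sum(1 / v_i).
    """
    if weights is None:
        weights = [1.0] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return sum(weights) / sum(w / v for w, v in zip(weights, values))


# Hypothetical per-client values, for illustration only.
client_values = [1.0, 2.0, 4.0]
harmonic = harmonic_mean_aggregate(client_values)
arithmetic = sum(client_values) / len(client_values)
print(f"harmonic={harmonic:.4f}, arithmetic={arithmetic:.4f}")
```

The harmonic mean is nonlinear in its inputs and never exceeds the arithmetic mean for positive values, which is one way to see why such a scheme differs from standard linear averaging.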

Title: MUSTER: Longitudinal Deformable Registration by Composition of Consecutive Deformations

Authors: Edvard O. S. Grødem, Donatas Sederevičius, Esten H. Leonardsen, Bradley J. MacIntosh, Atle Bjørnerud, Till Schellhorn, Øystein Sørensen, Inge Amlien, Pablo F. Garrido, Anders M. Fjell
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2412.14671
Pdf URL: https://arxiv.org/pdf/2412.14671
Copy Paste: [[2412.14671]] MUSTER: Longitudinal Deformable Registration by Composition of Consecutive Deformations(https://arxiv.org/abs/2412.14671)
Keywords: generation
Abstract: Longitudinal imaging allows for the study of structural changes over time. One approach to detecting such changes is by non-linear image registration. This study introduces Multi-Session Temporal Registration (MUSTER), a novel method that facilitates longitudinal analysis of changes in extended series of medical images. MUSTER improves upon conventional pairwise registration by incorporating more than two imaging sessions to recover longitudinal deformations. Longitudinal analysis at the voxel level is challenging due to the effects of changing image contrast as well as instrumental and environmental sources of bias between sessions. We show that local normalized cross-correlation as an image similarity metric leads to biased results and propose a robust alternative. We test the performance of MUSTER on a synthetic multi-site, multi-session neuroimaging dataset and show that, in various scenarios, using MUSTER significantly enhances the estimated deformations relative to pairwise registration. Additionally, we apply MUSTER on a sample of older adults from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. The results show that MUSTER can effectively identify patterns of neuro-degeneration from T1-weighted images and that these changes correlate with changes in cognition, matching the performance of state-of-the-art segmentation methods. By leveraging GPU acceleration, MUSTER efficiently handles large datasets, making it feasible also in situations with limited computational resources.
摘要：纵向成像可以研究随时间的结构变化。检测此类变化的一种方法是非线性图像配准。本研究引入了多会话时间配准 (MUSTER)，这是一种有助于纵向分析扩展系列医学图像中变化的新方法。MUSTER 通过结合两个以上的成像会话来恢复纵向变形，从而改进了传统的成对配准。由于图像对比度变化以及会话之间的工具和环境偏差源的影响，体素级别的纵向分析具有挑战性。我们表明，局部归一化互相关作为图像相似性度量会导致有偏差的结果，并提出了一种可靠的替代方案。我们在合成的多站点、多会话神经成像数据集上测试了 MUSTER 的性能，并表明在各种情况下，使用 MUSTER 可以显着增强相对于成对配准的估计变形。此外，我们将 MUSTER 应用于阿尔茨海默病神经成像计划 (ADNI) 研究中的一组老年人样本。结果表明，MUSTER 能够有效地从 T1 加权图像中识别神经退化的模式，并且这些变化与认知变化相关，与最先进的分割方法的性能相匹配。通过利用 GPU 加速，MUSTER 可以高效处理大型数据集，即使在计算资源有限的情况下也可以使用。
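
For readers unfamiliar with the similarity metric this abstract criticizes, the sketch below computes normalized cross-correlation (NCC) on small patches and averages it over sliding windows to obtain a local variant. It is one-dimensional for brevity and illustrates only the baseline metric; the robust alternative proposed in the paper is not described in the abstract and is not shown. Returning 0 for a flat patch is a convention assumed here.

```python
import math


def normalized_cross_correlation(a, b):
    """NCC of two equal-length intensity patches, a value in [-1, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("patches must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        # Flat patch: correlation is undefined; 0 is an assumed convention.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def local_ncc(a, b, window):
    """Mean NCC over all sliding windows of the given size."""
    scores = [
        normalized_cross_correlation(a[i:i + window], b[i:i + window])
        for i in range(len(a) - window + 1)
    ]
    return sum(scores) / len(scores)


fixed = [1.0, 2.0, 3.0, 4.0]
moving = [2.0, 4.0, 6.0, 8.0]  # same structure, doubled contrast
print(local_ncc(fixed, moving, window=2))
```

Because each patch is mean-centered and normalized, the score is unchanged by a linear rescaling of intensities, which is why the metric is popular when contrast differs between sessions.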

Title: FiVL: A Framework for Improved Vision-Language Alignment

Authors: Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.14672
Pdf URL: https://arxiv.org/pdf/2412.14672
Copy Paste: [[2412.14672]] FiVL: A Framework for Improved Vision-Language Alignment(https://arxiv.org/abs/2412.14672)
Keywords: generation
Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at this https URL.
摘要：大型视觉语言模型 (LVLM) 在整合视觉和文本输入以实现多模态推理方面取得了重大进展。然而，一个反复出现的挑战是确保这些模型能够像利用语言内容一样有效地利用视觉信息，因为这两种模态对于形成准确的答案都是必需的。我们假设幻觉的产生是由于当前 LVLM 缺乏有效的视觉基础。这个问题延伸到视觉语言基准，很难使图像成为准确答案生成不可或缺的一部分，特别是在视觉问答任务中。在这项工作中，我们引入了 FiVL，这是一种构建数据集的新方法，旨在训练 LVLM 以增强视觉基础并评估其实现这一目标的有效性。这些数据集可用于训练和评估 LVLM 使用图像内容作为实质性证据的能力，而不是仅仅依赖语言先验，从而深入了解模型对视觉信息的依赖。为了展示我们数据集的实用性，我们引入了一项创新的训练任务，该任务的表现优于基线，同时还引入了可解释性的验证方法和应用程序。代码可在此 https URL 上找到。

Title: EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

Authors: Jianrong Zhang, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14706
Pdf URL: https://arxiv.org/pdf/2412.14706
Copy Paste: [[2412.14706]] EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space(https://arxiv.org/abs/2412.14706)
Keywords: generation
Abstract: Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
摘要：扩散模型，尤其是潜在扩散模型，在文本驱动的人体运动生成方面取得了显著的成功。然而，对于潜在扩散模型来说，将多个语义概念有效地组合成一个连贯的运动序列仍然具有挑战性。为了解决这个问题，我们提出了 EnergyMoGen，它包括两种基于能量的模型：(1) 我们将扩散模型解释为一种潜在感知的基于能量的模型，通过在潜在空间中组合一组扩散模型来生成运动；(2) 我们引入了一种基于交叉注意的语义感知能量模型，该模型支持文本嵌入的语义组合和自适应梯度下降。为了克服这两个频谱之间的语义不一致和运动失真的挑战，我们引入了协同能量融合。这种设计允许运动潜在扩散模型通过组合与文本描述相对应的多个能量项来合成高质量、复杂的运动。实验表明，我们的方法在各种动作生成任务上的表现均优于现有的最先进模型，包括文本到动作生成、组合动作生成和多概念动作生成。此外，我们还证明了我们的方法可用于扩展动作数据集并改进文本到动作任务。
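
Composing several concepts by combining energy terms is often realized, in diffusion models generally, by adding each concept's conditional correction to the unconditional noise prediction. The sketch below shows that commonly used formulation on plain lists of floats. It is an assumption that something similar underlies the latent-space composition described in the abstract; EnergyMoGen's actual fusion rule is not specified there, and all names and numbers below are hypothetical.

```python
def compose_noise_predictions(eps_uncond, eps_conds, weights):
    """Combine per-concept noise predictions into one composed prediction.

    composed = eps_uncond + sum_i weights[i] * (eps_conds[i] - eps_uncond)

    eps_uncond: unconditional prediction (list of floats).
    eps_conds:  one prediction per text concept, same length as eps_uncond.
    weights:    one guidance weight per concept.
    """
    composed = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        for j, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            composed[j] += w * (c - u)
    return composed


# Toy two-dimensional predictions, for illustration only.
eps_uncond = [0.0, 1.0]
eps_conds = [[1.0, 1.0], [0.0, 3.0]]  # e.g. "walk" and "wave" (hypothetical)
composed = compose_noise_predictions(eps_uncond, eps_conds, weights=[1.0, 0.5])
print(composed)
```

Each weight controls how strongly the corresponding concept steers the denoising step, so uneven weights let one concept dominate the composed motion.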

Title: Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data

Authors: Fabian Sven Karst, Sook-Yee Chong, Abigail A. Antenor, Enyu Lin, Mahei Manhai Li, Jan Marco Leimeister
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14730
Pdf URL: https://arxiv.org/pdf/2412.14730
Copy Paste: [[2412.14730]] Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data(https://arxiv.org/abs/2412.14730)
Keywords: generative
Abstract: The banking sector faces challenges in using deep learning due to data sensitivity and regulatory constraints, but generative AI may offer a solution. Thus, this study identifies effective algorithms for generating synthetic financial transaction data and evaluates five leading models - Conditional Tabular Generative Adversarial Networks (CTGAN), DoppelGANger (DGAN), Wasserstein GAN, Financial Diffusion (FinDiff), and Tabular Variational AutoEncoders (TVAE) - across five criteria: fidelity, synthesis quality, efficiency, privacy, and graph structure. While none of the algorithms is able to replicate the real data's graph structure, each excels in specific areas: DGAN is ideal for privacy-sensitive tasks, FinDiff and TVAE excel in data replication and augmentation, and CTGAN achieves a balance across all five criteria, making it suitable for general applications with moderate privacy concerns. As a result, our findings offer valuable insights for choosing the most suitable algorithm.
摘要：由于数据敏感性和监管限制，银行业在使用深度学习方面面临挑战，但生成式人工智能可能提供解决方案。因此，本研究确定了生成合成金融交易数据的有效算法，并根据五个标准评估了五种领先模型——条件表格生成对抗网络 (CTGAN)、DoppelGANger (DGAN)、Wasserstein GAN、金融扩散 (FinDiff) 和表格变分自动编码器 (TVAE)：保真度、合成质量、效率、隐私和图形结构。虽然没有一种算法能够复制真实数据的图形结构，但每种算法都在特定领域表现出色：DGAN 非常适合隐私敏感任务，FinDiff 和 TVAE 在数据复制和增强方面表现出色，而 CTGAN 在所有五个标准上实现了平衡，使其适用于具有中等隐私问题的一般应用。因此，我们的研究结果为选择最合适的算法提供了宝贵的见解。

Title: Extending TWIG: Zero-Shot Predictive Hyperparameter Selection for KGEs based on Graph Structure

Authors: Jeffrey Sardina, John D. Kelleher, Declan O'Sullivan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.14801
Pdf URL: https://arxiv.org/pdf/2412.14801
Copy Paste: [[2412.14801]] Extending TWIG: Zero-Shot Predictive Hyperparameter Selection for KGEs based on Graph Structure(https://arxiv.org/abs/2412.14801)
Keywords: generation
Abstract: Knowledge Graphs (KGs) have seen increasing use across various domains -- from biomedicine and linguistics to general knowledge modelling. In order to facilitate the analysis of knowledge graphs, Knowledge Graph Embeddings (KGEs) have been developed to automatically analyse KGs and predict new facts based on the information in a KG, a task called "link prediction". Many existing studies have documented that the structure of a KG, KGE model components, and KGE hyperparameters can significantly change how well KGEs perform and what relationships they are able to learn. Recently, the Topologically-Weighted Intelligence Generation (TWIG) model has been proposed as a solution to modelling how these elements relate to one another. In this work, we extend the previous research on TWIG and evaluate its ability to simulate the output of the KGE model ComplEx in the cross-KG setting. Our results are twofold. First, TWIG is able to summarise KGE performance on a wide range of hyperparameter settings and KGs being learned, suggesting that it represents a general knowledge of how to predict KGE performance from KG structure. Second, we show that TWIG can successfully predict hyperparameter performance on unseen KGs in the zero-shot setting. This second observation leads us to propose that, with additional research, optimal hyperparameter selection for KGE models could be determined in a pre-hoc manner using TWIG-like methods, rather than by using a full hyperparameter search.
摘要：知识图谱 (KG) 在各个领域得到了越来越广泛的应用——从生物医学和语言学到一般知识建模。为了便于分析知识图谱，人们开发了知识图谱嵌入 (KGE) 来自动分析 KG 并根据 KG 中的信息预测新事实，这项任务称为“链接预测”。许多现有研究表明，KG 的结构、KGE 模型组件和 KGE 超参数可以显著改变 KGE 的性能以及它们能够学习的关系。最近，拓扑加权智能生成 (TWIG) 模型被提出作为建模这些元素之间关系的解决方案。在这项工作中，我们扩展了之前对 TWIG 的研究，并评估了它在跨 KG 设置中模拟 KGE 模型 ComplEx 输出的能力。我们的结果是双重的。首先，TWIG 能够总结各种超参数设置和正在学习的 KG 上的 KGE 性能，这表明它代表了如何根据 KG 结构预测 KGE 性能的一般知识。其次，我们表明 TWIG 可以成功预测零样本设置中未见过的 KG 上的超参数性能。这第二个观察结果促使我们提出，通过进一步研究，可以使用类似 TWIG 的方法以预先确定的方式确定 KGE 模型的最佳超参数选择，而不是使用完整的超参数搜索。

Title: Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Authors: Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.14803
Pdf URL: https://arxiv.org/pdf/2412.14803
Copy Paste: [[2412.14803]] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations(https://arxiv.org/abs/2412.14803)
Keywords: generation
Abstract: Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, which were trained on two-image contrastive learning or single-image reconstruction, cannot perfectly capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human or robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8% increase in success rates for complex real-world dexterous manipulation tasks.
摘要：机器人技术的最新进展主要集中在开发能够执行多项任务的通用策略。通常，这些策略利用预先训练的视觉编码器从当前观察中捕获关键信息。然而，以前的视觉编码器是在双图像对比学习或单图像重建上进行训练的，无法完美地捕获具体任务所必需的序列信息。最近，视频扩散模型 (VDM) 已经展示了准确预测未来图像序列的能力，表现出对物理动力学的良好理解。受 VDM 强大的视觉预测能力的启发，我们假设它们固有地拥有反映物理世界演变的视觉表征，我们称之为预测视觉表征。基于这一假设，我们提出了视频预测策略 (VPP)，这是一种以 VDM 的预测视觉表征为条件的通用机器人策略。为了进一步增强这些表征，我们结合了不同的人类或机器人操作数据集，采用统一的视频生成训练目标。VPP 在两个模拟和两个现实世界基准测试中始终优于现有方法。值得注意的是，与之前的最先进技术相比，它在 Calvin ABC-D 基准中实现了 28.1％的相对改进，并且使复杂的现实世界灵巧操作任务的成功率提高了 28.8％。

Title: MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models

Authors: Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Hunag, Yuhua Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.14902
Pdf URL: https://arxiv.org/pdf/2412.14902
Copy Paste: [[2412.14902]] MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models(https://arxiv.org/abs/2412.14902)
Keywords: generation
Abstract: Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as simply as famous ones, e.g., by just using a name? In this paper, we explore the existence of a "Name Space", where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by the text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in giving the generated images good identity consistency. Note that, like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, so the original generation capability of text-to-image models is well preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models. Project homepage: this https URL.
摘要：大规模文本到图像传播模型（例如 DALL-E、SDXL）能够通过简单地引用他们的名字来生成名人。是否有可能让这样的模型生成像名人一样简单的通用身份，例如只使用一个名字？在本文中，我们探索了“名称空间”的存在，其中空间中的任何点都对应一个特定的身份。幸运的是，我们在名人姓名的文本嵌入所跨越的特征空间中找到了一些线索。具体来说，我们首先使用传播模型的文本编码器从 Laion5B 数据集中提取名人姓名的嵌入。此类嵌入用作监督来学习可以预测给定人脸图像的名称（实际上是嵌入）的编码器。我们通过实验发现，这种姓名嵌入在保证生成的图像具有良好的身份一致性方面效果很好。请注意，与名人姓名一样，我们预测的姓名嵌入与文本输入的语义分离，从而很好地保留了文本转图像模型的原始生成能力。此外，只需插入此类姓名嵌入，所有源自同一基础模型（即 SDXL）的变体（例如来自 Civitai 的变体）都可以轻松成为身份感知的文本转图像模型。项目主页：\url{此 https URL}。

Title: Movie2Story: A framework for understanding videos and telling stories in the form of novel text

Authors: Kangning Li, Zheyang Jia, Anyu Ying
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.14965
Pdf URL: https://arxiv.org/pdf/2412.14965
Copy Paste: [[2412.14965]] Movie2Story: A framework for understanding videos and telling stories in the form of novel text(https://arxiv.org/abs/2412.14965)
Keywords: generation
Abstract: Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, there is still a deficiency in generating rich long-form text descriptions that integrate both video and audio. In this paper, we introduce a framework called M2S, designed to generate novel-length text by combining audio, video, and character recognition. M2S includes modules for video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. By integrating multimodal information using the large language model GPT4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Additionally, the model framework has good scalability and significant potential for future research.
摘要：多模态视频转文本模型取得了长足进步，主要体现在生成视频内容的简短描述上，但在生成融合音频和视频的丰富长文本描述方面还存在不足。本文介绍了一个框架M2S，旨在结合音频、视频和字符识别生成长文本。M2S包括视频长文本描述与理解模块、基于音频的情绪、语速和字符对齐分析模块、基于视觉的字符识别对齐模块。通过使用大型语言模型GPT4o整合多模态信息，M2S在多模态文本生成领域脱颖而出。我们通过对比实验和人工评估证明了M2S的有效性和准确性。此外，该模型框架具有良好的可扩展性和巨大的研究潜力。

Title: Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations

Authors: Jianhua Sun, Yuxuan Li, Jiude Wei, Longfei Xu, Nange Wang, Yining Zhang, Cewu Lu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.14974
Pdf URL: https://arxiv.org/pdf/2412.14974
Copy Paste: [[2412.14974]] Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations(https://arxiv.org/abs/2412.14974)
Keywords: generation
Abstract: The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose the Articulated Object Procedural Generation toolbox, a.k.a. the Arti-PG toolbox. The Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects' point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g., affordance, semantics, etc.) to provide annotations to the synthesized objects. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 categories of articulated objects and provides annotations across a wide range of both vision and manipulation tasks, and we provide exhaustive experiments which fully demonstrate its advantages. We will make the Arti-PG toolbox publicly available for the community to use.
摘要：获取大量的3D关节物体数据不仅成本高昂而且耗时，因此3D关节物体数据的稀缺性成为深度学习方法在各种关节物体理解任务中取得优异表现的一大障碍。同时，将这些物体数据与详细的注释配对以进行各种任务的训练也很困难，而且需要大量人力。为了迅速收集大量带有全面详细注释的3D关节物体用于训练，我们提出了关节物体程序生成工具箱，又称Arti-PG工具箱。Arti-PG工具箱包括：i）通过通用结构程序对关节物体的描述以及它们与物体点云的解析对应关系，ii）关于对结构程序进行操作的程序规则，以合成大规模和多样化的新关节物体，以及iii）对知识（例如可供性、语义等）的数学描述，为合成的对象提供注释。 Arti-PG 具有两个吸引人的特性，可以为关节物体理解任务提供训练数据：i）通过面向程序的结构操作创建具有无限形状变化的物体；ii）Arti-PG 可广泛适用于各种任务，可轻松提供全面而详细的注释。Arti-PG 现在支持 26 类关节物体的程序生成，并在广泛的视觉和操作任务中提供注释，我们提供详尽的实验，充分展示其优势。我们将向社区公开 Arti-PG 工具箱供其使用。

Title: DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space

Authors: Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.15032
Pdf URL: https://arxiv.org/pdf/2412.15032
Copy Paste: [[2412.15032]] DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space(https://arxiv.org/abs/2412.15032)
Keywords: generation, generative
Abstract: This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to high-resolution generation without using the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why "image diffusion can be seen as spectral autoregression", bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at this https URL.
摘要：本文从频率空间探索图像建模，并介绍 DCTdiff，这是一种端到端扩散生成范式，可在离散余弦变换 (DCT) 空间中高效建模图像。我们研究了 DCTdiff 的设计空间并揭示了关键的设计因素。在不同框架 (UViT、DiT)、生成任务和各种扩散采样器上进行的实验表明，DCTdiff 在生成质量和训练效率方面优于基于像素的扩散模型。值得注意的是，DCTdiff 可以在不使用潜在扩散范式的情况下无缝扩展到高分辨率生成。最后，我们说明了 DCT 图像建模的几个有趣特性。例如，我们提供了为什么“图像扩散可以看作是光谱自回归”的理论证明，从而弥合了扩散和自回归模型之间的差距。DCTdiff 的有效性和引入的属性为频率空间中的图像建模提供了一个有希望的方向。代码位于 \url{this https URL}。
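
Note: to make the "DCT space" mentioned above concrete, here is a minimal sketch of an orthonormal DCT-II and its inverse in plain Python. It only illustrates the transform itself; it is not DCTdiff's pipeline, and the function names (`dct2_1d`, `idct2_1d`, `dct2_2d`) are hypothetical helpers chosen for this example.

```python
import math

def dct2_1d(x):
    """Orthonormal DCT-II of a 1D sequence of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2_1d(c):
    """Inverse of dct2_1d (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        s = math.sqrt(1.0 / n) * c[0]
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def dct2_2d(block):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(r) for r in block]
    cols = [dct2_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [row_freq][col_freq]

# A constant block has all of its energy in the single DC coefficient.
flat = [[1.0] * 4 for _ in range(4)]
coeffs = dct2_2d(flat)
```

For smooth image content most of the energy lands in a few low-frequency coefficients, which is the usual motivation for modeling in a frequency basis.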

Title: Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion

Authors: Zhifei Chen, Tianshuo Xu, Wenhang Ge, Leyi Wu, Dongyu Yan, Jing He, Luozhou Wang, Lu Zeng, Shunsi Zhang, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15050
Pdf URL: https://arxiv.org/pdf/2412.15050
Copy Paste: [[2412.15050]] Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion(https://arxiv.org/abs/2412.15050)
Keywords: generation
Abstract: Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Although existing rendering methods achieve promising results, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constraint, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposes intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.
摘要：渲染和逆渲染是计算机视觉和图形学中的关键任务。渲染方程是这两项任务的核心，是将固有属性转换为 RGB 图像的理想条件分布传递函数。尽管现有渲染方法取得了令人满意的结果，但它们只是近似于特定场景的理想估计，并且计算成本很高。此外，由于固有的模糊性，逆条件分布传递难以处理。为了应对这些挑战，我们提出了一种数据驱动的方法，将渲染和逆渲染联合建模为单个扩散框架内的两个条件生成任务。受 UniDiffuser 的启发，我们利用两个不同的时间表来建模这两项任务，并通过量身定制的双流模块，实现了两个预训练扩散模型的交叉调节。这种统一的方法称为 Uni-Renderer，它允许两个过程通过周期一致的约束相互促进，通过强制固有属性和渲染图像之间的一致性来减轻模糊性。结合精心准备的数据集，我们的方法有效地分解了固有属性，并展示了强大的识别渲染过程中变化的能力。我们将向公众开源我们的训练和推理代码，促进该领域的进一步研究和开发。

Title: Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation

Authors: Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.15086
Pdf URL: https://arxiv.org/pdf/2412.15086
Copy Paste: [[2412.15086]] Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation(https://arxiv.org/abs/2412.15086)
Keywords: generation, generative
Abstract: We consider the conditional generation of 3D drug-like molecules with explicit control over molecular properties such as drug-like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effective binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation. Extensive experiments validate our model's effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.
摘要：我们考虑有条件地生成 3D 类药物分子，并对分子属性（例如类药物属性（例如，类药物性的定量估计或合成可及性评分））进行 \textit{显式控制}，并有效地结合到特定的蛋白质位点。为了解决这个问题，我们提出了一个 E(3) 等变 Wasserstein 自动编码器，并将我们的生成模型的潜在空间分解为两个解开的方面：分子属性和 3D 分子的剩余结构背景。我们的模型确保对这些分子属性进行显式控制，同时保持坐标表示的等方差和数据似然的不变性。此外，我们引入了一种新的基于对齐的坐标损失，以使等变网络适应从头开始的自回归从头 3D 分子生成。大量实验验证了我们的模型在属性引导和上下文引导的分子生成方面的有效性，既适用于从头 3D 分子设计，也适用于针对蛋白质靶标的基于结构的药物发现。

Title: Parallelized Autoregressive Visual Generation

Authors: Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15119
Pdf URL: https://arxiv.org/pdf/2412.15119
Copy Paste: [[2412.15119]] Parallelized Autoregressive Visual Generation(https://arxiv.org/abs/2412.15119)
Keywords: generation
Abstract: Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies: tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: this https URL.
摘要：自回归模型已成为一种强大的视觉生成方法，但由于其逐个标记的顺序预测过程，推理速度较慢。在本文中，我们提出了一种简单而有效的并行自回归视觉生成方法，该方法在保留自回归建模优势的同时提高了生成效率。我们的主要见解是并行生成取决于视觉标记依赖性 - 具有弱依赖性的标记可以并行生成，而具有强依赖性的相邻标记很难一起生成，因为它们的独立采样可能会导致不一致。基于这一观察，我们开发了一种并行生成策略，该策略可以并行生成具有弱依赖性的远距离标记，同时保持强依赖性本地标记的顺序生成。我们的方法可以无缝集成到标准自回归模型中，而无需修改架构或标记器。在 ImageNet 和 UCF-101 上的实验表明，我们的方法在图像和视频生成任务中实现了 3.6 倍的加速和相当的质量，并且实现了高达 9.5 倍的加速和最小的质量下降。我们希望这项工作能够激发未来高效视觉生成和统一自回归建模的研究。项目页面：此 https URL。
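
Note: the idea of generating distant tokens together while keeping nearby tokens sequential can be sketched as a generation schedule over a token grid. The abstract does not give the exact schedule, so the region-based grouping below is an assumption made for illustration, and `parallel_schedule`, `grid`, and `region` are hypothetical names.

```python
def parallel_schedule(grid, region):
    """Group the tokens of a grid x grid map into generation steps.

    The grid is split into region x region blocks. Step (dr, dc) emits the
    token at offset (dr, dc) inside every block, so tokens emitted together
    are at least `region` positions apart, while tokens inside one block are
    still produced one after another.
    """
    if grid % region != 0:
        raise ValueError("grid must be divisible by region")
    steps = []
    for dr in range(region):
        for dc in range(region):
            step = [
                (r0 + dr, c0 + dc)
                for r0 in range(0, grid, region)
                for c0 in range(0, grid, region)
            ]
            steps.append(step)
    return steps

# A 4x4 token map with 2x2 regions needs 4 steps instead of 16.
steps = parallel_schedule(grid=4, region=2)
```

In this toy setting the step count drops from grid² to region², at the cost of sampling the tokens within a step independently of each other.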

Title: Jet: A Modern Transformer-Based Normalizing Flow

Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.15129
Pdf URL: https://arxiv.org/pdf/2412.15129
Copy Paste: [[2412.15129]] Jet: A Modern Transformer-Based Normalizing Flow(https://arxiv.org/abs/2412.15129)
Keywords: generation, generative
Abstract: In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute the log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of the samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.
摘要：过去，正则化生成流已成为自然图像生成模型中一类很有前途的类别。这种类型的模型具有许多建模优势：能够高效计算输入数据的对数似然、生成速度快、整体结构简单。正则化流一直是活跃的研究主题，但后来失宠了，因为样本的视觉质量无法与其他模型类别（如 GAN、基于 VQ-VAE 的方法或扩散模型）相媲美。在本文中，我们重新审视了基于耦合的正则化流模型的设计，仔细消除了先前的设计选择，并使用基于 Vision Transformer 架构（而不是卷积神经网络）的计算块。因此，我们以更简单的架构实现了最先进的定量和定性性能。虽然整体视觉质量仍然落后于当前最先进的模型，但我们认为强大的正则化流模型可以作为更强大的生成模型的构建组件，帮助推进研究前沿。
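
Note: a coupling layer is what makes the exact log-likelihood cheap to compute in this family of models. The sketch below shows a generic affine coupling layer in plain Python; the abstract does not specify Jet's exact coupling form, so the affine variant is an assumption, and `toy_scale_shift` is a hypothetical stand-in for the learned network (a Vision Transformer block, per the abstract). Inputs are assumed to have even length.

```python
import math

def coupling_forward(x, scale_shift):
    """Keep the first half of x; rescale and shift the second half
    using parameters computed only from the first half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, t = scale_shift(x1)
    y2 = [b * math.exp(ls) + sh for b, ls, sh in zip(x2, log_s, t)]
    log_det = sum(log_s)  # log |det Jacobian| of the transform
    return x1 + y2, log_det

def coupling_inverse(y, scale_shift):
    """Exact inverse: the untouched half lets us recompute log_s and t."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_s, t = scale_shift(y1)
    x2 = [(b - sh) * math.exp(-ls) for b, ls, sh in zip(y2, log_s, t)]
    return y1 + x2

def toy_scale_shift(x1):
    # Hypothetical conditioner; a real flow would use a learned network here.
    return [math.tanh(v) for v in x1], [0.5 * v for v in x1]

x = [0.3, -1.2, 2.0, 0.7]
y, log_det = coupling_forward(x, toy_scale_shift)
x_back = coupling_inverse(y, toy_scale_shift)
```

Because the Jacobian is triangular, the log-determinant is just the sum of the log-scales, whatever network produces them.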

Title: Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Authors: Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.15156
Pdf URL: https://arxiv.org/pdf/2412.15156
Copy Paste: [[2412.15156]] Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM(https://arxiv.org/abs/2412.15156)
Keywords: generation
Abstract: Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free, and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
摘要：文本到视频模型通过对高质量文本-视频对的优化取得了显著进步，其中文本提示在确定输出视频的质量方面起着关键作用。然而，实现所需的输出通常需要多次修改和迭代推理以改进用户提供的提示。当前用于改进提示的自动方法在应用于文本到视频传播模型时遇到了诸如模态不一致、成本差异和模型不感知等挑战。为了解决这些问题，我们引入了一个基于 LLM 的提示自适应框架，称为 Prompt-A-Video，它擅长制作以视频为中心、无需劳动力和偏好一致的提示，以适应特定的视频传播模型。我们的方法涉及精心设计的两阶段优化和对齐系统。首先，我们进行奖励引导的提示演进管道，以自动创建最佳提示池并利用它们对 LLM 进行监督微调 (SFT)。然后使用多维奖励为 SFT 模型生成成对数据，然后使用直接偏好优化 (DPO) 算法进一步促进偏好对齐。通过大量实验和比较分析，我们验证了 Prompt-A-Video 在不同生成模型中的有效性，突出了其突破视频生成界限的潜力。
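
Note: the DPO step mentioned above optimizes a pairwise objective on (preferred, rejected) outputs. The sketch below computes the standard DPO loss for a single pair from sequence log-probabilities; how Prompt-A-Video builds its pairs and which `beta` it uses are not stated in the abstract, so the argument names and the default `beta=0.1` are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-probability of each response under the policy being trained
    ref_logp_* : log-probability of the same response under the frozen reference
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference gives log(2); favoring the chosen
# response more than the reference does pushes the loss below that.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```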

Title: OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Authors: Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15159
Pdf URL: https://arxiv.org/pdf/2412.15159
Copy Paste: [[2412.15159]] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization(https://arxiv.org/abs/2412.15159)
Keywords: generation, quality assessment
Abstract: In recent years, the field of text-to-video (T2V) generation has made significant strides. Despite this progress, there is still a gap between theoretical advancements and practical application, amplified by issues like degraded image quality and flickering artifacts. Recent advancements in enhancing the video diffusion model (VDM) through feedback learning have shown promising results. However, these methods still exhibit notable limitations, such as misaligned feedback and inferior scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically for video diffusion models. Our method features two novel designs. First, instead of directly using image-based reward feedback, we leverage a video quality assessment (VQA) model trained on synthetic data as the reward model to provide distribution- and modality-aligned feedback on the video diffusion model. Second, we introduce an online DPO algorithm to address the off-policy optimization and scalability issues in existing video preference learning frameworks. By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on the open-source video-diffusion model demonstrate OnlineVPO as a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models, offering valuable insights for future advancements in this domain.
摘要：近年来，文本转视频 (T2V) 生成领域取得了重大进展。尽管取得了这些进展，但理论进步与实际应用之间仍然存在差距，图像质量下降和闪烁伪影等问题进一步加剧了这一差距。通过反馈学习增强视频扩散模型 (VDM) 的最新进展已显示出令人鼓舞的结果。然而，这些方法仍然存在明显的局限性，例如反馈不一致和可扩展性较差。为了解决这些问题，我们引入了 OnlineVPO，这是一种专为视频扩散模型量身定制的更有效的偏好学习方法。我们的方法有两种新颖的设计，首先，我们不是直接使用基于图像的奖励反馈，而是利用在合成数据上训练的视频质量评估 (VQA) 模型作为奖励模型，为视频扩散模型提供分布和模态对齐的反馈。此外，我们引入了一种在线 DPO 算法来解决现有视频偏好学习框架中的离策略优化和可扩展性问题。通过使用视频奖励模型动态提供简洁的视频反馈，OnlineVPO 提供了有效且高效的偏好指导。对开源视频传播模型进行的大量实验表明，OnlineVPO 是一种简单而有效、更重要的是可扩展的视频传播模型偏好学习算法，为该领域的未来发展提供了宝贵的见解。

Title: Rethinking Uncertainty Estimation in Natural Language Generation

Authors: Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.15176
Pdf URL: https://arxiv.org/pdf/2412.15176
Copy Paste: [[2412.15176]] Rethinking Uncertainty Estimation in Natural Language Generation(https://arxiv.org/abs/2412.15176)
Keywords: generation
Abstract: Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.
摘要：大型语言模型 (LLM) 在实际应用中的应用越来越广泛，这推动了对其生成文本的可信度进行评估的需求。为此，可靠的不确定性估计至关重要。由于当前的 LLM 通过随机过程自回归地生成文本，因此相同的提示可能会导致不同的输出。因此，领先的不确定性估计方法会生成并分析多个输出序列以确定 LLM 的不确定性。但是，生成输出序列的计算成本很高，因此这些方法在大规模上不切实际。在这项工作中，我们检查了领先方法的理论基础，并探索了提高其计算效率的新方向。在适当评分规则框架的基础上，我们发现最可能的输出序列的负对数似然构成了一个有理论依据的不确定性度量。为了近似这个替代度量，我们提出了 G-NLL，它的优点是只需使用贪婪解码生成的单个输出序列即可获得。这使得不确定性估计更加高效和直接，同时保持了理论严谨性。实证结果表明，G-NLL 在各种 LLM 和任务中均实现了最佳性能。我们的工作为自然语言生成中高效可靠的不确定性估计奠定了基础，挑战了目前该领域领先的计算量更大的方法的必要性。
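
Note: the measure described above needs only one greedy decoding pass. The following minimal sketch computes the negative log-likelihood of a greedily decoded sequence; `toy_model` is a hypothetical stand-in for an LLM's next-token distribution, and `greedy_nll`, `max_len`, and the `<eos>` marker are names assumed for this example.

```python
import math

def greedy_nll(next_token_probs, max_len, eos="<eos>"):
    """Greedy-decode one sequence and accumulate its negative log-likelihood.

    next_token_probs(prefix) must return a dict mapping token -> probability.
    """
    prefix, nll = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(tuple(prefix))
        token = max(probs, key=probs.get)  # greedy choice at this step
        nll -= math.log(probs[token])
        prefix.append(token)
        if token == eos:
            break
    return prefix, nll

def toy_model(prefix):
    # Hypothetical two-step model: confident first token, less confident stop.
    if not prefix:
        return {"a": 0.8, "b": 0.2}
    return {"<eos>": 0.5, "a": 0.3, "b": 0.2}

sequence, uncertainty = greedy_nll(toy_model, max_len=10)
# sequence is ["a", "<eos>"]; uncertainty is -log(0.8 * 0.5)
```

A higher value means the model assigns lower probability to its own greedy output, which is read as higher uncertainty.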

Title: Tiled Diffusion

Authors: Or Madar, Ohad Fried
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15185
Pdf URL: https://arxiv.org/pdf/2412.15185
Copy Paste: [[2412.15185]] Tiled Diffusion(https://arxiv.org/abs/2412.15185)
Keywords: generation, generative
Abstract: Image tiling -- the seamless connection of disparate images to create a coherent visual field -- is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.
摘要：图像平铺（无缝连接不同的图像以创建连贯的视觉场）对于纹理创建、视频游戏资产开发和数字艺术等应用至关重要。传统上，平铺是手动构建的，这种方法在可扩展性和灵活性方面存在很大限制。最近的研究试图使用生成模型来自动化这一过程。然而，当前的方法主要侧重于平铺纹理和操纵用于单图像生成的模型，而本身并不支持跨不同域创建多个互连的平铺。本文介绍了平铺扩散，这是一种新颖的方法，它扩展了扩散模型的功能，以适应在需要平铺的各种图像合成领域中生成有凝聚力的平铺模式。我们的方法支持广泛的平铺场景，从自平铺到复杂的多对多连接，实现多幅图像的无缝集成。 Tiled Diffusion 自动化了平铺过程，无需人工干预，并增强了各种应用中的创造可能性，例如无缝平铺现有图像、平铺纹理创建和 360° 合成。

Title: AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.15191
Pdf URL: https://arxiv.org/pdf/2412.15191
Copy Paste: [[2412.15191]] AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation(https://arxiv.org/abs/2412.15191)
Keywords: generation
Abstract: We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL
摘要：我们提出了 AV-Link，这是一个统一的视频到音频和音频到视频生成框架，它利用冻结视频和音频扩散模型的激活来实现时间对齐的跨模态调节。我们框架的关键是一个融合块，它通过时间对齐的自注意力操作实现我们骨干视频和音频扩散模型之间的双向信息交换。与之前使用针对其他任务进行预训练的特征提取器来处理调节信号的工作不同，AV-Link 可以直接利用单个框架中互补模态获得的特征，即利用视频特征生成音频，或利用音频特征生成视频。我们广泛评估了我们的设计选择，并展示了我们的方法实现同步和高质量视听内容的能力，展示了其在沉浸式媒体生成中的应用潜力。项目页面：此 http URL

Title: DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Authors: Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.15200
Pdf URL: https://arxiv.org/pdf/2412.15200
Copy Paste: [[2412.15200]] DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation(https://arxiv.org/abs/2412.15200)
Keywords: generation
Abstract: Procedural Content Generation (PCG) is powerful in creating high-quality 3D contents, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately, and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.
摘要：程序内容生成 (PCG) 在创建高质量 3D 内容方面功能强大，但控制它以产生所需的形状很困难，而且通常需要大量的参数调整。逆程序内容生成旨在自动找到输入条件下的最佳参数。然而，现有的基于采样和基于神经网络的方法仍然存在大量样本迭代或可控性有限的问题。在这项工作中，我们提出了 DI-PCG，一种用于一般图像条件的逆 PCG 的新颖而有效的方法。其核心是一个轻量级的扩散变压器模型，其中 PCG 参数直接被视为去噪目标，观察到的图像作为控制参数生成的条件。DI-PCG 高效而有效。仅需 7.6M 网络参数和 30 个 GPU 小时的训练时间，它就表现出了准确恢复参数的卓越性能，并且很好地推广到自然图像。定量和定性实验结果验证了 DI-PCG 在逆 PCG 和图像到 3D 生成任务中的有效性。 DI-PCG 为高效的逆 PCG 提供了一种有前途的方法，并代表了朝着 3D 生成路径迈出的宝贵探索一步，该路径模拟了如何使用参数模型构建 3D 资产。

Title: FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching

Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15205
Pdf URL: https://arxiv.org/pdf/2412.15205
Copy Paste: [[2412.15205]] FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching(https://arxiv.org/abs/2412.15205)
Keywords: generation
Abstract: Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Code will be available at this https URL.
摘要：自回归 (AR) 建模通过使模型能够通过下一个标记预测生成具有连贯性和上下文理解的文本，在自然语言处理中取得了显著的成功。最近，在图像生成中，VAR 提出了尺度自回归建模，将下一个标记预测扩展到下一个尺度预测，保留了图像的二维结构。然而，VAR 遇到了两个主要挑战：(1) 其复杂而僵化的尺度设计限制了下一个尺度预测的泛化，(2) 生成器对具有相同复杂尺度结构的离散标记器的依赖限制了更新标记器的模块化和灵活性。为了解决这些限制，我们引入了 FlowAR，这是一种通用的下一个尺度预测方法，具有精简的尺度设计，其中每个后续尺度只是前一个尺度的两倍。这消除了对 VAR 复杂的多尺度残差标记器的需要，并允许使用任何现成的变分自动编码器 (VAE)。我们的简化设计增强了下一个尺度预测的泛化，并促进了流匹配的集成以实现高质量的图像合成。我们在具有挑战性的 ImageNet-256 基准上验证了 FlowAR 的有效性，与之前的方法相比，其生成性能更出色。代码将在 \url{此 https URL} 上提供。
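
Note: the "each subsequent scale is simply double the previous one" design can be written down in a few lines. The sketch below lists the side lengths of such a scale sequence; the final latent size and the number of scales are not given in the abstract, so `final_size=16` and `num_scales=5` are assumptions, and `doubling_scales` is a hypothetical helper.

```python
def doubling_scales(final_size, num_scales):
    """Side lengths of a coarse-to-fine sequence where each scale doubles the last."""
    smallest, remainder = divmod(final_size, 2 ** (num_scales - 1))
    if remainder != 0 or smallest == 0:
        raise ValueError("final_size must be a positive multiple of 2**(num_scales - 1)")
    return [smallest * 2 ** k for k in range(num_scales)]

scales = doubling_scales(final_size=16, num_scales=5)  # [1, 2, 4, 8, 16]
tokens_per_scale = [s * s for s in scales]             # [1, 4, 16, 64, 256]
```

Each step quadruples the number of spatial positions, so the finest scale dominates the total token count.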

Title: Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

Authors: Hadi Alzayer, Philipp Henzler, Jonathan T. Barron, Jia-Bin Huang, Pratul P. Srinivasan, Dor Verbin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15211
Pdf URL: https://arxiv.org/pdf/2412.15211
Copy Paste: [[2412.15211]] Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation(https://arxiv.org/abs/2412.15211)
Keywords: generative
Abstract: Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both synthetic and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent "shiny" appearance which cannot be reconstructed by prior methods.
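Summary: Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult because the illumination, and therefore the object appearance, varies across the captured images. This is especially challenging for specular objects, whose appearance depends strongly on the viewing direction. Some prior methods model appearance variation across images with a per-image embedding vector, while others use physically based rendering to recover materials and per-image illumination. When the input illumination varies significantly, these methods fail to faithfully recover view-dependent appearance and tend to produce mostly diffuse results. We propose a method that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model, and then reconstructing the object's geometry and appearance with a radiance field architecture that is robust to the small inconsistencies remaining among the relit images. We validate the method on synthetic and real datasets and show that it far outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. The method is also particularly effective at recovering view-dependent "shiny" appearance that prior methods cannot reconstruct.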

Title: Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.15213
Pdf URL: https://arxiv.org/pdf/2412.15213
Copy Paste: [[2412.15213]] Flowing from Words to Pixels: A Framework for Cross-Modality Evolution(https://arxiv.org/abs/2412.15213)
Keywords: super-resolution, generation
Abstract: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
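Summary: Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. The conventional approach is to learn a complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, the same noise-to-image mapping is learned while a conditioning mechanism is included in the model. One key and so far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. In this paper we therefore propose a paradigm shift and ask whether flow matching models can instead be trained to learn a direct mapping from the distribution of one modality to the distribution of another, removing the need for both the noise distribution and the conditioning mechanism. We present CrossFlow, a general and simple framework for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data and introduce a method that enables classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing interesting latent arithmetic that produces semantically meaningful edits in the output space. To demonstrate the generality of the approach, we also show that CrossFlow matches or outperforms the state of the art on various cross-modal and intra-modal mapping tasks, namely image captioning, depth estimation, and image super-resolution. We hope this paper helps accelerate progress in cross-modal media generation.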