2024-12-06

Title: HunyuanVideo: A Systematic Framework For Large Video Generative Models

Authors: Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Bai Jiawang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Daquan Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong (Refer to the report for detailed contributions)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03603
Pdf URL: https://arxiv.org/pdf/2412.03603
Copy Paste: [[2412.03603]] HunyuanVideo: A Systematic Framework For Large Video Generative Models(https://arxiv.org/abs/2412.03603)
Keywords: generation, generative
Abstract: Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.

Title: MV-Adapter: Multi-view Consistent Image Generation Made Easy

Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03632
Pdf URL: https://arxiv.org/pdf/2412.03632
Copy Paste: [[2412.03632]] MV-Adapter: Multi-view Consistent Image Generation Made Easy(https://arxiv.org/abs/2412.03632)
Keywords: generation
Abstract: Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.

Title: A Water Efficiency Dataset for African Data Centers

Authors: Noah Shumba, Opelo Tshekiso, Pengfei Li, Giulia Fanti, Shaolei Ren
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.03716
Pdf URL: https://arxiv.org/pdf/2412.03716
Copy Paste: [[2412.03716]] A Water Efficiency Dataset for African Data Centers(https://arxiv.org/abs/2412.03716)
Keywords: generation
Abstract: AI computing and data centers consume a large amount of freshwater, both directly for cooling and indirectly for electricity generation. While most attention has been paid to developed countries such as the U.S., this paper presents the first-of-its-kind dataset that combines nation-level weather and electricity generation data to estimate water usage efficiency for data centers in 41 African countries across five different climate regions. We also use our dataset to evaluate and estimate the water consumption of inference on two large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African countries. Our findings show that writing a 10-page report using Llama-3-70B could consume about 0.7 liters of water, while the water consumption by GPT-4 for the same task may go up to about 60 liters. For writing a medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about 0.13 liters and 3 liters of water, respectively. Interestingly, given the same AI model, 8 out of the 11 selected African countries consume less water than the global average, mainly because of lower water intensities for electricity generation. However, water consumption can be substantially higher in some African countries with a steppe climate than the U.S. and global averages, prompting more attention when deploying AI computing in these countries. Our dataset is publicly available on Hugging Face (this https URL).

Title: HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution

Authors: Yuxuan Jiang, Ho Man Kwan, Tianhao Peng, Ge Gao, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03748
Pdf URL: https://arxiv.org/pdf/2412.03748
Copy Paste: [[2412.03748]] HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution(https://arxiv.org/abs/2412.03748)
Keywords: super-resolution
Abstract: Recent advances in implicit neural representations (INRs) have shown significant promise in modeling visual signals for various low-level vision tasks including image super-resolution (ISR). INR-based ISR methods typically learn continuous representations, providing flexibility for generating high-resolution images at any desired scale from their low-resolution counterparts. However, existing INR-based ISR methods utilize multi-layer perceptrons for parameterization in the network; this does not take account of the hierarchical structure existing in local sampling points and hence constrains the representation capability. In this paper, we propose a new Hierarchical encoding based Implicit Image Function for continuous image super-resolution, HIIF, which leverages a novel hierarchical positional encoding that enhances the local implicit representation, enabling it to capture fine details at multiple scales. Our approach also embeds a multi-head linear attention mechanism within the implicit attention network by taking additional non-local information into account. Our experiments show that, when integrated with different backbone encoders, HIIF outperforms the state-of-the-art continuous image super-resolution methods by up to 0.17dB in PSNR. The source code of HIIF will be made publicly available at this http URL.
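
Note on the metric: the HIIF abstract reports its gain in PSNR (peak signal-to-noise ratio). The following is a minimal sketch of how PSNR is commonly computed, assuming 8-bit pixel values (peak 255) stored as flat sequences. The function names are illustrative and are not taken from the HIIF code, whose evaluation protocol (color space, border cropping) is not described in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.17 dB corresponds to an MSE roughly 3.8% lower at the same peak value (`mse_ratio_for_gain(0.17)` is about 0.962).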

Title: Multi-view Image Diffusion via Coordinate Noise and Fourier Attention

Authors: Justin Theiss, Norman Müller, Daeil Kim, Aayush Prakash
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03756
Pdf URL: https://arxiv.org/pdf/2412.03756
Copy Paste: [[2412.03756]] Multi-view Image Diffusion via Coordinate Noise and Fourier Attention(https://arxiv.org/abs/2412.03756)
Keywords: generation
Abstract: Recently, text-to-image generation with diffusion models has made significant advancements in both fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism, together with a novel noise initialization technique and a cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.

Title: Advancing Auto-Regressive Continuation for Video Frames

Authors: Ruibo Ming, Jingwei Wu, Zhewei Huang, Zhuoxuan Ju, Jianming HU, Lihui Peng, Shuchang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03758
Pdf URL: https://arxiv.org/pdf/2412.03758
Copy Paste: [[2412.03758]] Advancing Auto-Regressive Continuation for Video Frames(https://arxiv.org/abs/2412.03758)
Keywords: generation
Abstract: Recent advances in auto-regressive large language models (LLMs) have shown their potential in generating high-quality text, inspiring researchers to apply them to image and video generation. This paper explores the application of LLMs to video continuation, a task essential for building world models and predicting future frames. In this paper, we tackle challenges including preventing degeneration in long-term frame generation and enhancing the quality of generated images. We design a scheme named ARCON, which involves training our model to alternately generate semantic tokens and RGB tokens, enabling the LLM to explicitly learn and predict the high-level structural information of the video. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance the visual quality of the generated videos. Quantitative and qualitative experiments in autonomous driving scenarios demonstrate our model can consistently generate long videos.

Title: Coordinate In and Value Out: Training Flow Transformers in Ambient Space

Authors: Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03791
Pdf URL: https://arxiv.org/pdf/2412.03791
Copy Paste: [[2412.03791]] Coordinate In and Value Out: Training Flow Transformers in Ambient Space(https://arxiv.org/abs/2412.03791)
Keywords: generative
Abstract: Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (i.e., a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as specific data compressors are used for different data modalities. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learn flow matching transformers in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that using general purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

Title: Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

Authors: Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo, Yabiao Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03814
Pdf URL: https://arxiv.org/pdf/2412.03814
Copy Paste: [[2412.03814]] Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration(https://arxiv.org/abs/2412.03814)
Keywords: restoration
Abstract: Image Restoration aims to restore degraded images, with deep learning, especially CNNs and Transformers, enhancing performance. However, there's a lack of a unified training benchmark for IR. We identified a bias in image complexity between training and testing datasets, affecting restoration quality. To address this, we created ReSyn, a large-scale IR dataset with balanced complexity, including real and synthetic images. We also established a unified training standard for IR models. Our RWKV-IR model integrates linear complexity RWKV into transformers for global and local receptive fields. It replaces Q-Shift with Depth-wise Convolution for local dependencies and combines Bi-directional attention for global-local awareness. The Cross-Bi-WKV module balances horizontal and vertical attention. Experiments show RWKV-IR's effectiveness in image restoration.

Title: A large language model-type architecture for high-dimensional molecular potential energy surfaces

Authors: Xiao Zhu, Srinivasan S. Iyengar
Subjects: cs.LG, physics.atm-clus, physics.chem-ph, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2412.03831
Pdf URL: https://arxiv.org/pdf/2412.03831
Copy Paste: [[2412.03831]] A large language model-type architecture for high-dimensional molecular potential energy surfaces(https://arxiv.org/abs/2412.03831)
Keywords: generative
Abstract: Computing high-dimensional potential surfaces for molecular and materials systems is considered to be a great challenge in computational chemistry, with potential impact in a range of areas including fundamental prediction of reaction rates. In this paper we design and discuss an algorithm that has similarities to large language models in generative AI and natural language processing. Specifically, we represent a molecular system as a graph which contains a set of nodes, edges, faces etc. Interactions between these sets, which represent molecular subsystems in our case, are used to construct the potential energy surface for a reasonably sized chemical system with 51 dimensions. Essentially, a family of neural networks that pertain to the graph-based subsystems gets the job done for this 51-dimensional system. We then ask if this same family of lower-dimensional neural networks can be transformed to provide accurate predictions for a 186-dimensional potential surface. We find that our algorithm does provide reasonably accurate results for this higher-dimensional problem, with sub-kcal/mol accuracy.

Title: LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model

Authors: Yuan Xue, Qi Zhang, Chuanmin Jia, Shiqi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03841
Pdf URL: https://arxiv.org/pdf/2412.03841
Copy Paste: [[2412.03841]] LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model(https://arxiv.org/abs/2412.03841)
Keywords: restoration, quality assessment
Abstract: Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments using both full-reference and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over the state-of-the-art methods.

Title: Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer

Authors: Jayaprakash Sundararaj, Akhil Vyas, Benjamin Gonzalez-Maldonado
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.03853
Pdf URL: https://arxiv.org/pdf/2412.03853
Copy Paste: [[2412.03853]] Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer(https://arxiv.org/abs/2412.03853)
Keywords: generation
Abstract: Converting mathematical expressions into LaTeX is challenging. In this paper, we explore using newer transformer-based architectures for addressing the problem of converting handwritten/digital mathematical expression images into equivalent LaTeX code. We use the current state-of-the-art CNN encoder and RNN decoder as a baseline for our experiments. We also investigate improvements to the CNN-RNN architecture by replacing the CNN encoder with the ResNet50 model. Our experiments show that transformer architectures achieve higher overall accuracy and BLEU scores, along with lower Levenshtein scores, than the baseline CNN/RNN architecture, with room to achieve even better results through appropriate fine-tuning of model parameters.
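
Note on the metric: the abstract above scores predicted LaTeX with Levenshtein (edit) distance, where lower is better. A minimal sketch of the standard dynamic-programming computation follows. It works on any sequences, so it can compare raw strings or lists of LaTeX tokens; how the paper tokenizes or normalizes its outputs is not stated in the abstract, and the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]
```

For example, comparing the token lists `["\\frac", "{", "a", "}"]` and `["\\frac", "{", "b", "}"]` gives a distance of 1, since a single token substitution separates them.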

Title: CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Authors: Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03859
Pdf URL: https://arxiv.org/pdf/2412.03859
Copy Paste: [[2412.03859]] CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation(https://arxiv.org/abs/2412.03859)
Keywords: generation
Abstract: Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at this https URL.

Title: Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Authors: Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03876
Pdf URL: https://arxiv.org/pdf/2412.03876
Copy Paste: [[2412.03876]] Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization(https://arxiv.org/abs/2412.03876)
Keywords: generation
Abstract: Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.

Title: DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism

Authors: Sudha Krishnamurthy, Vimal Bhat, Abhinav Jain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03878
Pdf URL: https://arxiv.org/pdf/2412.03878
Copy Paste: [[2412.03878]] DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism(https://arxiv.org/abs/2412.03878)
Keywords: generation, generative
Abstract: The proliferation of several streaming services in recent years has now made it possible for a diverse audience across the world to view the same media content, such as movies or TV shows. While translation and dubbing services are being added to make content accessible to the local audience, the support for making content accessible to people with different abilities, such as the Deaf and Hard of Hearing (DHH) community, is still lagging. Our goal is to make media content more accessible to the DHH community by generating sign language videos with synthetic signers that are realistic and expressive. Using the same signer for a given media content that is viewed globally may have limited appeal. Hence, our approach combines parametric modeling and generative modeling to generate realistic-looking synthetic signers and customize their appearance based on user preferences. We first retarget human sign language poses to 3D sign language avatars by optimizing a parametric model. The high-fidelity poses from the rendered avatars are then used to condition the poses of synthetic signers generated using a diffusion-based generative model. The appearance of the synthetic signer is controlled by an image prompt supplied through a visual adapter. Our results show that the sign language videos generated using our approach have better temporal consistency and realism than signing videos generated by a diffusion model conditioned only on text prompts. We also support multimodal prompts to allow users to further customize the appearance of the signer to accommodate diversity (e.g. skin tone, gender). Our approach is also useful for signer anonymization.

Title: A Noise is Worth Diffusion Guidance

Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.03895
Pdf URL: https://arxiv.org/pdf/2412.03895
Copy Paste: [[2412.03895]] A Noise is Worth Diffusion Guidance(https://arxiv.org/abs/2412.03895)
Keywords: generation
Abstract: Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory usage. Expanding on this, we propose a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: this https URL.
摘要：扩散模型在生成高质量图像方面表现出色。然而，目前的扩散模型很难在没有指导方法（如无分类器指导 (CFG)）的情况下生成可靠的图像。指导方法真的有必要吗？观察到通过扩散反演获得的噪声可以在没有指导的情况下重建高质量图像，我们将重点放在去噪管道的初始噪声上。通过将高斯噪声映射到“无指导噪声”，我们发现小的低幅度低频分量显著增强了去噪过程，消除了对指导的需求，从而提高了推理吞吐量和内存。在此基础上，我们提出了一种新方法，用对初始噪声的单一细化来取代指导方法。这种精炼噪声可以在同一个扩散管道内无需指导的情况下生成高质量图像。我们的噪声精炼模型利用高效的噪声空间学习，仅使用 50K 个文本图像对即可实现快速收敛和强大的性能。我们在各种指标上验证了它的有效性，并分析了精炼噪声如何消除对指导的需求。请参阅我们的项目页面：此 https URL。
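
For readers unfamiliar with the guidance step this abstract proposes to remove, the following is a minimal sketch of the standard classifier-free guidance combination, not the paper's method. The function name `cfg_combine` and the toy arrays are assumptions for illustration. In a real sampler the two predictions come from two separate network evaluations per denoising step, which is why dropping guidance can raise throughput.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance rule: start from the unconditional
    noise prediction and extrapolate toward the conditional one."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-ins for the two network outputs at one denoising step (hypothetical values).
eps_u = np.array([0.0, 1.0, -1.0])
eps_c = np.array([1.0, 1.0, 0.0])

guided = cfg_combine(eps_u, eps_c, guidance_scale=7.5)
```

With `guidance_scale=1.0` the rule reduces to the conditional prediction alone, so a single forward pass per step would suffice.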

Title: Multi-View Pose-Agnostic Change Localization with Zero Labels

Authors: Chamuditha Jayanga Galappaththige, Jason Lai, Lloyd Windrim, Donald Dansereau, Niko Suenderhauf, Dimity Miller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03911
Pdf URL: https://arxiv.org/pdf/2412.03911
Copy Paste: [[2412.03911]] Multi-View Pose-Agnostic Change Localization with Zero Labels(https://arxiv.org/abs/2412.03911)
Keywords: generation
Abstract: Autonomous agents often require accurate methods for detecting and localizing changes in their environment, particularly when observations are captured from unconstrained and inconsistent viewpoints. We propose a novel label-free, pose-agnostic change detection method that integrates information from multiple viewpoints to construct a change-aware 3D Gaussian Splatting (3DGS) representation of the scene. With as few as 5 images of the post-change scene, our approach can learn additional change channels in a 3DGS and produce change masks that outperform single-view techniques. Our change-aware 3D scene representation additionally enables the generation of accurate change masks for unseen viewpoints. Experimental results demonstrate state-of-the-art performance in complex multi-object scenes, achieving a 1.7$\times$ and 1.6$\times$ improvement in Mean Intersection Over Union and F1 score respectively over other baselines. We also contribute a new real-world dataset to benchmark change detection in diverse challenging scenes in the presence of lighting variations.
摘要：自主代理通常需要精确的方法来检测和定位其环境中的变化，特别是当从不受约束和不一致的视点捕获观察结果时。我们提出了一种新颖的无标签、与姿势无关的变化检测方法，该方法整合了来自多个视点的信息，以构建场景的变化感知 3D 高斯分层 (3DGS) 表示。只需 5 张变化后场景的图像，我们的方法就可以在 3DGS 中学习额外的变化通道并生成优于单视图技术的变化蒙版。我们的变化感知 3D 场景表示还可以为看不见的视点生成准确的变化蒙版。实验结果证明了在复杂的多物体场景中的最佳性能，与其他基线相比，平均交集和 F1 得分分别提高了 1.7$\times$ 和 1.6$\times$。我们还贡献了一个新的真实世界数据集，以在存在光照变化的情况下对各种具有挑战性的场景中的变化检测进行基准测试。
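
The two metrics quoted in this abstract can be computed for binary change masks roughly as follows. This is a sketch under common definitions (IoU = TP / (TP + FP + FN), F1 = 2TP / (2TP + FP + FN)); the paper's exact evaluation protocol is not given here, and the helper names and toy masks are assumptions.

```python
import numpy as np

def iou_and_f1(pred, target):
    """Intersection over Union and F1 score for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # changed pixels found
    fp = np.logical_and(pred, ~target).sum()   # false alarms
    fn = np.logical_and(~pred, target).sum()   # missed changes
    union = tp + fp + fn
    iou = tp / union if union > 0 else 1.0     # both masks empty: perfect match
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 1.0
    return float(iou), float(f1)

def mean_iou(preds, targets):
    """Average the per-image IoU over a list of mask pairs."""
    return float(np.mean([iou_and_f1(p, t)[0] for p, t in zip(preds, targets)]))

# Toy 2x3 masks (hypothetical data): 2 true positives, 1 false positive, 1 false negative.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
iou, f1 = iou_and_f1(pred_mask, true_mask)
```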

Title: Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications

Authors: Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew, Hui Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03924
Pdf URL: https://arxiv.org/pdf/2412.03924
Copy Paste: [[2412.03924]] Privacy-Preserving in Medical Image Analysis: A Review of Methods and Applications(https://arxiv.org/abs/2412.03924)
Keywords: generative
Abstract: With the rapid advancement of artificial intelligence and deep learning, medical image analysis has become a critical tool in modern healthcare, significantly improving diagnostic accuracy and efficiency. However, AI-based methods also raise serious privacy concerns, as medical images often contain highly sensitive patient information. This review offers a comprehensive overview of privacy-preserving techniques in medical image analysis, including encryption, differential privacy, homomorphic encryption, federated learning, and generative adversarial networks. We explore the application of these techniques across various medical image analysis tasks, such as diagnosis, pathology, and telemedicine. Notably, we organize the review based on specific challenges and their corresponding solutions in different medical image analysis applications, so that technical applications are directly aligned with practical issues, addressing gaps in the current research landscape. Additionally, we discuss emerging trends, such as zero-knowledge proofs and secure multi-party computation, offering insights for future research. This review serves as a valuable resource for researchers and practitioners and can help advance privacy preservation in medical image analysis.
摘要：随着人工智能和深度学习的快速发展，医学图像分析已成为现代医疗保健的重要工具，大大提高了诊断的准确性和效率。然而，基于人工智能的方法也引发了严重的隐私问题，因为医学图像通常包含高度敏感的患者信息。本综述全面概述了医学图像分析中的隐私保护技术，包括加密、差分隐私、同态加密、联邦学习和生成对抗网络。我们探索了这些技术在各种医学图像分析任务中的应用，例如诊断、病理学和远程医疗。值得注意的是，我们根据不同医学图像分析应用中的具体挑战及其相应的解决方案来组织综述，以便技术应用与实际问题直接保持一致，解决当前研究领域的空白。此外，我们还讨论了零知识证明和安全多方计算等新兴趋势，为未来的研究提供了见解。本综述是研究人员和从业人员的宝贵资源，有助于推进医学图像分析中的隐私保护。

Title: InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Authors: Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, Jiahui Huang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.03934
Pdf URL: https://arxiv.org/pdf/2412.03934
Copy Paste: [[2412.03934]] InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models(https://arxiv.org/abs/2412.03934)
Keywords: generation, generative
Abstract: We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.
摘要：我们提出了一种可扩展的方法 InfiniCube，用于生成具有高保真度和可控性的无界动态 3D 驾驶场景。以前的场景生成方法要么规模有限，要么在生成的序列中缺乏几何和外观一致性。相比之下，我们利用可扩展 3D 表示和视频模型的最新进展来实现大型动态场景生成，允许通过高清地图、车辆边界框和文本描述进行灵活控制。首先，我们构建了一个地图条件下的基于稀疏体素的 3D 生成模型，以释放其在无界体素世界生成中的能力。然后，我们重新利用视频模型，并通过一组精心设计的像素对齐引导缓冲区将其固定在体素世界中，从而合成一致的外观。最后，我们提出了一种快速前馈方法，该方法同时使用体素和像素分支将动态视频提升为具有可控对象的动态 3D 高斯。我们的方法可以生成可控且逼真的 3D 驾驶场景，大量实验验证了我们模型的有效性和优越性。

Title: AIpparel: A Large Multimodal Generative Model for Digital Garments

Authors: Kiyohiro Nakayama, Jan Ackermann, Timur Levent Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas J. Guibas, Guandao Yang, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.03937
Pdf URL: https://arxiv.org/pdf/2412.03937
Copy Paste: [[2412.03937]] AIpparel: A Large Multimodal Generative Model for Digital Garments(https://arxiv.org/abs/2412.03937)
Keywords: generation, generative
Abstract: Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at this http URL.
摘要：服装是人类生活中必不可少的物品，它提供保护、反映文化身份并展示个人风格。然而，服装的制作仍然是一个耗时的过程，这主要是因为设计服装需要手工劳动。为了简化这个过程，我们引入了 AIpparel，这是一个用于生成和编辑缝纫图案的大型多模态模型。我们的模型在一个定制的大型数据集上对最先进的大型多模态模型 (LMM) 进行了微调，该数据集包含超过 120,000 件独特的服装，每件服装都带有多模态注释，包括文本、图像和缝纫图案。此外，我们提出了一种新颖的标记化方案，可以简洁地编码这些复杂的缝纫图案，以便 LLM 可以学习有效地预测它们。 \methodname 在单模态任务（包括文本到服装和图像到服装预测）中实现了最先进的性能，并支持新颖的多模态服装生成应用，例如交互式服装编辑。项目网站位于此 http URL。

Title: A Framework For Image Synthesis Using Supervised Contrastive Learning

Authors: Yibin Liu, Jianyu Zhang, Li Zhang, Shijian Li, Gang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.03957
Pdf URL: https://arxiv.org/pdf/2412.03957
Copy Paste: [[2412.03957]] A Framework For Image Synthesis Using Supervised Contrastive Learning(https://arxiv.org/abs/2412.03957)
Keywords: generation, generative
Abstract: Text-to-image (T2I) generation aims at producing realistic images corresponding to text descriptions. Generative Adversarial Networks (GANs) have proven successful in this task. Typical T2I GANs are two-phase methods that first pretrain an inter-modal representation from aligned image-text pairs and then use a GAN to train an image generator on that basis. However, such a representation ignores inner-modal semantic correspondence, e.g. between images with the same label. The semantic label describes, a priori, the inherent distribution pattern with underlying cross-image relationships, and thus supplements the text description in capturing the full characteristics of an image. In this paper, we propose a framework that leverages both inter- and inner-modal correspondence through label-guided supervised contrastive learning. We extend the T2I GANs to two parameter-sharing contrast branches in both the pretraining and generation phases. This integration effectively clusters the semantically similar image-text pair representations, thereby fostering the generation of higher-quality images. We demonstrate our framework on four novel T2I GANs using both the single-object dataset CUB and the multi-object dataset COCO, achieving significant improvements in the Inception Score (IS) and Frechet Inception Distance (FID) metrics for image generation evaluation. Notably, on the more complex multi-object COCO dataset, our framework improves FID by 30.1%, 27.3%, 16.2% and 17.1% for AttnGAN, DM-GAN, SSA-GAN and GALIP, respectively. We also validate our superiority by comparing with other label-guided T2I GANs. The results affirm the effectiveness and competitiveness of our approach in advancing the state-of-the-art GAN for T2I generation.
摘要：文本到图像 (T2I) 生成旨在生成与文本描述相对应的逼真图像。生成对抗网络 (GAN) 已被证明可成功完成此任务。典型的 T2I GAN 是两阶段方法，首先从对齐的图像-文本对中预训练模态间表示，然后在此基础上使用 GAN 训练图像生成器。然而，这种表示忽略了模态内语义对应关系，例如具有相同标签的图像。语义标签优先描述具有潜在跨图像关系的固有分布模式，这是对文本描述的补充，有助于理解图像的全部特征。在本文中，我们提出了一个通过标签引导的监督对比学习利用模态间和模态内对应的框架。我们将 T2I GAN 扩展为预训练和生成阶段的两个参数共享对比分支。这种集成有效地对语义相似的图像-文本对表示进行聚类，从而促进生成更高质量的图像。我们通过单对象数据集 CUB 和多对象数据集 COCO 在四种新型 T2I GAN 上展示了我们的框架，在图像生成评估的初始分数 (IS) 和 Frechet 初始距离 (FID) 指标上取得了显著的改进。值得注意的是，在更复杂的多对象 COCO 上，我们的框架分别将 AttnGAN、DM-GAN、SSA-GAN 和 GALIP 的 FID 提高了 30.1%、27.3%、16.2% 和 17.1%。我们还通过与其他标签引导的 T2I GAN 进行比较来验证我们的优势。结果证实了我们的方法在推进最先进的 T2I 生成 GAN 方面的有效性和竞争力

Title: Local Curvature Smoothing with Stein's Identity for Efficient Score Matching

Authors: Genki Osada, Makoto Shing, Takashi Nishide
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.03962
Pdf URL: https://arxiv.org/pdf/2412.03962
Copy Paste: [[2412.03962]] Local Curvature Smoothing with Stein's Identity for Efficient Score Matching(https://arxiv.org/abs/2412.03962)
Keywords: generation
Abstract: The training of score-based diffusion models (SDMs) is based on score matching. The challenge of score matching is that it includes a computationally expensive Jacobian trace. While several methods have been proposed to avoid this computation, each has drawbacks, such as instability during training and approximating the learning as learning a denoising vector field rather than a true score. We propose a novel score matching variant, local curvature smoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by applying Stein's identity, enabling regularization effectiveness and efficient computation. We show that LCSS surpasses existing methods in sample generation performance and matches the performance of denoising score matching, widely adopted by most SDMs, in evaluations such as FID, Inception score, and bits per dimension. Furthermore, we show that LCSS enables realistic image generation even at a high resolution of $1024 \times 1024$.
摘要：基于分数的扩散模型 (SDM) 的训练基于分数匹配。分数匹配的挑战在于它包括计算成本高昂的雅可比行列式迹。虽然已经提出了几种方法来避免这种计算，但每种方法都有缺点，例如训练期间不稳定以及将学习近似为学习去噪矢量场而不是真实分数。我们提出了一种新颖的分数匹配变体，即带有 Stein 恒等式 (LCSS) 的局部曲率平滑。LCSS 通过应用 Stein 恒等式绕过雅可比行列式迹，从而实现正则化有效性和高效计算。我们表明，LCSS 在样本生成性能方面超越了现有方法，并且在 FID、Inception 分数和每维位数等评估中与大多数 SDM 广泛采用的去噪分数匹配的性能相匹配。此外，我们表明，即使在 $1024 \times 1024$ 的高分辨率下，LCSS 也能生成逼真的图像。
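
The identity this abstract relies on can be checked numerically. For Gaussian noise eps ~ N(0, I), Stein's identity gives E[eps^T g(x + sigma * eps)] = sigma * E[tr(J_g(x + sigma * eps))], so a Jacobian trace can be estimated from function evaluations alone. The sketch below only illustrates that identity on a linear map g(x) = Ax, whose Jacobian trace is exactly tr(A). It is not the LCSS objective itself, and the function name, matrix, and sample count are assumptions.

```python
import numpy as np

def stein_trace_estimate(g, x, sigma, num_samples, rng):
    """Monte Carlo estimate of E[tr(J_g(x + sigma*eps))] via Stein's identity,
    using only evaluations of g (no explicit Jacobian)."""
    eps = rng.standard_normal((num_samples, x.shape[0]))
    values = g(x + sigma * eps)                  # shape: (num_samples, d)
    return float(np.mean(np.sum(eps * values, axis=1)) / sigma)

# Hypothetical linear vector field g(x) = A x, applied row-wise to a batch.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0],
              [3.0, 0.0, 0.5]])

def g(points):
    return points @ A.T

x = np.array([0.1, -0.2, 0.3])
rng = np.random.default_rng(0)
estimate = stein_trace_estimate(g, x, sigma=1.0, num_samples=200_000, rng=rng)
exact = float(np.trace(A))                       # Jacobian of g is A everywhere
```

The estimate is noisy, so it should be compared with `exact` only up to Monte Carlo error.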

Title: Blind Underwater Image Restoration using Co-Operational Regressor Networks

Authors: Ozer Can Devecioglu, Serkan Kiranyaz, Turker Ince, Moncef Gabbouj
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.03995
Pdf URL: https://arxiv.org/pdf/2412.03995
Copy Paste: [[2412.03995]] Blind Underwater Image Restoration using Co-Operational Regressor Networks(https://arxiv.org/abs/2412.03995)
Keywords: restoration
Abstract: The exploration of underwater environments is essential for applications such as biological research, archaeology, and infrastructure maintenance. However, underwater imaging is challenging due to water's unique properties, including scattering, absorption, color distortion, and reduced visibility. To address such visual degradations, a variety of approaches have been proposed, ranging from basic signal processing methods to deep learning models; however, none of them has proven to be consistently successful. In this paper, we propose a novel machine learning model, Co-Operational Regressor Networks (CoRe-Nets), designed to achieve the best possible underwater image restoration. A CoRe-Net consists of two co-operating networks: the Apprentice Regressor (AR), responsible for image transformation, and the Master Regressor (MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images generated by the AR and feeds it back to the AR. CoRe-Nets are built on Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior learning capability by modulating nonlinearity in kernel transformations. The effectiveness of the proposed model is demonstrated on the benchmark Large Scale Underwater Image (LSUI) dataset. Leveraging the joint learning capabilities of the two cooperating networks, the proposed model achieves state-of-the-art restoration performance with significantly reduced computational complexity and, with a 2-pass application, often produces results that even surpass the visual quality of the ground truth. Our results and the optimized PyTorch implementation of the proposed approach are now publicly shared on GitHub.
摘要：水下环境的探索对于生物研究、考古学和基础设施维护等应用至关重要。然而，由于水的独特特性，包括散射、吸收、颜色失真和能见度降低，水下成像具有挑战性。为了解决这种视觉退化问题，已经提出了各种方法，从基本的信号处理方法到深度学习模型；然而，没有一种方法被证明能够持续成功。在本文中，我们提出了一种新颖的机器学习模型——协同操作回归网络 (CoRe-Nets)，旨在实现最佳的水下图像恢复。CoRe-Net 由两个协同网络组成：负责图像转换的学徒回归器 (AR) 和评估 AR 生成的图像的峰值信噪比 (PSNR) 并将其反馈给 AR 的主回归器 (MR)。CoRe-Nets 建立在自组织操作神经网络 (Self-ONNs) 之上，通过调节核变换中的非线性来提供卓越的学习能力。所提模型的有效性在基准大规模水下图像 (LSUI) 数据集上得到证明。利用两个协作网络的联合学习能力，所提模型实现了最先进的恢复性能，同时显著降低了计算复杂度，并且通常呈现出的结果甚至可以超越使用 2 遍应用的真实图像的视觉质量。我们的结果和所提方法的优化 PyTorch 实现现已在 GitHub 上公开分享。
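
The quantity the Master Regressor is described as evaluating, PSNR, has a standard closed form: 10 * log10(MAX^2 / MSE). A minimal sketch follows, assuming images scaled to [0, 1]; the function name and toy images are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a restored image."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example (hypothetical data): a constant error of 0.1 gives MSE = 0.01.
clean = np.zeros((4, 4))
noisy = np.full((4, 4), 0.1)
score = psnr(clean, noisy)
```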

Title: IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation

Authors: Sejong Yang, Seoung Wug Oh, Yang Zhou, Seon Joo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04000
Pdf URL: https://arxiv.org/pdf/2412.04000
Copy Paste: [[2412.04000]] IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation(https://arxiv.org/abs/2412.04000)
Keywords: generation, generative
Abstract: We introduce a novel approach for high-resolution talking head generation from a single image and audio input. Prior methods using explicit face models, like 3D morphable models (3DMM) and facial landmarks, often fall short in generating high-fidelity videos due to their lack of appearance-aware motion representation. While generative approaches such as video diffusion models achieve high video quality, their slow processing speeds limit practical application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM), employs implicit motion to encode human faces into appearance-aware compressed facial latents, enhancing video generation. Although implicit motion lacks the spatial disentanglement of explicit models, which complicates alignment with subtle lip movements, we introduce motion statistics to help capture fine-grained motion information. Additionally, our model provides motion controllability to optimize the trade-off between motion intensity and visual quality during inference. IF-MDM supports real-time generation of 512x512 resolution videos at up to 45 frames per second (fps). Extensive evaluations demonstrate its superior performance over existing diffusion and explicit face models. The code will be released publicly, available alongside supplementary materials. The video results can be found on this https URL.
摘要：我们介绍了一种通过单个图像和音频输入生成高分辨率说话头部的新方法。之前使用显式面部模型（如 3D 可变形模型 (3DMM) 和面部标志）的方法由于缺乏外观感知运动表示而无法生成高保真视频。虽然视频扩散模型等生成方法可以实现高视频质量，但其缓慢的处理速度限制了实际应用。我们提出的模型隐式面部运动扩散模型 (IF-MDM) 采用隐式运动将人脸编码为外观感知的压缩面部潜伏信息，从而增强了视频生成。虽然隐式运动缺乏显式模型的空间解缠，这会使与细微唇部运动的对齐变得复杂，但我们引入了运动统计数据来帮助捕获细粒度的运动信息。此外，我们的模型提供运动可控性，以优化推理过程中运动强度和视觉质量之间的权衡。IF-MDM 支持以高达每秒 45 帧 (fps) 的速度实时生成 512x512 分辨率的视频。大量评估表明，它的性能优于现有的扩散和显式人脸模型。代码将与补充材料一起公开发布。视频结果可在此 https URL 上找到。

Title: PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

Authors: Kangan Qian, Xinyu Jiao, Yining Shi, Yunlong Wang, Ziang Luo, Zheng Fu, Kun Jiang, Diange Yang
Subjects: cs.CV, cs.PF, cs.RO
Abstract URL: https://arxiv.org/abs/2412.04020
Pdf URL: https://arxiv.org/pdf/2412.04020
Copy Paste: [[2412.04020]] PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors(https://arxiv.org/abs/2412.04020)
Keywords: generative
Abstract: Reliable perception of spatial and motion information is crucial for safe autonomous navigation. Traditional approaches typically fall into two categories: object-centric and class-agnostic methods. While object-centric methods often struggle with missed detections, leading to inaccuracies in motion prediction, many class-agnostic methods focus heavily on encoder design, often overlooking important priors like rigidity and temporal consistency, leading to suboptimal performance, particularly with sparse LiDAR data in distant regions. To address these issues, we propose PriorMotion, a generative framework that extracts rasterized and vectorized scene representations to model spatio-temporal priors. Our model comprises a BEV encoder, a Raster-Vector prior Encoder, and a Spatio-Temporal prior Generator, improving both spatial and temporal consistency in motion prediction. Additionally, we introduce a standardized evaluation protocol for class-agnostic motion prediction. Experiments on the nuScenes dataset show that PriorMotion achieves state-of-the-art performance, with further validation on advanced FMCW LiDAR confirming its robustness.
摘要：可靠的空间和运动信息感知对于安全的自主导航至关重要。传统方法通常分为两类：以对象为中心和与类别无关的方法。虽然以对象为中心的方法通常会错过检测，从而导致运动预测不准确，但许多与类别无关的方法主要侧重于编码器设计，通常忽略了刚性和时间一致性等重要先验，导致性能不佳，尤其是在远距离区域的稀疏 LiDAR 数据中。为了解决这些问题，我们提出了 $\textbf{PriorMotion}$，这是一个生成框架，可提取栅格化和矢量化的场景表示来模拟时空先验。我们的模型包括一个 BEV 编码器、一个栅格矢量先验编码器和一个时空先验生成器，可提高运动预测的空间和时间一致性。此外，我们引入了一个用于与类别无关的运动预测的标准化评估协议。在 nuScenes 数据集上进行的实验表明，PriorMotion 实现了最先进的性能，并且在先进的 FMCW LiDAR 上进行的进一步验证证实了其稳健性。

Title: INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Authors: Yongming Zhu, Longhao Zhang, Zhengkun Rong, Tianshu Hu, Shuang Liang, Zhipeng Ge
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04037
Pdf URL: https://arxiv.org/pdf/2412.04037
Copy Paste: [[2412.04037]] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations(https://arxiv.org/abs/2412.04037)
Keywords: generation
Abstract: Imagine having a conversation with a socially intelligent agent. It can attentively listen to your words and offer visual and linguistic feedback promptly. This seamless interaction allows for multiple rounds of conversation to flow smoothly and naturally. In pursuit of actualizing it, we propose INFP, a novel audio-driven head generation framework for dyadic interaction. Unlike previous head generation works that only focus on single-sided communication, or require manual role assignment and explicit role switching, our model drives the agent portrait to alternate dynamically between speaking and listening states, guided by the input dyadic audio. Specifically, INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. The first stage learns to project facial communicative behaviors from real-life conversation videos into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image. The second stage learns the mapping from the input dyadic audio to motion latent codes through denoising, leading to audio-driven head generation in interactive scenarios. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet. Extensive experiments and visualizations demonstrate the superior performance and effectiveness of our method. Project Page: this https URL.
摘要：想象一下与一个社交智能代理进行对话。它可以专心聆听您的话语并及时提供视觉和语言反馈。这种无缝交互使多轮对话顺畅自然地进行。为了实现它，我们提出了 INFP，一种用于二元交互的新型音频驱动头部生成框架。与以前的头部生成工作不同，这些工作只关注单边通信，或需要手动角色分配和明确角色切换，我们的模型驱动代理肖像在输入二元音频的引导下动态地在说话和聆听状态之间交替。具体而言，INFP 包括基于运动的头部模仿阶段和音频引导的运动生成阶段。第一阶段学习将现实生活中的对话视频中的面部交流行为投射到低维运动潜在空间中，并使用运动潜在代码为静态图像制作动画。第二阶段通过去噪学习从输入二元音频到运动潜在代码的映射，从而实现交互场景中的音频驱动头部生成。为了促进这一研究方向，我们引入了 DyConv，这是一个从互联网收集的大规模丰富二元对话数据集。大量实验和可视化证明了我们方法的卓越性能和有效性。项目页面：此 https URL。

Title: ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality

Authors: Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04062
Pdf URL: https://arxiv.org/pdf/2412.04062
Copy Paste: [[2412.04062]] ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality(https://arxiv.org/abs/2412.04062)
Keywords: generation
Abstract: In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating auto-regressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling the "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
摘要：在本文中，我们提出了 ZipAR，这是一种无需训练、即插即用的并行解码框架，用于加速自回归 (AR) 视觉生成。其动机源于以下观察：图像表现出局部结构，而空间上较远的区域往往具有最小的相互依赖性。给定一组部分解码的视觉标记，除了行维度中原始的下一个标记预测方案外，还可以并行解码列维度中与空间相邻区域相对应的标记，从而实现“下一组预测”范式。通过在一次前向传递中同时解码多个标记，生成图像所需的前向传递次数显著减少，从而大幅提高生成效率。实验表明，ZipAR 可以在 Emu3-Gen 模型上将模型前向传递次数减少高达 91%，而无需任何额外的再训练。
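
To see why decoding adjacent rows in parallel cuts the number of forward passes, consider a simplified schedule, assumed here for illustration and not taken from the paper: each row is decoded left to right, and a row may start once the row above it is `lag` tokens ahead. Token (r, c) is then decoded at pass r * lag + c + 1, so the whole grid needs (height - 1) * lag + width passes instead of height * width. All names below are hypothetical.

```python
def raster_passes(height, width):
    """Plain next-token decoding: one forward pass per token."""
    return height * width

def staggered_schedule(height, width, lag):
    """Forward pass (1-indexed) at which each token is decoded when every row
    trails the row above it by `lag` tokens (simplified, assumed schedule)."""
    if not 1 <= lag <= width:
        raise ValueError("lag must be between 1 and width")
    return [[r * lag + c + 1 for c in range(width)] for r in range(height)]

def staggered_passes(height, width, lag):
    """Total passes under the staggered schedule: the pass of the last token."""
    return (height - 1) * lag + width

# Toy 4x4 token grid with a lag of 2 tokens between consecutive rows.
schedule = staggered_schedule(4, 4, lag=2)
baseline = raster_passes(4, 4)
parallel = staggered_passes(4, 4, lag=2)
```

Setting `lag` equal to `width` recovers plain raster-order decoding, so the lag acts as the knob between full sequential decoding and maximal row overlap in this toy model.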

Title: Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning

Authors: Shicheng Zhou, Jingju Liu, Yuliang Lu, Jiahai Yang, Yue Zhang, Jie Chen
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2412.04078
Pdf URL: https://arxiv.org/pdf/2412.04078
Copy Paste: [[2412.04078]] Towards Generalizable Autonomous Penetration Testing via Domain Randomization and Meta-Reinforcement Learning(https://arxiv.org/abs/2412.04078)
Keywords: generation
Abstract: With increasing numbers of vulnerabilities exposed on the internet, autonomous penetration testing (pentesting) has become an emerging research area, and reinforcement learning (RL) is a natural fit for studying it. Previous research in RL-based autonomous pentesting mainly focused on enhancing agents' learning efficacy within abstract simulated training environments. It overlooked the applicability and generalization requirements of deploying agents' policies in real-world environments that differ substantially from their training settings. In contrast, for the first time, we shift focus to the pentesting agents' ability to generalize across unseen real environments. For this purpose, we propose a Generalizable Autonomous Pentesting framework (namely GAP) for training agents capable of drawing inferences from one to another -- a key requirement for the broad application of autonomous pentesting and a hallmark of human intelligence. GAP introduces a Real-to-Sim-to-Real pipeline with two key methods: domain randomization and meta-RL. Specifically, we are among the first to apply domain randomization in autonomous pentesting and propose a large language model-powered domain randomization method for synthetic environment generation. We further apply meta-RL to improve the agents' generalization ability in unseen environments by leveraging the synthetic environments. The combination of these two methods can effectively bridge the generalization gap and improve policy adaptation performance. Experiments are conducted on various vulnerable virtual machines, with results showing that GAP can (a) enable policy learning in unknown real environments, (b) achieve zero-shot policy transfer in similar environments, and (c) realize rapid policy adaptation in dissimilar environments.
摘要：随着互联网上暴露的漏洞越来越多，自主渗透测试 (pentesting) 已成为一个新兴的研究领域，而强化学习 (RL) 是研究自主渗透测试的天然选择。先前基于 RL 的自主渗透测试研究主要集中于提高代理在抽象模拟训练环境中的学习效率。他们忽视了在与训练设置有很大不同的现实环境中部署代理策略的适用性和泛化要求。相反，我们首次将重点转移到渗透测试代理在看不见的真实环境中泛化的能力。为此，我们提出了一个可泛化的自主渗透测试框架 (即 GAP)，用于训练能够从一个环境推断到另一个环境的代理——这是广泛应用自主渗透测试的关键要求和人类智能的标志。GAP 引入了 Real-to-Sim-to-Real 管道，其中包含两种关键方法：域随机化和元 RL 学习。具体来说，我们是第一批将领域随机化应用于自主渗透测试的人之一，并提出了一种基于大型语言模型的领域随机化方法用于合成环境生成。我们进一步应用元强化学习，利用合成环境来提高代理在未见过环境中的泛化能力。这两种方法的结合可以有效地弥合泛化差距并提高策略适应性能。在各种易受攻击的虚拟机上进行了实验，结果表明 GAP 可以 (a) 在未知的真实环境中实现策略学习，(b) 在类似环境中实现零样本策略转移，以及 (c) 在异类环境中实现快速策略适应。

Title: BodyMetric: Evaluating the Realism of HumanBodies in Text-to-Image Generation

Authors: Nefeli Andreou, Varsha Vivek, Ying Wang, Alex Vorobiov, Tiffany Deng, Raja Bala, Larry Davis, Betty Mohler Tesch
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04086
Pdf URL: https://arxiv.org/pdf/2412.04086
Copy Paste: [[2412.04086]] BodyMetric: Evaluating the Realism of HumanBodies in Text-to-Image Generation(https://arxiv.org/abs/2412.04086)
Keywords: generation
Abstract: Accurately generating images of human bodies from text remains a challenging problem for state-of-the-art text-to-image models. Commonly observed body-related artifacts include extra or missing limbs, unrealistic poses, blurred body parts, etc. Currently, evaluation of such artifacts relies heavily on time-consuming human judgments, limiting the ability to benchmark models at scale. We address this by proposing BodyMetric, a learnable metric that predicts body realism in images. BodyMetric is trained on realism labels and multi-modal signals including 3D body representations inferred from the input image, and textual descriptions. To facilitate this approach, we design an annotation pipeline to collect expert ratings on human body realism, leading to a new dataset for this task, namely, BodyRealism. Ablation studies support our architectural choices for BodyMetric and the importance of leveraging a 3D human body prior in capturing body-related artifacts in 2D images. In comparison to concurrent metrics which evaluate general user preference in images, BodyMetric specifically reflects body-related artifacts. We demonstrate the utility of BodyMetric through applications that were previously infeasible at scale. In particular, we use BodyMetric to benchmark the ability of text-to-image models to produce realistic human bodies. We also demonstrate the effectiveness of BodyMetric in ranking generated images based on the predicted realism scores.
摘要：准确地从文本生成人体图像对于最先进的文本到图像模型来说仍然是一个具有挑战性的问题。常见的与身体相关的伪影包括多余或缺失的肢体、不切实际的姿势、模糊的身体部位等。目前，对此类伪影的评估严重依赖于耗时的人类判断，限制了大规模基准测试模型的能力。我们通过提出 BodyMetric 来解决这个问题，这是一个可学习的指标，可以预测图像中的身体真实感。BodyMetric 是在真实感标签和多模态信号上进行训练的，包括从输入图像推断出的 3D 身体表示和文本描述。为了促进这种方法，我们设计了一个注释管道来收集专家对人体真实感的评分，从而为这项任务生成一个新的数据集，即 BodyRealism。消融研究支持我们对 BodyMetric 的架构选择以及利用 3D 人体先验在 2D 图像中捕捉与身体相关的伪影的重要性。与评估图像中一般用户偏好的并发指标相比，BodyMetric 特别反映了与身体相关的伪影。我们通过以前无法大规模实现的应用展示了 BodyMetric 的实用性。具体来说，我们使用 BodyMetric 来衡量文本转图像模型生成逼真人体的能力。我们还展示了 BodyMetric 在根据预测的真实度分数对生成的图像进行排名方面的有效性。

Title: LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04090
Pdf URL: https://arxiv.org/pdf/2412.04090
Copy Paste: [[2412.04090]] LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents(https://arxiv.org/abs/2412.04090)
Keywords: restoration, super-resolution
Abstract: We present the first loss agent, dubbed LossAgent, for low-level image processing tasks, e.g., image super-resolution and restoration, intending to achieve any customized optimization objectives of low-level image processing in different practical applications. Notably, not all optimization objectives, such as complex hand-crafted perceptual metrics, text description, and intricate human feedback, can be instantiated with existing low-level losses, e.g., MSE loss, which presents a crucial challenge in optimizing image processing networks in an end-to-end manner. To address this, our LossAgent introduces a powerful large language model (LLM) as the loss agent, where the rich textual understanding of prior knowledge empowers the loss agent with the potential to understand complex optimization objectives, trajectory, and state feedback from external environments in the optimization process of the low-level image processing networks. In particular, we establish the loss repository by incorporating existing loss functions that support end-to-end optimization for low-level image processing. Then, we design optimization-oriented prompt engineering for the loss agent to actively and intelligently decide the compositional weights for each loss in the repository at each optimization interaction, thereby achieving the required optimization trajectory for any customized optimization objectives. Extensive experiments on three typical low-level image processing tasks and multiple optimization objectives have shown the effectiveness and applicability of our proposed LossAgent. Code and pre-trained models will be available at this https URL.
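Summary: We propose LossAgent, the first loss agent for low-level image processing tasks such as image super-resolution and restoration, aiming to achieve any customized optimization objective for low-level image processing across practical applications. Notably, not all optimization objectives (for example, complex hand-crafted perceptual metrics, text descriptions, and intricate human feedback) can be instantiated with existing low-level losses such as MSE loss, which poses a key challenge for optimizing image processing networks end to end. To address this, LossAgent introduces a powerful large language model (LLM) as the loss agent, whose rich textual understanding of prior knowledge enables it to understand complex optimization objectives, trajectories, and state feedback from the external environment while the low-level image processing network is optimized. Specifically, we build a loss repository from existing loss functions that support end-to-end optimization for low-level image processing. We then design optimization-oriented prompt engineering so that the loss agent actively and intelligently decides the compositional weight of each loss in the repository at every optimization interaction, achieving the optimization trajectory required by any customized objective. Extensive experiments on three typical low-level image processing tasks and multiple optimization objectives demonstrate the effectiveness and applicability of LossAgent. Code and pre-trained models will be available at this https URL.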
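
The weighted composition of losses described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: `LOSS_REPOSITORY`, `composite_loss`, and `stub_agent` are illustrative stand-ins, the two losses are toy versions on flattened pixel lists, and the stub replaces the LLM that would choose the weights in the paper's setting.

```python
from typing import Callable, Dict, List

Image = List[float]  # flattened pixel values, for illustration only


def mse_loss(pred: Image, target: Image) -> float:
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred: Image, target: Image) -> float:
    """Mean absolute error between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical loss repository: name -> differentiable loss (toy versions here).
LOSS_REPOSITORY: Dict[str, Callable[[Image, Image], float]] = {
    "mse": mse_loss,
    "l1": l1_loss,
}


def composite_loss(pred: Image, target: Image, weights: Dict[str, float]) -> float:
    """Weighted sum of the repository losses named in `weights`."""
    return sum(w * LOSS_REPOSITORY[name](pred, target) for name, w in weights.items())


def stub_agent(step: int) -> Dict[str, float]:
    """Placeholder for the LLM agent: returns new weights at each interaction.

    Assumption: a fixed schedule stands in for the agent's decision.
    """
    return {"mse": 1.0, "l1": 0.0} if step < 100 else {"mse": 0.5, "l1": 0.5}


pred, target = [0.0, 0.0], [2.0, 0.0]
early = composite_loss(pred, target, stub_agent(step=0))
late = composite_loss(pred, target, stub_agent(step=200))
```

In a real training loop the weights would be refreshed only at each interaction with the agent, and the composite value would be backpropagated through the network.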

Title: MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities

Authors: Haoning Wu, Ziheng Zhao, Ya Zhang, Weidi Xie, Yanfeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04106
Pdf URL: https://arxiv.org/pdf/2412.04106
Copy Paste: [[2412.04106]] MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities(https://arxiv.org/abs/2412.04106)
Keywords: generation, generative
Abstract: Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: (i) we collect and curate a large-scale radiology image-text dataset, MedGen-1M, comprising modality labels, attributes, region, and organ information, along with a subset of organ mask annotations, to support research in controllable medical image generation; (ii) we propose a diffusion-based data engine, termed MRGen, which enables generation conditioned on text prompts and masks, synthesizing MR images for diverse modalities lacking mask annotations, to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, illustrating that our data engine can effectively synthesize training samples and extend MRI segmentation towards unannotated modalities.
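Summary: Medical image segmentation has recently made impressive progress with deep neural networks, yet heterogeneous modalities and scarce mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for using generative models in medical applications: controllably synthesizing data for unannotated modalities without requiring registered data pairs. Specifically, we make the following contributions: (i) we collect and curate MedGen-1M, a large-scale radiology image-text dataset that includes modality labels, attributes, region and organ information, and a subset of organ mask annotations, to support research on controllable medical image generation; (ii) we propose MRGen, a diffusion-based data engine that generates images conditioned on text prompts and masks, synthesizing MR images for diverse modalities that lack mask annotations in order to train segmentation models on unannotated modalities; (iii) we conduct extensive experiments across various modalities, showing that our data engine can effectively synthesize training samples and extend MRI segmentation to unannotated modalities.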

Title: Deep priors for satellite image restoration with accurate uncertainties

Authors: Biquard Maud, Marie Chabert, Florence Genin, Christophe Latry, Thomas Oberlin
Subjects: cs.CV, eess.IV, physics.optics
Abstract URL: https://arxiv.org/abs/2412.04130
Pdf URL: https://arxiv.org/pdf/2412.04130
Copy Paste: [[2412.04130]] Deep priors for satellite image restoration with accurate uncertainties(https://arxiv.org/abs/2412.04130)
Keywords: restoration, super-resolution
Abstract: Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, classically including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainty related to this restoration could be valuable by lowering the risk of hallucination and avoiding propagating these biases in downstream applications. Deep learning methods are now state-of-the-art for satellite image restoration. However, they require training a specific network for each sensor and do not provide the associated uncertainties. This paper proposes a generic method involving a single network to restore images from several sensors and a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder, estimating the uncertainty jointly in the latent and in the image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We conduct a comprehensive set of experiments on very high resolution simulated and real Pleiades images, asserting both the performance and robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared to direct inversion methods. In particular, VBLE-xz is a scalable method to get realistic posterior samples and accurate uncertainties, while SatDPIR represents a compelling alternative to direct inversion methods when uncertainty quantification is not required.
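Summary: Satellite optical images, when received on the ground, give a distorted view of the observed scene. They must be restored before use, typically through denoising, deblurring, and sometimes super-resolution. Quantifying the uncertainty of this restoration can also be valuable, because it lowers the risk of hallucination and avoids propagating these biases into downstream applications. Deep learning methods are now the state of the art for satellite image restoration. However, they require training a specific network for each sensor and do not provide the associated uncertainties. This paper proposes a generic method that uses a single network to restore images from several sensors, together with a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder and jointly estimates the uncertainty in the latent and image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We run a comprehensive set of experiments on very high resolution simulated and real Pleiades images, demonstrating the performance and robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared with direct inversion methods. In particular, VBLE-xz is a scalable way to obtain realistic posterior samples and accurate uncertainties, while SatDPIR is a compelling alternative to direct inversion methods when uncertainty quantification is not required.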

Title: Compositional Generative Multiphysics and Multi-component Simulation

Authors: Tao Zhang, Zhenhai Liu, Feipeng Qi, Yongjun Jiao, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04134
Pdf URL: https://arxiv.org/pdf/2412.04134
Copy Paste: [[2412.04134]] Compositional Generative Multiphysics and Multi-component Simulation(https://arxiv.org/abs/2412.04134)
Keywords: generative
Abstract: Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers (each responsible for evolving a specific physical process) into a coupled program, which introduces significant development challenges. Furthermore, no universal algorithm exists for multi-component simulations, which adds to the complexity. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. In inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method in three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.
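Summary: Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields such as nuclear and aerospace engineering. Previous studies usually rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers (each responsible for evolving a specific physical process) into a coupled program, which creates significant development challenges. In addition, there is no universal algorithm for multi-component simulation, which adds to the complexity. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions that model the conditional probability of one physical process or component given the other processes or components. At inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, which is obtained by composing the learned energy functions in a structured way. We test our method on three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupled solution using decoupled data, while the surrogate model fails on the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single-component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared with the surrogate model.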

Title: Understanding Memorization in Generative Models via Sharpness in Probability Landscapes

Authors: Dongjae Jeon, Dueun Kim, Albert No
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04140
Pdf URL: https://arxiv.org/pdf/2412.04140
Copy Paste: [[2412.04140]] Understanding Memorization in Generative Models via Sharpness in Probability Landscapes(https://arxiv.org/abs/2412.04140)
Keywords: generative
Abstract: In this paper, we introduce a geometric framework to analyze memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We propose that memorization arises from isolated points in the learned probability distribution, characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we demonstrate that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for developing strategies to ensure secure and reliable generative models.
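Summary: In this paper, we introduce a geometric framework that analyzes memorization in diffusion models using the eigenvalues of the Hessian of the log probability density. We argue that memorization arises from isolated points in the learned probability distribution, which are characterized by sharpness in the probability landscape, as indicated by large negative eigenvalues of the Hessian. Through experiments on various datasets, we show that these eigenvalues effectively detect and quantify memorization. Our approach provides a clear understanding of memorization in diffusion models and lays the groundwork for strategies that ensure secure and reliable generative models.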
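
The link between sharpness and large negative Hessian eigenvalues can be seen in a toy sketch. This is an illustration under stated assumptions, not the authors' procedure: a 2D isotropic Gaussian stands in for the learned density, the Hessian is estimated by central finite differences, and all function names are hypothetical. For an isotropic Gaussian with standard deviation sigma, the Hessian of the log density is -I / sigma^2, so a narrow, isolated mode yields large negative eigenvalues.

```python
import math
from typing import Callable, Tuple


def log_density(x: float, y: float, sigma: float) -> float:
    """Log density of a zero-mean isotropic 2D Gaussian (toy stand-in)."""
    return -(x * x + y * y) / (2 * sigma ** 2) - math.log(2 * math.pi * sigma ** 2)


def hessian_2d(f: Callable[[float, float], float], x: float, y: float,
               h: float = 1e-3) -> Tuple[float, float, float]:
    """Central finite-difference Hessian entries (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy


def eigenvalues_sym_2x2(a: float, b: float, c: float) -> Tuple[float, float]:
    """Eigenvalues (smallest first) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius


def min_hessian_eigenvalue(sigma: float, x: float = 0.05, y: float = -0.02) -> float:
    """Smallest Hessian eigenvalue of the toy log density at (x, y)."""
    fxx, fxy, fyy = hessian_2d(lambda u, v: log_density(u, v, sigma), x, y)
    return eigenvalues_sym_2x2(fxx, fxy, fyy)[0]


lam_sharp = min_hessian_eigenvalue(sigma=0.1)  # narrow mode: close to -100
lam_broad = min_hessian_eigenvalue(sigma=1.0)  # broad mode: close to -1
```

The narrow mode plays the role of an isolated, memorized point. For a real diffusion model the Hessian would have to be estimated from the learned score rather than from a closed-form density.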

Title: AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

Authors: Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, Qian He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04146
Pdf URL: https://arxiv.org/pdf/2412.04146
Copy Paste: [[2412.04146]] AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models(https://arxiv.org/abs/2412.04146)
Keywords: generation
Abstract: Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-crafted design, AnyDressing can serve as a plug-in module to easily integrate with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.
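Summary: Recent advances in diffusion-based, garment-centric image generation from text and image prompts are impressive. However, existing methods lack support for varied combinations of attire and struggle to preserve garment details while staying faithful to the text prompt, which limits their performance across diverse scenarios. In this paper, we focus on a new task, Multi-Garment Virtual Dressing, and propose a novel method, AnyDressing, for customizing characters conditioned on any combination of garments and any personalized text prompt. AnyDressing consists of two primary networks, GarmentsNet and DressingNet, dedicated respectively to extracting detailed clothing features and generating customized images. Specifically, in GarmentsNet we propose an efficient and scalable module, the Garment-Specific Feature Extractor, that encodes garment textures individually and in parallel. This design prevents garment confusion while keeping the network efficient. Meanwhile, in DressingNet we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy to inject multi-garment features accurately into their corresponding regions. This approach efficiently integrates the texture cues of multiple garments into the generated image and further improves text-image consistency. In addition, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to this design, AnyDressing can serve as a plug-in module that integrates easily with any community control extension for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.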

Title: An In-Depth Examination of Risk Assessment in Multi-Class Classification Algorithms

Authors: Disha Ghandwani, Neeraj Sarna, Yuanyuan Li, Yang Lin
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2412.04166
Pdf URL: https://arxiv.org/pdf/2412.04166
Copy Paste: [[2412.04166]] An In-Depth Examination of Risk Assessment in Multi-Class Classification Algorithms(https://arxiv.org/abs/2412.04166)
Keywords: generation
Abstract: Advanced classification algorithms are being increasingly used in safety-critical applications like healthcare, engineering, etc. In such applications, misclassifications made by ML algorithms can result in substantial financial or health-related losses. To better anticipate and prepare for such losses, the algorithm user seeks an estimate for the probability that the algorithm misclassifies a sample. We refer to this task as risk assessment. For a variety of models and datasets, we numerically analyze the performance of different methods in solving the risk-assessment problem. We consider two solution strategies: a) calibration techniques that calibrate the output probabilities of classification models to provide accurate probability outputs; and b) a novel approach based upon the prediction interval generation technique of conformal prediction. Our conformal prediction based approach is model and data-distribution agnostic, simple to implement, and provides reasonable results for a variety of use-cases. We compare the different methods on a broad variety of models and datasets.
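Summary: Advanced classification algorithms are increasingly used in safety-critical applications such as healthcare and engineering. In these applications, misclassifications made by machine learning algorithms can cause substantial financial or health-related losses. To better anticipate and prepare for such losses, the algorithm user seeks an estimate of the probability that the algorithm misclassifies a sample. We refer to this task as risk assessment. For a variety of models and datasets, we numerically analyze how different methods perform on the risk-assessment problem. We consider two solution strategies: a) calibration techniques, which calibrate the output probabilities of classification models to provide accurate probability outputs; and b) a novel approach based on the prediction-interval generation technique of conformal prediction. Our conformal prediction based approach is agnostic to the model and the data distribution, simple to implement, and gives reasonable results for a variety of use cases. We compare the different methods across a broad range of models and datasets.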
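
For readers unfamiliar with conformal prediction, the following is a minimal sketch of standard split conformal prediction for classification. It is a textbook variant, not necessarily the approach proposed in the paper, and the function names and toy calibration data are assumptions. The nonconformity score is one minus the probability assigned to the true class, and the threshold is the ceil((n + 1)(1 - alpha))-th smallest calibration score.

```python
import math
from typing import List, Sequence


def conformal_threshold(cal_probs: Sequence[Sequence[float]],
                        cal_labels: Sequence[int],
                        alpha: float) -> float:
    """Score threshold from a held-out calibration set (split conformal)."""
    # Nonconformity score: 1 - predicted probability of the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return 1.0  # too few calibration points: every class is admitted
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], threshold: float) -> List[int]:
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data (hypothetical): 3 samples, 3 classes.
cal_probs = [
    [0.5, 0.25, 0.25],
    [0.125, 0.75, 0.125],
    [0.0625, 0.0625, 0.875],
]
cal_labels = [0, 1, 2]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
confident = prediction_set([0.75, 0.125, 0.125], threshold)
ambiguous = prediction_set([0.5, 0.5, 0.0], threshold)
```

A prediction set with more than one class flags a sample on which the classifier is more likely to err, which is one simple way such sets can inform a risk estimate.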

Title: Instructional Video Generation

Authors: Yayuan Li, Zhi Cao, Jason J. Corso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04189
Pdf URL: https://arxiv.org/pdf/2412.04189
Copy Paste: [[2412.04189]] Instructional Video Generation(https://arxiv.org/abs/2412.04189)
Keywords: generation
Abstract: Despite the recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos in which the intricate motion of the hand coupled with a mostly stable and non-distracting environment is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in terms of instructional clarity, especially of the hand motion in the target region, across diverse environments. Results can be found on the project webpage: this https URL
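Summary: Despite recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is egocentric instructional video, where intricate hand motion combined with a mostly stable, non-distracting environment is needed to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method combines two distinct innovations. First, we propose an automatic method that generates the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss that guides the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D and show significant improvements over state-of-the-art methods in instructional clarity, especially for hand motion in the target region, across diverse environments. Results can be found on the project webpage: this https URL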

Title: Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image

Authors: Shuang Xu, Zixiang Zhao, Haowen Bai, Chang Yu, Jiangjun Peng, Xiangyong Cao, Deyu Meng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.04201
Pdf URL: https://arxiv.org/pdf/2412.04201
Copy Paste: [[2412.04201]] Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image(https://arxiv.org/abs/2412.04201)
Keywords: restoration, super-resolution
Abstract: Hyperspectral images (HSIs) are frequently noisy and of low resolution due to the constraints of imaging devices. Recently launched satellites can concurrently acquire HSIs and panchromatic (PAN) images, enabling the restoration of HSIs to generate clean and high-resolution imagery through fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, resulting in accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed zero-shot Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, utilizing an HSI low-rank prior and a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates the training process, necessitating a two-stage training strategy to ensure effective training. Experimental results on both simulated and real-world datasets indicate that the proposed method surpasses state-of-the-art algorithms, yielding more accurate and visually pleasing HRHS images.
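Summary: Because of the constraints of imaging devices, hyperspectral images (HSIs) are often noisy and of low resolution. Recently launched satellites can acquire HSIs and panchromatic (PAN) images at the same time, so HSIs can be restored into clean, high-resolution imagery by fusing PAN images for denoising and super-resolution. However, previous studies treated these two tasks as independent processes, which leads to accumulated errors. This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high-resolution hyperspectral (HRHS) images from noisy low-resolution HSIs (LRHS) and high-resolution PAN images. The proposed zero-shot Hipandas framework consists of a guided denoising network, a guided super-resolution network, and a PAN reconstruction network, and it uses an HSI low-rank prior together with a newly introduced detail-oriented low-rank prior. The interconnection of these networks complicates training, so a two-stage training strategy is needed to ensure effective training. Experimental results on both simulated and real-world datasets show that the proposed method surpasses state-of-the-art algorithms and produces more accurate and visually pleasing HRHS images.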

Title: VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction

Authors: Jiahao Zhang, Ryota Yoshihashi, Shunsuke Kitada, Atsuki Osanai, Yuta Nakashima
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04237
Pdf URL: https://arxiv.org/pdf/2412.04237
Copy Paste: [[2412.04237]] VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction(https://arxiv.org/abs/2412.04237)
Keywords: generation, generative
Abstract: Large language models (LLMs) have proven effective for layout generation due to their ability to produce structure-description languages, such as HTML or JSON, even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which show prominent multi-modal understanding capabilities. How, then, can we leverage this multi-modal power for layout generation? To answer this, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we demonstrate that our method, combined with Gemini and without any additional training, achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.
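Summary: Large language models (LLMs) have proven effective for layout generation because they can produce structure-description languages such as HTML or JSON even without access to visual information. Recently, LLM providers have evolved these models into large vision-language models (LVLMs), which show strong multi-modal understanding. How, then, can we use this multi-modal capability for layout generation? To answer this question, we propose Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based content-aware layout generation. In our method, LVLMs iteratively refine their outputs with reference to rendered layout images, which are visualized as colored bounding boxes on poster backgrounds. In experiments, we show that our method, combined with Gemini and without any additional training, achieves state-of-the-art (SOTA) layout generation quality, outperforming both existing layout-specific generative models and other LLM-based methods.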

Title: LMDM: Latent Molecular Diffusion Model For 3D Molecule Generation

Authors: Xiang Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04242
Pdf URL: https://arxiv.org/pdf/2412.04242
Copy Paste: [[2412.04242]] LMDM: Latent Molecular Diffusion Model For 3D Molecule Generation(https://arxiv.org/abs/2412.04242)
Keywords: generation
Abstract: In this work, we propose a latent molecular diffusion model that makes the generated 3D molecules rich in diversity while maintaining rich geometric features. The model captures the information of the forces and local constraints between atoms so that the generated molecules can maintain Euclidean transformation and a high level of effectiveness and diversity. We also use the lower-rank manifold advantage of the latent variables of the latent model to fuse the information of the forces between atoms to better maintain the geometric equivariant properties of the molecules. Because there is no need to perform information fusion encoding in stages like traditional encoders and decoders, this reduces the amount of calculation in the back-propagation process. The model keeps the forces and local constraints of particle bonds in the latent variable space, reducing the impact of underfitting on the surface of the network on the large position drift of the particle geometry, so that our model can converge earlier. We introduce a distribution control variable in each backward step to strengthen exploration and improve the diversity of generation. In the experiments, the quality of the samples we generated and the convergence speed of the model have been significantly improved.
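Summary: In this work, we propose a latent molecular diffusion model that gives the generated 3D molecules rich diversity while preserving rich geometric features. The model captures information about the forces and local constraints between atoms, so that the generated molecules maintain Euclidean transformation together with high effectiveness and diversity. We also exploit the low-rank manifold advantage of the latent variables of the latent model to fuse information about the forces between atoms and better preserve the geometric equivariance of the molecules. Because information fusion encoding does not need to be performed in stages, as in traditional encoders and decoders, the amount of computation in back-propagation is reduced. The model keeps the forces and local constraints of particle bonds in the latent variable space, which reduces the effect of underfitting at the network surface on large position drift of the particle geometry, so that our model converges earlier. We introduce a distribution control variable in each backward step to strengthen exploration and improve the diversity of generation. In the experiments, both the quality of the generated samples and the convergence speed of the model improve significantly.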

Title: SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.04262
Pdf URL: https://arxiv.org/pdf/2412.04262
Copy Paste: [[2412.04262]] SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction(https://arxiv.org/abs/2412.04262)
Keywords: generation, generative
Abstract: Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables due to the vast amount of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words, and their positions, contained within the tables, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables and compare it to a state-of-the-art generative model and discuss the results. We make the dataset, model, and dataset generation code publicly available.
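Summary: Extracting tables from document images is a challenging AI problem, and labelled data for many content domains is hard to obtain. Existing table extraction datasets usually focus on scientific tables, because large numbers of academic articles and their source code are readily available. However, tables in scientific, financial, and other domains differ significantly in layout and typography. Current datasets often lack the words contained in the tables and their positions, and instead rely on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. A more general method of obtaining labelled data is therefore needed. We present SynFinTabs, a large-scale labelled dataset of synthetic financial tables. We hope that our method of generating these synthetic tables can transfer to other domains. To demonstrate the effectiveness of our dataset for training models that extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model on real-world financial tables, compare it with a state-of-the-art generative model, and discuss the results. We make the dataset, model, and dataset generation code publicly available.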

Title: SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Authors: Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, Guangliang Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04292
Pdf URL: https://arxiv.org/pdf/2412.04292
Copy Paste: [[2412.04292]] SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model(https://arxiv.org/abs/2412.04292)
Keywords: generative
Abstract: The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.
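Summary: The rapid progress of generative models in creating highly realistic images poses substantial risks for the spread of misinformation. For example, a synthetic image shared on social media can mislead large audiences and erode trust in digital content, with severe consequences. Despite some progress, academia has not yet created a large and diverse deepfake detection dataset for social media, nor devised an effective solution to this problem. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, with 300K AI-generated/tampered and authentic images with comprehensive annotations; (2) broad diversity, covering fully synthetic and tampered images across various classes; and (3) elevated realism, with images that are almost indistinguishable from genuine ones by visual inspection alone. Furthermore, building on the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments show that SIDA achieves superior performance across diverse settings. The code, model, and dataset will be released.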

Title: T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

Authors: Ziwei Huang, Wanggui He, Quanyu Long, Yandi Wang, Haoyuan Li, Zhelun Yu, Fangxun Shu, Long Chen, Hao Jiang, Leilei Gan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04300
Pdf URL: https://arxiv.org/pdf/2412.04300
Copy Paste: [[2412.04300]] T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts(https://arxiv.org/abs/2412.04300)
Keywords: generation
Abstract: Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To mitigate this gap, we present T2I-FactualBench in this work - the largest benchmark to date in terms of the number of concepts and prompts specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA) based evaluation framework to assess the factuality of three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.
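Summary: Evaluating the quality of synthesized images remains a significant challenge in the development of text-to-image (T2I) generation. Most existing studies in this area focus on evaluating text-image alignment, image quality, and object composition, while comparatively few address the factuality of T2I models, particularly when the concepts involved are knowledge-intensive. To close this gap, we present T2I-FactualBench, the largest benchmark to date, in terms of the number of concepts and prompts, designed specifically to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce an evaluation framework based on multi-round visual question answering (VQA) to assess the factuality of the three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench show that current state-of-the-art (SOTA) T2I models still leave significant room for improvement.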

Title: LocalSR: Image Super-Resolution in Local Region

Authors: Bo Ji, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04314
Pdf URL: https://arxiv.org/pdf/2412.04314
Copy Paste: [[2412.04314]] LocalSR: Image Super-Resolution in Local Region(https://arxiv.org/abs/2412.04314)
Keywords: super-resolution
Abstract: Standard single-image super-resolution (SR) upsamples and restores entire images. Yet several real-world applications require higher resolutions only in specific regions, such as license plates or faces, making the super-resolution of the entire image, along with the associated memory and computational cost, unnecessary. We propose a novel task, called LocalSR, to restore only local regions of the low-resolution image. For this problem setting, we propose a context-based local super-resolution (CLSR) to super-resolve only specified regions of interest (ROI) while leveraging the entire image as context. Our method uses three parallel processing modules: a base module for super-resolving the ROI, a global context module for gathering helpful features from across the image, and a proximity integration module for concentrating on areas surrounding the ROI, progressively propagating features from distant pixels to the target region. Experimental results indicate that our approach, with its reduced complexity, outperforms variants that focus exclusively on the ROI.
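Summary: Standard single-image super-resolution (SR) upsamples and restores entire images. Yet some real-world applications need higher resolution only in specific regions, such as license plates or faces, so super-resolving the whole image, with its associated memory and computational cost, is unnecessary. We propose a new task, LocalSR, which restores only local regions of the low-resolution image. For this problem setting, we propose context-based local super-resolution (CLSR), which super-resolves only the specified regions of interest (ROI) while using the entire image as context. Our method uses three parallel processing modules: a base module that super-resolves the ROI, a global context module that gathers helpful features from across the image, and a proximity integration module that concentrates on the areas surrounding the ROI, progressively propagating features from distant pixels to the target region. Experimental results show that our approach, with its lower complexity, outperforms variants that focus exclusively on the ROI.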

Title: Liquid: Language Models are Scalable Multi-modal Generators

Authors: Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04332
Pdf URL: https://arxiv.org/pdf/2412.04332
Copy Paste: [[2412.04332]] Liquid: Language Models are Scalable Multi-modal Generators(https://arxiv.org/abs/2412.04332)
Keywords: generation
Abstract: We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as the model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100x in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation. The code and models will be released.

Title: RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

Authors: Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, Taku Komura
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.04343
Pdf URL: https://arxiv.org/pdf/2412.04343
Copy Paste: [[2412.04343]] RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse(https://arxiv.org/abs/2412.04343)
Keywords: generation
Abstract: While motion generation has made substantial progress, its practical application remains constrained by dataset diversity and scale, limiting its ability to handle out-of-distribution scenarios. To address this, we propose a simple and effective baseline, RMD, which enhances the generalization of motion generation through retrieval-augmented techniques. Unlike previous retrieval-based methods, RMD requires no additional training and offers three key advantages: (1) the external retrieval database can be flexibly replaced; (2) body parts from the motion database can be reused, with an LLM facilitating splitting and recombination; and (3) a pre-trained motion diffusion model serves as a prior to improve the quality of motions obtained through retrieval and direct combination. Without any training, RMD achieves state-of-the-art performance, with notable advantages on out-of-distribution data.

Title: Discriminative Fine-tuning of LVLMs

Authors: Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04378
Pdf URL: https://arxiv.org/pdf/2412.04378
Copy Paste: [[2412.04378]] Discriminative Fine-tuning of LVLMs(https://arxiv.org/abs/2412.04378)
Keywords: generative
Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
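The abstract above mentions training with both contrastive and next-token prediction losses. A minimal sketch of how two such terms could be combined follows. It assumes a CLIP-style symmetric contrastive loss over an image-text similarity matrix and a precomputed next-token negative log-likelihood; the `weight` and `temperature` parameters are hypothetical and the paper's actual formulation may differ.

```python
import math


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss; sim[i][j] is the similarity of image i and text j."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row should score its own column highest.
    i2t = sum(cross_entropy_row(scaled[i], i) for i in range(n)) / n
    # Text-to-image: each column should score its own row highest.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2i = sum(cross_entropy_row(cols[j], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def combined_loss(sim, next_token_nll, weight=1.0):
    """Contrastive term plus a weighted next-token term (weight is an assumption)."""
    return contrastive_loss(sim) + weight * next_token_nll


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
```

Matched pairs on the diagonal give a lower contrastive term than mismatched pairs, which is the behavior a discriminative objective rewards.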

Title: Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Authors: Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04424
Pdf URL: https://arxiv.org/pdf/2412.04424
Copy Paste: [[2412.04424]] Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion(https://arxiv.org/abs/2412.04424)
Keywords: generative
Abstract: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrate Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and LLama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multi-modal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. this https URL

Title: Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04431
Pdf URL: https://arxiv.org/pdf/2412.04431
Copy Paste: [[2412.04431]] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis(https://arxiv.org/abs/2412.04431)
Keywords: generation
Abstract: We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images following language instruction. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and a bitwise self-correction mechanism, remarkably improving generation capacity and detail. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
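To see why predicting bits lets the vocabulary grow very large, consider the toy sketch below. It assumes a simple sign-based binary quantizer, which is an illustrative stand-in and not necessarily Infinity's tokenizer: a d-dimensional feature becomes d bits, the implicit vocabulary has 2**d entries, yet a classifier only has to make d binary predictions.

```python
def sign_quantize(features):
    """Map each real-valued feature to one bit: 1 if the value is >= 0, else 0."""
    return [1 if x >= 0 else 0 for x in features]


def bits_to_index(bits):
    """Read the bit list (most significant bit first) as an index into a 2**d vocabulary."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def vocab_size(num_bits):
    """Number of distinct tokens representable with num_bits bits."""
    return 2 ** num_bits


bits = sign_quantize([0.3, -1.2, 0.0, 2.5])
index = bits_to_index(bits)
```

Under this assumption, going from 16 to 32 bits raises the implicit vocabulary from 65,536 to over 4 billion entries while only doubling the number of binary outputs.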

Title: Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Authors: Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04432
Pdf URL: https://arxiv.org/pdf/2412.04432
Copy Paste: [[2412.04432]] Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation(https://arxiv.org/abs/2412.04432)
Keywords: generation
Abstract: In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, representations that can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.

Title: GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04440
Pdf URL: https://arxiv.org/pdf/2412.04440
Copy Paste: [[2412.04440]] GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration(https://arxiv.org/abs/2412.04440)
Keywords: generation
Abstract: Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging stage; it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage into four sequentially-executed MLLM-based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.
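The Design, Generation, and Redesign workflow in the GenMAC abstract can be sketched as plain control flow. Everything below is a hypothetical illustration: the agent functions are toy stand-ins for MLLM calls, the early exit when verification passes is an assumption, and the self-routing between specialized correction agents is omitted.

```python
def make_redesign(verify, suggest, correct, structure):
    """Chain four agents in sequence: verify -> suggest -> correct -> structure."""
    def redesign(video, plan):
        report = verify(video, plan)
        if report["ok"]:  # assumption: stop refining once verification passes
            return True, plan
        suggestion = suggest(report)
        corrected = correct(suggestion, plan)
        return False, structure(corrected)
    return redesign


def run_pipeline(prompt, design, generate, redesign, max_iters=3):
    """Design once, then loop between Generation and Redesign."""
    plan = design(prompt)
    video = generate(plan)
    for _ in range(max_iters):
        done, plan = redesign(video, plan)
        if done:
            break
        video = generate(plan)
    return video, plan


# Toy stand-ins for the agents (hypothetical; real agents would call an MLLM).
def design(prompt):
    return {"prompt": prompt, "guidance": 1}

def generate(plan):
    return f"video@guidance={plan['guidance']}"

def verify(video, plan):
    return {"ok": plan["guidance"] >= 3, "video": video}

def suggest(report):
    return "raise guidance"

def correct(suggestion, plan):
    return {**plan, "guidance": plan["guidance"] + 1}

def structure(plan):
    return dict(plan)


video, plan = run_pipeline(
    "a cat chases a ball", design, generate,
    make_redesign(verify, suggest, correct, structure),
)
```

Splitting Redesign into small single-purpose functions mirrors the abstract's motivation: each agent handles one narrow step, so no single agent has to verify, diagnose, and rewrite the plan at once.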

Title: DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ping Luo, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04446
Pdf URL: https://arxiv.org/pdf/2412.04446
Copy Paste: [[2412.04446]] DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models(https://arxiv.org/abs/2412.04446)
Keywords: generation
Abstract: Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing. We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that employ low-level representations with limited compression rates, DiCoDe utilizes deep tokens with a considerable compression rate (a 1000x reduction in token count). This significant compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens enable DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe is scalable using readily available AR architectures, and is capable of generating videos ranging from a few seconds to one minute using only 4 A100 GPUs for training. We evaluate DiCoDe both quantitatively and qualitatively, demonstrating that it performs comparably to existing methods in terms of quality while ensuring efficient training. To showcase its scalability, we release a series of DiCoDe configurations with varying parameter sizes and observe a consistent improvement in performance as the model size increases from 100M to 3B. We believe that DiCoDe's exploration in academia represents a promising initial step toward scalable video modeling with AR language models, paving the way for the development of larger and more powerful video generation models.

Title: MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04448
Pdf URL: https://arxiv.org/pdf/2412.04448
Copy Paste: [[2412.04448]] MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation(https://arxiv.org/abs/2412.04448)
Keywords: generation
Abstract: Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.

Title: Four-Plane Factorized Video Autoencoders

Authors: Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04452
Pdf URL: https://arxiv.org/pdf/2412.04452
Copy Paste: [[2412.04452]] Four-Plane Factorized Video Autoencoders(https://arxiv.org/abs/2412.04452)
Keywords: generation, generative
Abstract: Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.

Title: HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

Authors: Yuto Matsubara, Ko Nishino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04456
Pdf URL: https://arxiv.org/pdf/2412.04456
Copy Paste: [[2412.04456]] HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery(https://arxiv.org/abs/2412.04456)
Keywords: generation
Abstract: We introduce a novel method for human shape and pose recovery that can fully leverage multiple static views. We target fixed-multiview people monitoring, including elderly care and safety monitoring, in which calibrated cameras can be installed at the corners of a room or an open space but whose configuration may vary depending on the environment. Our key idea is to formulate it as neural optimization. We achieve this with HeatFormer, a neural optimizer that iteratively refines the SMPL parameters given multiview images and is fundamentally agnostic to the configuration of views. HeatFormer realizes this SMPL parameter estimation as heat map generation and alignment with a novel transformer encoder and decoder. We demonstrate the effectiveness of HeatFormer, including its accuracy, robustness to occlusion, and generalizability, through an extensive set of experiments. We believe HeatFormer can play a key role in passive human behavior modeling.

Title: LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

Authors: Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04460
Pdf URL: https://arxiv.org/pdf/2412.04460
Copy Paste: [[2412.04460]] LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors(https://arxiv.org/abs/2412.04460)
Keywords: generation, generative
Abstract: Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.
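For readers unfamiliar with the two-layer output format, the sketch below shows how an RGBA foreground and an RGB background combine into a final image using the standard "over" compositing operator with straight (non-premultiplied) alpha. This is general background on layered images, not the LayerFusion generation mechanism; channel values are assumed to be floats in [0, 1].

```python
def composite_pixel(fg_rgba, bg_rgb):
    """Blend one RGBA foreground pixel over one RGB background pixel.

    out = alpha * foreground + (1 - alpha) * background, per channel.
    """
    r, g, b, a = fg_rgba
    return tuple(a * f + (1.0 - a) * c for f, c in zip((r, g, b), bg_rgb))


def composite_image(fg, bg):
    """Apply composite_pixel to two equally sized images stored as nested lists."""
    return [
        [composite_pixel(fp, bp) for fp, bp in zip(frow, brow)]
        for frow, brow in zip(fg, bg)
    ]


red_half = (1.0, 0.0, 0.0, 0.5)   # half-transparent red foreground pixel
blue = (0.0, 0.0, 1.0)            # opaque blue background pixel
blended = composite_pixel(red_half, blue)
```

Because the layers stay separate until this final step, either one can be edited or replaced without regenerating the other.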

Title: Turbo3D: Ultra-fast Text-to-3D Generation

Authors: Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.04470
Pdf URL: https://arxiv.org/pdf/2412.04470
Copy Paste: [[2412.04470]] Turbo3D: Ultra-fast Text-to-3D Generation(https://arxiv.org/abs/2412.04470)
Keywords: generation
Abstract: We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.

Title: PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Authors: Vinayak Gupta, Yunze Man, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.04471
Pdf URL: https://arxiv.org/pdf/2412.04471
Copy Paste: [[2412.04471]] PaintScene4D: Consistent 4D Scene Generation from Text Prompts(https://arxiv.org/abs/2412.04471)
Keywords: generation, generative
Abstract: Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL