2024-12-16

Title: Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Authors: Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09645
Pdf URL: https://arxiv.org/pdf/2412.09645
Copy Paste: [[2412.09645]] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models(https://arxiv.org/abs/2412.09645)
Keywords: generation, generative
Abstract: Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.

Title: From Noise to Nuance: Advances in Deep Generative Image Models

Authors: Benji Peng, Chia Xin Liang, Ziqian Bi, Ming Liu, Yichao Zhang, Tianyang Wang, Keyu Chen, Xinyuan Song, Pohsun Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09656
Pdf URL: https://arxiv.org/pdf/2412.09656
Copy Paste: [[2412.09656]] From Noise to Nuance: Advances in Deep Generative Image Models(https://arxiv.org/abs/2412.09656)
Keywords: generation, generative
Abstract: Deep learning-based image generation has undergone a paradigm shift since 2021, marked by fundamental architectural breakthroughs and computational innovations. Through reviewing architectural innovations and empirical results, this paper analyzes the transition from traditional generative methods to advanced architectures, with focus on compute-efficient diffusion models and vision transformer architectures. We examine how recent developments in Stable Diffusion, DALL-E, and consistency models have redefined the capabilities and performance boundaries of image synthesis, while addressing persistent challenges in efficiency and quality. Our analysis focuses on the evolution of latent space representations, cross-attention mechanisms, and parameter-efficient training methodologies that enable accelerated inference under resource constraints. While more efficient training methods enable faster inference, advanced control mechanisms like ControlNet and regional attention systems have simultaneously improved generation precision and content customization. We investigate how enhanced multi-modal understanding and zero-shot generation capabilities are reshaping practical applications across industries. Our analysis demonstrates that despite remarkable advances in generation quality and computational efficiency, critical challenges remain in developing resource-conscious architectures and interpretable generation systems for industrial applications. The paper concludes by mapping promising research directions, including neural architecture optimization and explainable generation frameworks.

Title: Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals

Authors: Messi H.J. Lee, Soyeon Jeon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09668
Pdf URL: https://arxiv.org/pdf/2412.09668
Copy Paste: [[2412.09668]] Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals(https://arxiv.org/abs/2412.09668)
Keywords: generation
Abstract: Vision-Language Models (VLMs) combine Large Language Model (LLM) capabilities with image processing, enabling tasks like image captioning and text-to-image generation. Yet concerns persist about their potential to amplify human-like biases, including skin tone bias. Skin tone bias, where darker-skinned individuals face more negative stereotyping than lighter-skinned individuals, is well-documented in the social sciences but remains under-explored in Artificial Intelligence (AI), particularly in VLMs. Using the GAN Face Database, we sampled computer-generated images of Black American men and women, controlling for skin tone variations while keeping other features constant. We then asked VLMs to write stories about these faces and compared the homogeneity of the generated stories. Stories generated by VLMs about darker-skinned Black individuals were more homogeneous than those about lighter-skinned individuals in three of four models, and Black women were consistently represented more homogeneously than Black men across all models. Interaction effects revealed a greater impact of skin tone on women in two VLMs, while the other two showed nonsignificant results, reflecting known stereotyping patterns. These findings underscore the propagation of biases from single-modality AI systems to multimodal models and highlight the need for further research to address intersectional biases in AI.

Title: Omni-ID: Holistic Identity Representation Designed for Generative Tasks

Authors: Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09694
Pdf URL: https://arxiv.org/pdf/2412.09694
Copy Paste: [[2412.09694]] Omni-ID: Holistic Identity Representation Designed for Generative Tasks(https://arxiv.org/abs/2412.09694)
Keywords: generative
Abstract: We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual's appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach uses a few-to-many identity reconstruction training paradigm, where a limited set of input images is used to reconstruct multiple target images of the same individual in various poses and expressions. A multi-decoder framework is further employed to leverage the complementary strengths of diverse decoders during training. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset -- a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.

Title: Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation

Authors: Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09706
Pdf URL: https://arxiv.org/pdf/2412.09706
Copy Paste: [[2412.09706]] Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation(https://arxiv.org/abs/2412.09706)
Keywords: generation, generative
Abstract: Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce IT3A, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modals, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, IT3A outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.
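
The cosine similarity filtering step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each augmented view and the original test sample are summarized by a logit vector, and that a view is kept when its cosine similarity to the original exceeds a fixed threshold. The function name `cosine_filter`, the threshold value, and the toy logits are hypothetical and are not taken from IT3A.

```python
import numpy as np

def cosine_filter(orig_logits, aug_logits, threshold=0.8):
    """Return a boolean keep-mask and the cosine similarities.

    orig_logits: (C,) logits of the original test sample.
    aug_logits:  (N, C) logits of N augmented views.
    threshold:   hypothetical cutoff; views below it are discarded.
    """
    orig = np.asarray(orig_logits, dtype=float)
    aug = np.asarray(aug_logits, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero logits
    sims = aug @ orig / (np.linalg.norm(aug, axis=1) * np.linalg.norm(orig) + eps)
    return sims >= threshold, sims

# Toy data: the second view favors a different class than the original.
orig_logits = np.array([4.0, 1.0, 0.5])
aug_logits = np.array([
    [3.5, 1.2, 0.4],
    [0.2, 3.9, 1.0],
    [4.2, 0.8, 0.6],
])
keep, sims = cosine_filter(orig_logits, aug_logits)
```

In this toy case the second view is filtered out because its logits point toward a different class. A top-k rule over `sims` would be an equally plausible selection strategy.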

Title: Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts

Authors: Philipp Moeßner, Heike Adel
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09715
Pdf URL: https://arxiv.org/pdf/2412.09715
Copy Paste: [[2412.09715]] Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts(https://arxiv.org/abs/2412.09715)
Keywords: generation
Abstract: With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public through a simplified spread of disinformation. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on fake detection performance has rarely been investigated. This work therefore examines the influence of the prompt's level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

Title: The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

Authors: Binxu Wang, John J. Vastola
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09726
Pdf URL: https://arxiv.org/pdf/2412.09726
Copy Paste: [[2412.09726]] The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications(https://arxiv.org/abs/2412.09726)
Keywords: generation
Abstract: By learning the gradient of smoothed data distributions, diffusion models can iteratively generate samples from complex distributions. The learned score function enables their generalization capabilities, but how the learned score relates to the score of the underlying data manifold remains largely unclear. Here, we aim to elucidate this relationship by comparing learned neural scores to the scores of two kinds of analytically tractable distributions: Gaussians and Gaussian mixtures. The simplicity of the Gaussian model makes it theoretically attractive, and we show that it admits a closed-form solution and predicts many qualitative aspects of sample generation dynamics. We claim that the learned neural score is dominated by its linear (Gaussian) approximation for moderate to high noise scales, and supply both theoretical and empirical arguments to support this claim. Moreover, the Gaussian approximation empirically works for a larger range of noise scales than naive theory suggests it should, and is preferentially learned early in training. At smaller noise scales, we observe that learned scores are better described by a coarse-grained (Gaussian mixture) approximation of training data than by the score of the training distribution, a finding consistent with generalization. Our findings enable us to precisely predict the initial phase of trained models' sampling trajectories through their Gaussian approximations. We show that this allows the skipping of the first 15-30% of sampling steps while maintaining high sample quality (with a near state-of-the-art FID score of 1.93 on CIFAR-10 unconditional generation). This forms the foundation of a novel hybrid sampling method, termed analytical teleportation, which can seamlessly integrate with and accelerate existing samplers, including DPM-Solver-v3 and UniPC. Our findings suggest ways to improve the design and training of diffusion models.
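
The closed-form Gaussian score referred to in the abstract above can be made concrete with a short sketch. It assumes a variance-exploding noising convention, where data drawn from N(mu, cov) is perturbed by isotropic noise of scale sigma. The noised distribution is then N(mu, cov + sigma^2 I), and its score is -(cov + sigma^2 I)^{-1} (x - mu). The function names and toy numbers are illustrative only, and other diffusion parameterizations rescale this expression.

```python
import numpy as np

def gaussian_score(x, mu, cov, sigma):
    """Score (gradient of the log density) of N(mu, cov + sigma^2 I) at x."""
    noisy_cov = cov + sigma**2 * np.eye(mu.shape[0])
    return -np.linalg.solve(noisy_cov, x - mu)

def gaussian_log_density(x, mu, cov, sigma):
    """Log density of N(mu, cov + sigma^2 I) at x, used to check the score."""
    d = mu.shape[0]
    noisy_cov = cov + sigma**2 * np.eye(d)
    diff = x - mu
    _, logdet = np.linalg.slogdet(noisy_cov)
    return -0.5 * (diff @ np.linalg.solve(noisy_cov, diff) + logdet + d * np.log(2 * np.pi))

# Toy 2-D example (values are arbitrary).
mu = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.3], [0.3, 0.5]])
x = np.array([0.5, 0.0])
sigma = 0.7

score = gaussian_score(x, mu, cov, sigma)

# Central finite differences of the log density should match the closed form.
h = 1e-5
numeric = np.array([
    (gaussian_log_density(x + h * e, mu, cov, sigma)
     - gaussian_log_density(x - h * e, mu, cov, sigma)) / (2 * h)
    for e in np.eye(2)
])
```

Because the score is linear in x, it can be evaluated without a neural network, which is what makes the Gaussian model analytically convenient.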

Title: CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Authors: Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.09799
Pdf URL: https://arxiv.org/pdf/2412.09799
Copy Paste: [[2412.09799]] CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection(https://arxiv.org/abs/2412.09799)
Keywords: generation
Abstract: Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.

Title: Infinite-dimensional next-generation reservoir computing

Authors: Lyudmila Grigoryeva, Hannah Lim Jing Ting, Juan-Pablo Ortega
Subjects: cs.LG, cs.NE, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2412.09800
Pdf URL: https://arxiv.org/pdf/2412.09800
Copy Paste: [[2412.09800]] Infinite-dimensional next-generation reservoir computing(https://arxiv.org/abs/2412.09800)
Keywords: generation
Abstract: Next-generation reservoir computing (NG-RC) has attracted much attention due to its excellent performance in spatio-temporal forecasting of complex systems and its ease of implementation. This paper shows that NG-RC can be encoded as a kernel ridge regression that makes training efficient and feasible even when the space of chosen polynomial features is very large. Additionally, an extension to an infinite number of covariates is possible, which makes the methodology agnostic with respect to the lags into the past that are considered as explanatory factors, as well as with respect to the number of polynomial covariates, an important hyperparameter in traditional NG-RC. We show that this approach has solid theoretical backing and good behavior based on kernel universality properties previously established in the literature. Various numerical illustrations show that these generalizations of NG-RC outperform the traditional approach in several forecasting applications.
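
The idea of casting polynomial-feature forecasting as kernel ridge regression, described in the abstract above, can be sketched as follows. This is a simplified illustration and not the paper's method: it assumes a finite-degree polynomial kernel (1 + a·b)^degree on a short window of lagged values, whereas the infinite-dimensional extension in the abstract is not reproduced here. All function names, the logistic-map data, and the hyperparameters are hypothetical.

```python
import numpy as np

def delay_embed(series, lags):
    """Rows are [x_t, x_{t-1}, ..., x_{t-lags+1}]; targets are x_{t+1}."""
    X = np.array([series[t - lags + 1 : t + 1][::-1]
                  for t in range(lags - 1, len(series) - 1)])
    y = series[lags:]
    return X, y

def poly_kernel(A, B, degree=2):
    """Polynomial kernel; implicitly spans all monomials up to `degree`."""
    return (1.0 + A @ B.T) ** degree

def fit_krr(X, y, degree=2, ridge=1e-8):
    """Solve (K + ridge * I) alpha = y for the dual coefficients."""
    K = poly_kernel(X, X, degree)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, degree=2):
    return poly_kernel(X_new, X_train, degree) @ alpha

# Toy data: the logistic map, whose one-step rule is a degree-2 polynomial.
series = np.empty(300)
series[0] = 0.2
for t in range(299):
    series[t + 1] = 3.9 * series[t] * (1.0 - series[t])

X, y = delay_embed(series, lags=2)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

alpha = fit_krr(X_train, y_train)
max_err = np.max(np.abs(predict_krr(X_train, alpha, X_test) - y_test))
```

The cost of fitting depends on the number of training points rather than on the number of polynomial features, which is why the kernel form stays feasible when the feature space is very large.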

Title: MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

Authors: Xunnong Xu, Mengying Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09828
Pdf URL: https://arxiv.org/pdf/2412.09828
Copy Paste: [[2412.09828]] MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion(https://arxiv.org/abs/2412.09828)
Keywords: generation, generative
Abstract: Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.
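
The contrast between bidirectional and causal attention over frames, raised in the abstract above, can be illustrated with a small sketch. It shows only a plain frame-level causal mask, in which tokens of a frame attend to tokens of the same or earlier frames. The multi-scale spatial and temporal design of MSC is not reproduced, and the function names, shapes, and random inputs are hypothetical.

```python
import numpy as np

def frame_causal_mask(num_frames, tokens_per_frame):
    """mask[i, j] is True when token i may attend to token j."""
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: 3 frames with 2 tokens each, feature size 4.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
out, weights = masked_attention(q, k, v, mask)

# Changing the last frame's values must not affect earlier frames.
v_changed = v.copy()
v_changed[4:] += 10.0
out_changed, _ = masked_attention(q, k, v_changed, mask)
```

With this mask the outputs for the first two frames are identical in `out` and `out_changed`, which is the property that allows frames to be generated one after another.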

Title: Super-Resolution for Remote Sensing Imagery via the Coupling of a Variational Model and Deep Learning

Authors: Jing Sun, Huanfeng Shen, Qiangqiang Yuan, Liangpei Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09841
Pdf URL: https://arxiv.org/pdf/2412.09841
Copy Paste: [[2412.09841]] Super-Resolution for Remote Sensing Imagery via the Coupling of a Variational Model and Deep Learning(https://arxiv.org/abs/2412.09841)
Keywords: super-resolution
Abstract: Image super-resolution (SR) is an effective way to enhance the spatial resolution and detail information of remote sensing images, to obtain a superior visual quality. As SR is severely ill-conditioned, effective image priors are necessary to regularize the solution space and generate the corresponding high-resolution (HR) image. In this paper, we propose a novel gradient-guided multi-frame super-resolution (MFSR) framework for remote sensing imagery reconstruction. The framework integrates a learned gradient prior as the regularization term into a model-based optimization method. Specifically, the local gradient regularization (LGR) prior is derived from the deep residual attention network (DRAN) through gradient profile transformation. The non-local total variation (NLTV) prior is characterized using the spatial structure similarity of the gradient patches with the maximum a posteriori (MAP) model. The modeled prior performs well in preserving edge smoothness and suppressing visual artifacts, while the learned prior is effective in enhancing sharp edges and recovering fine structures. By incorporating the two complementary priors into an adaptive norm based reconstruction framework, the mixed L1 and L2 regularization minimization problem is optimized to achieve the required HR remote sensing image. Extensive experimental results on remote sensing data demonstrate that the proposed method can produce visually pleasant images and is superior to several of the state-of-the-art SR algorithms in terms of the quantitative evaluation.

Title: Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

Authors: Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.09842
Pdf URL: https://arxiv.org/pdf/2412.09842
Copy Paste: [[2412.09842]] Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training(https://arxiv.org/abs/2412.09842)
Keywords: generation, generative
Abstract: Programmatically generated synthetic data has been used in differentially private training for classification to enhance performance without privacy leakage. However, as the synthetic data is generated from a random process, the distributions of the real data and the synthetic data are distinguishable and difficult to transfer. Therefore, a model trained with the synthetic data generates unrealistic random images, which makes it challenging to adapt the synthetic data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models (coarse, context, and cleaning), we identify stages where synthetic data can be effectively utilized. We theoretically and empirically verify that the cleaning and coarse stages can be trained without private data, using synthetic data instead to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generated data by mitigating the negative impact of privacy-induced noise on the generation process.

Title: LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09856
Pdf URL: https://arxiv.org/pdf/2412.09856
Copy Paste: [[2412.09856]] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity(https://arxiv.org/abs/2412.09856)
Keywords: generation
Abstract: Text-to-video generation enhances content creation but is highly computationally intensive: The computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\times$ (11.5$\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: this https URL.

Title: Financial Sentiment Analysis: Leveraging Actual and Synthetic Data for Supervised Fine-tuning

Authors: Abraham Atsiwo
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.09859
Pdf URL: https://arxiv.org/pdf/2412.09859
Copy Paste: [[2412.09859]] Financial Sentiment Analysis: Leveraging Actual and Synthetic Data for Supervised Fine-tuning(https://arxiv.org/abs/2412.09859)
Keywords: generation
Abstract: The Efficient Market Hypothesis (EMH) highlights the essence of financial news in stock price movement. Financial news comes in the form of corporate announcements, news titles, and other forms of digital text. The generation of insights from financial news can be done with sentiment analysis. General-purpose language models are too general for sentiment analysis in finance. Curated labeled data for fine-tuning general-purpose language models are scarce, and existing fine-tuned models for sentiment analysis in finance do not capture the maximum context width. We hypothesize that using actual and synthetic data can improve performance. We introduce BertNSP-finance to concatenate shorter financial sentences into longer financial sentences, and finbert-lc to determine sentiment from digital text. The results show improved accuracy and F1 score on the Financial PhraseBank data with $50\%$ and $100\%$ agreement levels.
摘要：有效市场假说 (EMH) 强调了金融新闻在股价变动中的本质。金融新闻以公司公告、新闻标题和其他形式的数字文本的形式出现。可以通过情绪分析从金融新闻中生成见解。通用语言模型对于金融情绪分析来说太过笼统。用于微调通用语言模型的精选标记数据很少，而现有的用于金融情绪分析的微调模型无法捕捉到最大的上下文宽度。我们假设使用实际数据和合成数据可以提高性能。我们引入了 BertNSP-finance 将较短的金融句子连接成较长的金融句子，并引入了 finbert-lc 来确定数字文本中的情绪。结果显示，在 $50\%$ 和 $100\%$ 一致水平下，金融短语库数据的准确率和 f1 分数的性能有所提高。

Title: T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements

Authors: Xiangxi Tian, Jie Shan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.09886
Pdf URL: https://arxiv.org/pdf/2412.09886
Copy Paste: [[2412.09886]] T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements(https://arxiv.org/abs/2412.09886)
Keywords: generation, generative
Abstract: Generating continuous environmental models from sparsely sampled data is a critical challenge in spatial modeling, particularly for topography. Traditional spatial interpolation methods often struggle with handling sparse measurements. To address this, we propose a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) using a vision transformer (ViT) architecture for digital elevation model (DEM) generation under sparse conditions. T-GMSI replaces traditional convolution-based methods with ViT for feature extraction and DEM interpolation while incorporating a terrain feature-aware loss function for enhanced accuracy. T-GMSI excels in producing high-quality elevation surfaces from datasets with over 70% sparsity and demonstrates strong transferability across diverse landscapes without fine-tuning. Its performance is validated through extensive experiments, outperforming traditional methods such as ordinary Kriging (OK) and natural neighbor (NN) and a conditional generative adversarial network (CGAN)-based model (CEDGAN). Compared to OK and NN, T-GMSI reduces root mean square error (RMSE) by 40% and 25% on airborne lidar data and by 23% and 10% on spaceborne lidar data. Against CEDGAN, T-GMSI achieves a 20% RMSE improvement on provided DEM data, requiring no fine-tuning. The model's ability to generalize to large, unseen terrains underscores its transferability and potential applicability beyond topographic modeling. This research establishes T-GMSI as a state-of-the-art solution for spatial interpolation on sparse datasets and highlights its broader utility for other sparse data interpolation challenges.
摘要：从稀疏采样数据生成连续环境模型是空间建模（尤其是地形建模）中的一项关键挑战。传统的空间插值方法通常难以处理稀疏测量值。为了解决这个问题，我们提出了一种基于 Transformer 的空间插值生成模型 (T-GMSI)，使用视觉变换器 (ViT) 架构在稀疏条件下生成数字高程模型 (DEM)。T-GMSI 用 ViT 取代传统的基于卷积的方法进行特征提取和 DEM 插值，同时结合地形特征感知损失函数以提高准确性。T-GMSI 擅长从稀疏度超过 70% 的数据集生成高质量的高程表面，并且在无需微调的情况下在不同景观中表现出强大的可移植性。它的性能通过大量实验得到验证，优于传统方法，例如普通克里金法 (OK) 和自然邻域 (NN) 以及基于条件生成对抗网络 (CGAN) 的模型 (CEDGAN)。与 OK 和 NN 相比，T-GMSI 在机载激光雷达数据上将均方根误差 (RMSE) 降低了 40% 和 25%，在星载激光雷达数据上将均方根误差 (RMSE) 降低了 23% 和 10%。与 CEDGAN 相比，T-GMSI 在提供的 DEM 数据上实现了 20% 的 RMSE 改进，无需微调。该模型在大型、未见过的地形上的推广能力凸显了其可转移性和超越地形建模的潜在适用性。这项研究将 T-GMSI 确立为稀疏数据集空间插值的最先进的解决方案，并强调了其在其他稀疏数据插值挑战中的更广泛用途。
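The RMSE reductions quoted in the T-GMSI abstract are relative improvements over a baseline. As a minimal sketch of how such a figure is computed, the snippet below uses made-up elevation values (not data from the paper); the function names `rmse` and `relative_reduction` are illustrative assumptions, not the authors' code.

```python
import math

def rmse(predicted, reference):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which the model lowers RMSE relative to the baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# Hypothetical elevations in metres, chosen only to make the arithmetic easy.
reference = [10.0, 12.0, 11.0, 13.0]
baseline = [12.0, 10.0, 13.0, 11.0]   # every point is off by 2 m
model = [11.0, 11.0, 12.0, 12.0]      # every point is off by 1 m

baseline_rmse = rmse(baseline, reference)
model_rmse = rmse(model, reference)
print(baseline_rmse, model_rmse, relative_reduction(baseline_rmse, model_rmse))
```

In this toy case the baseline RMSE is 2 m and the model RMSE is 1 m, so the relative reduction is 50%; the percentages in the abstract are read the same way.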

Title: VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization

Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09892
Pdf URL: https://arxiv.org/pdf/2412.09892
Copy Paste: [[2412.09892]] VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization(https://arxiv.org/abs/2412.09892)
Keywords: generation
Abstract: We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512*512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at this https URL.
摘要：我们提出了 VQTalker，这是一个基于矢量量化的多语言说话头像生成框架，可解决跨不同语言的唇形同步和自然运动难题。我们的方法基于语音原理，即人类语音由一组有限的不同声音单元（音素）和相应的视觉表达（视素）组成，这些声音单元和视觉表达通常在不同语言之间具有共同点。我们引入了一种基于组残差有限标量量化 (GRFSQ) 的面部运动标记器，可创建面部特征的离散表示。这种方法能够全面捕捉面部运动，同时提高对多种语言的泛化能力，即使在训练数据有限的情况下也是如此。基于这种量化表示，我们实现了从粗到细的运动生成过程，可逐步完善面部动画。大量实验表明，VQTalker 在视频驱动和语音驱动场景中均实现了最佳性能，尤其是在多语言环境中。值得注意的是，我们的方法在保持约 11 kbps 的较低比特率的同时，以 512*512 像素的分辨率实现了高质量的结果。我们的工作为跨语言对话人脸生成开辟了新的可能性。合成结果可在此 https URL 中查看。

Title: MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow

Authors: Zhe Li, Yisheng He, Lei Zhong, Weichao Shen, Qi Zuo, Lingteng Qiu, Zilong Dong, Laurence Tianruo Yang, Weihao Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09901
Pdf URL: https://arxiv.org/pdf/2412.09901
Copy Paste: [[2412.09901]] MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow(https://arxiv.org/abs/2412.09901)
Keywords: generation
Abstract: Generating motion sequences conforming to a target style while adhering to the given content prompts requires accommodating both the content and style. In existing methods, the information usually only flows from style to content, which may cause conflict between the style and content, harming the integration. In contrast, in this work we build a bidirectional control flow between the style and the content, also adjusting the style towards the content, in which case the style-content collision is alleviated and the dynamics of the style are better preserved in the integration. Moreover, we extend the stylized motion generation from one modality, i.e. the style motion, to multiple modalities including texts and images through contrastive learning, leading to flexible style control on the motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, while also enabling multimodal signal control. The code of our method will be made publicly available.
摘要：生成符合目标风格的运动序列并遵循给定的内容提示需要同时适应内容和风格。在现有方法中，信息通常仅从风格流向内容，这可能会导致风格和内容之间发生冲突，从而损害整合。不同的是，在这项工作中，我们在风格和内容之间建立了双向控制流，并将风格调整为内容，在这种情况下，风格-内容冲突得到缓解，风格的动态在整合中得到更好的保留。此外，我们通过对比学习将风格化运动生成从一种模态（即风格运动）扩展到多种模态（包括文本和图像），从而可以灵活地控制运动生成的风格。大量实验表明，我们的方法在不同数据集上的表现明显优于以前的方法，同时还能够控制多模态信号。我们方法的代码将公开。

Title: Precision-Enhanced Human-Object Contact Detection via Depth-Aware Perspective Interaction and Object Texture Restoration

Authors: Yuxiao Wang, Wenpeng Neng, Zhenao Wei, Yu Lei, Weiying Xue, Nan Zhuang, Yanwu Xu, Xinyu Jiang, Qi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09920
Pdf URL: https://arxiv.org/pdf/2412.09920
Copy Paste: [[2412.09920]] Precision-Enhanced Human-Object Contact Detection via Depth-Aware Perspective Interaction and Object Texture Restoration(https://arxiv.org/abs/2412.09920)
Keywords: restoration, generation
Abstract: Human-object contact (HOT) detection is designed to accurately identify the areas where humans and objects come into contact. Current methods frequently fail to account for scenarios where objects are frequently blocking the view, resulting in inaccurate identification of contact areas. To tackle this problem, we suggest using a perspective interaction HOT detector called PIHOT, which utilizes a depth map generation model to offer depth information of humans and objects relative to the camera, thereby preventing false interaction detection. Furthermore, we use mask dilation and object restoration techniques to restore the texture details in covered areas, improve the boundaries between objects, and enhance the perception of humans interacting with objects. Moreover, a spatial awareness perception is intended to concentrate on the characteristic features close to the points of contact. The experimental results show that the PIHOT algorithm achieves state-of-the-art performance on three benchmark datasets for HOT detection tasks. Compared to the most recent DHOT, our method enjoys an average improvement of 13%, 27.5%, 16%, and 18.5% on SC-Acc., C-Acc., mIoU, and wIoU metrics, respectively.
摘要：人与物体接触 (HOT) 旨在准确识别人与物体接触的区域。当前的方法经常无法考虑物体经常阻挡视线的情况，从而导致无法准确识别接触区域。为了解决这个问题，我们建议使用一种透视交互 HOT 检测器，称为 PIHOT，它利用深度图生成模型来提供与相机相关的人和物体的深度信息，从而防止错误的交互检测。此外，我们使用掩模扩张和物体恢复技术来恢复覆盖区域中的纹理细节，改善物体之间的边界，并增强人与物体交互的感知。此外，空间意识感知旨在集中在靠近接触点的特征上。实验结果表明，PIHOT 算法在三个基准数据集上实现了 HOT 检测任务的当前最佳性能。与最新的 DHOT 相比，我们的方法在 SC-Acc.、C-Acc.、mIoU 和 wIoU 指标上分别平均提高了 13%、27.5%、16% 和 18.5%。

Title: FaceShield: Defending Facial Image against Deepfake Threats

Authors: Jaehwan Jeong, Sumin In, Sieun Kim, Hannie Shin, Jongheon Jeong, Sang Ho Yoon, Jaewook Chung, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09921
Pdf URL: https://arxiv.org/pdf/2412.09921
Copy Paste: [[2412.09921]] FaceShield: Defending Facial Image against Deepfake Threats(https://arxiv.org/abs/2412.09921)
Keywords: generative
Abstract: The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity verification is not critical. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel attack strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates attacks on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting applicability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.
摘要：犯罪活动中深度伪造的使用日益增多，这是一个重大问题，引发了广泛的争议。虽然许多研究已经解决了这个问题，但大多数主要集中在深度伪造检测上。对于真实性验证并不关键的犯罪而言，这些被动式解决方案不足以作为根本性的应对方法。现有的主动防御也有局限性，因为它们只对基于特定生成对抗网络 (GAN) 的深度伪造模型有效，鉴于基于扩散的模型的最新进展，它们的适用性较差。在本文中，我们提出了一种名为 FaceShield 的主动防御方法，它引入了针对由扩散模型 (DM) 生成的深度伪造的新型攻击策略，并通过面部特征提取器操作促进对各种现有的基于 GAN 的深度伪造模型的攻击。我们的方法由三个主要部分组成：(i) 操纵 DM 的注意力机制以在去噪过程中排除受保护的面部特征，(ii) 针对突出的面部特征提取模型以增强对抗性扰动的鲁棒性，以及 (iii) 采用高斯模糊和低通滤波技术来提高不可感知性，同时增强对 JPEG 失真的鲁棒性。在 CelebA-HQ 和 VGGFace2-HQ 数据集上的实验结果表明，我们的方法在对抗基于 DM 的最新深度伪造模型时实现了最佳性能，同时也表现出对 GAN 的适用性，并展示了更高的噪声不可感知性以及增强的鲁棒性。

Title: CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models

Authors: Dongyu Yao, Keling Yao, Junhong Zhou, Yinghao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09936
Pdf URL: https://arxiv.org/pdf/2412.09936
Copy Paste: [[2412.09936]] CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models(https://arxiv.org/abs/2412.09936)
Keywords: generation
Abstract: The obesity phenomenon, known as the heavy issue, is a leading cause of preventable chronic diseases worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, limiting their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled in understanding real-world contexts and enabling conversational interactions, making them ideal for downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curated CalData, a 330K image-text pair dataset tailored for ingredient recognition and calorie estimation, combining a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built upon this dataset, we present CaLoRAify, a novel VLM framework aligning ingredient recognition and calorie estimation via training with visual-text pairs. During inference, users only need a single monocular food image to estimate calories while retaining the flexibility of agent-based conversational interaction. With Low-rank Adaptation (LoRA) and Retrieval-augmented Generation (RAG) techniques, our system enhances the performance of foundational VLMs in the vertical domain of calorie estimation. Our code and data are fully open-sourced at this https URL.
摘要：肥胖现象被称为“重度问题”，是全球可预防慢性疾病的主要原因。传统的卡路里估算工具通常依赖于特定的数据格式或复杂的管道，限制了它们在现实场景中的实用性。最近，视觉语言模型 (VLM) 在理解现实世界背景和实现对话交互方面表现出色，使其成为成分分析等下游任务的理想选择。然而，将 VLM 应用于卡路里估算需要特定领域的数据和对齐策略。为此，我们整理了 CalData，这是一个 330K 的图像文本对数据集，专门用于成分识别和卡路里估算，将大规模食谱数据集与详细的营养说明相结合，以实现强大的视觉语言训练。基于此数据集，我们提出了 CaLoRAify，这是一种新颖的 VLM 框架，通过视觉文本对训练将成分识别和卡路里估算对齐。在推理过程中，用户只需要一张单眼食物图像即可估算卡路里，同时保留了基于代理的对话交互的灵活性。借助低秩自适应 (LoRA) 和检索增强生成 (RAG) 技术，我们的系统增强了卡路里估算垂直领域中基础 VLM 的性能。我们的代码和数据在此 https URL 上完全开源。

Title: Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization

Authors: Xinhao Zhong, Shuoyang Sun, Xulin Gu, Zhaoyang Xu, Yaowei Wang, Jianlong Wu, Bin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09959
Pdf URL: https://arxiv.org/pdf/2412.09959
Copy Paste: [[2412.09959]] Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization(https://arxiv.org/abs/2412.09959)
Keywords: generation
Abstract: Dataset distillation offers an efficient way to reduce memory and computational costs by optimizing a smaller dataset with performance comparable to the full-scale original. However, for large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the extensive optimization space limits performance, reducing its practicality. Recent approaches employ pre-trained diffusion models to generate informative images directly, avoiding pixel-level optimization and achieving notable results. However, these methods often face challenges due to distribution shifts between pre-trained models and target datasets, along with the need for multiple distillation steps across varying settings. To address these issues, we propose a novel framework orthogonal to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation. Our method starts by predicting noise generated by the diffusion model based on input images and text prompts (with or without label text), then calculates the corresponding loss for each pair. With the loss differences, we identify distinctive regions of the original images. Additionally, we perform intra-class clustering and ranking on selected patches to maintain diversity constraints. This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.
摘要：数据集蒸馏是一种有效的方法，它通过优化较小的数据集来减少内存和计算成本，同时性能与全尺寸原始数据集相当。然而，对于大型数据集和复杂的深度网络（例如，带有 ResNet-101 的 ImageNet-1K），广泛的优化空间限制了性能，降低了实用性。最近的方法采用预训练的扩散模型直接生成信息丰富的图像，避免了像素级优化并取得了显著的效果。然而，这些方法往往面临挑战，因为预训练模型和目标数据集之间的分布存在差异，并且需要在不同设置中进行多个蒸馏步骤。为了解决这些问题，我们提出了一个与现有的基于扩散的蒸馏方法正交的新框架，利用扩散模型进行选择而不是生成。我们的方法首先根据输入图像和文本提示（带或不带标签文本）预测扩散模型产生的噪声，然后计算每对的相应损失。利用损失差异，我们可以识别原始图像的不同区域。此外，我们对选定的补丁执行类内聚类和排名，以保持多样性约束。这个简化的框架实现了单步蒸馏过程，大量实验表明，我们的方法在各个指标上都优于最先进的方法。
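The selection step described in the abstract above (compare the denoising loss with and without label text, then keep the regions where the two differ most) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-patch losses are hypothetical numbers rather than outputs of a real diffusion model, the score is taken as "loss without label minus loss with label", and `select_patches` is an invented name. The paper's actual scoring, clustering, and ranking may differ.

```python
def select_patches(loss_with_label, loss_without_label, k):
    """Return indices of the k patches whose loss drops most when the label text is given.

    Assumption: a large drop means the patch is distinctive for its class.
    """
    if len(loss_with_label) != len(loss_without_label):
        raise ValueError("both loss lists must cover the same patches")
    scores = [without - with_ for with_, without in zip(loss_with_label, loss_without_label)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical per-patch noise-prediction losses for one image split into 4 patches.
loss_with_label = [0.9, 0.4, 0.7, 0.2]
loss_without_label = [1.0, 0.9, 0.75, 0.8]

print(select_patches(loss_with_label, loss_without_label, k=2))  # -> [3, 1]
```

Patches 3 and 1 show the largest loss gap, so they are the ones this sketch would keep; the intra-class clustering mentioned in the abstract would then operate on the selected patches.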

Title: What constitutes a Deep Fake? The blurry line between legitimate processing and manipulation under the EU AI Act

Authors: Kristof Meding, Christoph Sorge
Subjects: cs.LG, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2412.09961
Pdf URL: https://arxiv.org/pdf/2412.09961
Copy Paste: [[2412.09961]] What constitutes a Deep Fake? The blurry line between legitimate processing and manipulation under the EU AI Act(https://arxiv.org/abs/2412.09961)
Keywords: generation
Abstract: When does a digital image resemble reality? The relevance of this question increases as the generation of synthetic images -- so-called deep fakes -- becomes increasingly popular. Deep fakes have gained much attention for a number of reasons -- among others, due to their potential to disrupt the political climate. In order to mitigate these threats, the EU AI Act implements specific transparency regulations for generating synthetic content or manipulating existing content. However, the distinction between real and synthetic images is -- even from a computer vision perspective -- far from trivial. We argue that the current definition of deep fakes in the AI Act and the corresponding obligations are not sufficiently specified to tackle the challenges posed by deep fakes. By analyzing the life cycle of a digital photo from the camera sensor to the digital editing features, we find that: (1.) Deep fakes are ill-defined in the EU AI Act. The definition leaves too much scope for what a deep fake is. (2.) It is unclear how editing functions like Google's "best take" feature can be considered as an exception to transparency obligations. (3.) The exception for substantially edited images raises questions about what constitutes substantial editing of content and whether or not this editing must be perceptible by a natural person. Our results demonstrate that complying with the current AI Act transparency obligations is difficult for providers and deployers. As a consequence of the unclear provisions, there is a risk that exceptions may be either too broad or too limited. We intend our analysis to foster the discussion on what constitutes a deep fake and to raise awareness about the pitfalls in the current AI Act transparency obligations.
摘要：数字图像何时才能与现实相似？随着合成图像（即所谓的深度伪造）的生成越来越流行，这个问题的重要性也随之增加。深度伪造之所以受到广泛关注，原因有很多，其中包括它们有可能扰乱政治气氛。为了减轻这些威胁，《欧盟人工智能法案》对生成合成内容或操纵现有内容实施了具体的透明度法规。然而，即使从计算机视觉的角度来看，真实图像和合成图像之间的区别也并非微不足道。我们认为，人工智能法案中目前对深度伪造的定义和相应的义务还不够明确，无法应对深度伪造带来的挑战。通过分析数码照片从相机传感器到数字编辑功能的生命周期，我们发现：（1）《欧盟人工智能法案》对深度伪造的定义不明确。该定义对深度伪造留下了太多空间。（2）目前尚不清楚如何将谷歌的“最佳拍摄”功能等编辑功能视为透明度义务的例外。（3）对经过大量编辑的图像的例外情况提出了一些问题，即什么构成对内容的大量编辑，以及这种编辑是否必须被自然人感知。我们的结果表明，对于提供商和部署者来说，遵守当前的《人工智能法案》透明度义务是困难的。由于规定不明确，存在例外情况可能过于宽泛或过于有限的风险。我们希望通过分析来促进对什么是深度伪造的讨论，并提高人们对当前《人工智能法案》透明度义务中存在的缺陷的认识。

Title: GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark

Authors: Sitong Su, Xiao Cai, Lianli Gao, Pengpeng Zeng, Qinhong Du, Mengqi Li, Heng Tao Shen, Jingkuan Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.09997
Pdf URL: https://arxiv.org/pdf/2412.09997
Copy Paste: [[2412.09997]] GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark(https://arxiv.org/abs/2412.09997)
Keywords: generation
Abstract: Recent advances in General Text-to-3D (GT23D) have been significant. However, the lack of a benchmark has hindered systematic evaluation and progress due to issues in datasets and metrics: 1) The largest 3D dataset, Objaverse, suffers from omitted annotations, disorganization, and low quality. 2) Existing metrics only evaluate textual-image alignment without considering the 3D-level quality. To this end, we are the first to present a comprehensive benchmark for GT23D called GT23D-Bench consisting of: 1) a 400k high-fidelity and well-organized 3D dataset that addresses the issues in Objaverse through a systematic annotation-organize-filter pipeline; and 2) comprehensive 3D-aware evaluation metrics which encompass 10 clearly defined metrics thoroughly accounting for multiple dimensions of GT23D. Notably, GT23D-Bench features three properties: 1) Multimodal Annotations. Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions. 2) Holistic Evaluation Dimensions. Our metrics are dissected into a) Textual-3D Alignment, which measures textual alignment with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness. 3) Valuable Insights. We delve into the performance of current GT23D baselines across different evaluation dimensions and provide insightful analysis. Extensive experiments demonstrate that our annotations and metrics are aligned with human preferences.
摘要：通用文本转 3D (GT23D) 的最新进展十分显著。然而，由于数据集和指标方面的问题，缺乏基准阻碍了系统评估和进展：1) 最大的 3D 数据集 Objaverse 存在注释遗漏、混乱和质量低下的问题。2) 现有指标仅评估文本-图像对齐，而不考虑 3D 级质量。为此，我们首次提出了 GT23D 的综合基准，称为 GT23D-Bench，包括：1) 一个 400k 高保真且组织良好的 3D 数据集，通过系统的注释-组织-过滤管道整理 Objaverse 中的问题；2) 全面的 3D 感知评估指标，包括 10 个明确定义的指标，全面考虑了 GT23D 的多维度。值得注意的是，GT23D-Bench 具有三个属性：1) 多模态注释。我们的数据集使用 64 视图深度图、法线图、渲染图像和由粗到细的字幕注释每个 3D 对象。2) 整体评估维度。我们的指标分为 a) 文本 3D 对齐，使用多粒度视觉 3D 表示测量文本对齐；b) 3D 视觉质量，考虑纹理保真度、多视图一致性和几何正确性。3) 有价值的见解。我们深入研究了当前 GT23D 基线在不同评估维度上的表现，并提供了深刻的分析。大量实验表明，我们的注释和指标符合人类偏好。

Title: NowYouSee Me: Context-Aware Automatic Audio Description

Authors: Seon-Ho Lee, Jue Wang, David Fan, Zhikang Zhang, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10002
Pdf URL: https://arxiv.org/pdf/2412.10002
Copy Paste: [[2412.10002]] NowYouSee Me: Context-Aware Automatic Audio Description(https://arxiv.org/abs/2412.10002)
Keywords: generation
Abstract: Audio Description (AD) plays a pivotal role as an application system aimed at guaranteeing accessibility in multimedia content, which provides additional narrations at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce $\mathrm{CA^3D}$, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in long cinematic content. Specifically, the $\mathrm{CA^3D}$ system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer-term dependencies, 2) an anchor-based AD event detector with a feature suppression module that localizes the AD events and extracts discriminative features for AD generation, and 3) a self-refinement module that leverages the generated output to tweak AD event boundaries from coarse to fine. Unlike conventional methods which rely on metadata and ground truth AD timestamps for AD detection and generation tasks, the proposed $\mathrm{CA^3D}$ is the first end-to-end trainable system that uses only visual cues. Extensive experiments demonstrate that the proposed $\mathrm{CA^3D}$ improves on existing architectures for both AD event detection and script generation metrics, establishing new state-of-the-art performance in AD automation.
摘要：音频描述 (AD) 作为一种旨在保证多媒体内容可访问性的应用系统，起着关键作用，它以适当的间隔提供额外的旁白来描述视觉元素，专门满足视障观众的需求。在本文中，我们介绍了 $\mathrm{CA^3D}$，这是一种开创性的统一情境感知自动音频描述系统，可为 AD 事件脚本提供在长电影内容中的精确位置。具体来说，$\mathrm{CA^3D}$ 系统包括：1）时间特征增强模块，用于有效捕获长期依赖关系；2）基于锚点的 AD 事件检测器，带有特征抑制模块，用于定位 AD 事件并提取用于 AD 生成的判别特征；3）自我细化模块，利用生成的输出将 AD 事件边界从粗略调整到精细。与依赖元数据和基本事实 AD 时间戳进行 AD 检测和生成任务的传统方法不同，所提出的 $\mathrm{CA^3D}$ 是第一个仅使用视觉提示的端到端可训练系统。大量实验表明，所提出的 $\mathrm{CA^3D}$ 改进了现有架构的 AD 事件检测和脚本生成指标，在 AD 自动化中建立了新的最先进性能。

Title: Object-Focused Data Selection for Dense Prediction Tasks

Authors: Niclas Popp, Dan Zhang, Jan Hendrik Metzen, Matthias Hein, Lukas Schott
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10032
Pdf URL: https://arxiv.org/pdf/2412.10032
Copy Paste: [[2412.10032]] Object-Focused Data Selection for Dense Prediction Tasks(https://arxiv.org/abs/2412.10032)
Keywords: generation
Abstract: Dense prediction tasks such as object detection and segmentation require high-quality labels at pixel level, which are costly to obtain. Recent advances in foundation models have enabled the generation of autolabels, which we find to be competitive but not yet sufficient to fully replace human annotations, especially for more complex datasets. Thus, we consider the challenge of selecting a representative subset of images for labeling from a large pool of unlabeled images under a constrained annotation budget. This task is further complicated by imbalanced class distributions, as rare classes are often underrepresented in selected subsets. We propose object-focused data selection (OFDS) which leverages object-level representations to ensure that the selected image subsets semantically cover the target classes, including rare ones. We validate OFDS on PASCAL VOC and Cityscapes for object detection and semantic segmentation tasks. Our experiments demonstrate that prior methods which employ image-level representations fail to consistently outperform random selection. In contrast, OFDS consistently achieves state-of-the-art performance with substantial improvements over all baselines in scenarios with imbalanced class distributions. Moreover, we demonstrate that pre-training with autolabels on the full datasets before fine-tuning on human-labeled subsets selected by OFDS further enhances the final performance.
摘要：密集预测任务（例如对象检测和分割）需要像素级的高质量标签，而获取这些标签的成本很高。基础模型的最新进展使得自动标记的生成成为可能，我们发现自动标记具有竞争力，但还不足以完全取代人工注释，尤其是对于更复杂的数据集。因此，我们考虑在受限的注释预算下从大量未标记图像中选择代表性图像子集进行标记的挑战。不平衡的类别分布使这项任务变得更加复杂，因为稀有类别在选定的子集中通常代表性不足。我们提出了以对象为中心的数据选择 (OFDS)，它利用对象级表示来确保所选图像子集在语义上覆盖目标类别，包括稀有类别。我们在 PASCAL VOC 和 Cityscapes 上验证了 OFDS 的对象检测和语义分割任务。我们的实验表明，采用图像级表示的先前方法无法始终优于随机选择。相比之下，在类别分布不平衡的场景中，OFDS 始终实现最先进的性能，并且比所有基线都有显着改进。此外，我们证明，在对 OFDS 选择的人工标记子集进行微调之前，对完整数据集进行自动标记预训练可以进一步提高最终性能。

Title: SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution

Authors: Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10049
Pdf URL: https://arxiv.org/pdf/2412.10049
Copy Paste: [[2412.10049]] SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution(https://arxiv.org/abs/2412.10049)
Keywords: super-resolution
Abstract: In today's digital landscape, the blending of AI-generated and authentic content has underscored the need for copyright protection and content authentication. Watermarking has become a vital tool to address these challenges, safeguarding both generated and real content. Effective watermarking methods must withstand various distortions and attacks. Current deep watermarking techniques often use an encoder-noise layer-decoder architecture and include distortions to enhance robustness. However, they struggle to balance robustness and fidelity and remain vulnerable to adaptive attacks, despite extensive training. To overcome these limitations, we propose SuperMark, a robust, training-free watermarking framework. Inspired by the parallels between watermark embedding/extraction in watermarking and the denoising/noising processes in diffusion models, SuperMark embeds the watermark into initial Gaussian noise using existing techniques. It then applies pre-trained Super-Resolution (SR) models to denoise the watermarked noise, producing the final watermarked image. For extraction, the process is reversed: the watermarked image is inverted back to the initial watermarked noise via DDIM Inversion, from which the embedded watermark is extracted. This flexible framework supports various noise injection methods and diffusion-based SR models, enabling enhanced customization. The robustness of the DDIM Inversion process against perturbations allows SuperMark to achieve strong resilience to distortions while maintaining high fidelity. Experiments demonstrate that SuperMark achieves fidelity comparable to existing methods while significantly improving robustness. Under standard distortions, it achieves an average watermark extraction accuracy of 99.46%, and 89.29% under adaptive attacks. Moreover, SuperMark shows strong transferability across datasets, SR models, embedding methods, and resolutions.
摘要：在当今的数字环境中，人工智能生成内容与真实内容的融合凸显了版权保护和内容认证的必要性。水印已成为应对这些挑战的重要工具，可以保护生成内容和真实内容。有效的水印方法必须能够承受各种失真和攻击。当前的深度水印技术通常使用编码器-噪声层-解码器架构，并在训练中加入失真以增强鲁棒性。然而，尽管经过了大量的训练，它们仍然难以平衡鲁棒性和保真度，并且仍然容易受到自适应攻击。为了克服这些限制，我们提出了 SuperMark，一个强大的、无需训练的水印框架。受水印中水印嵌入/提取与扩散模型中的去噪/加噪过程之间的相似之处的启发，SuperMark 使用现有技术将水印嵌入初始高斯噪声中。然后，它应用预先训练的超分辨率 (SR) 模型对水印噪声进行去噪，从而生成最终的水印图像。对于提取，该过程是相反的：通过 DDIM 反转将水印图像反转回初始水印噪声，从中提取嵌入的水印。这个灵活的框架支持各种噪声注入方法和基于扩散的 SR 模型，从而实现增强的定制化。DDIM 反转过程对干扰的鲁棒性使 SuperMark 能够在保持高保真度的同时实现对失真的强大弹性。实验表明，SuperMark 实现了与现有方法相当的保真度，同时显著提高了鲁棒性。在标准失真下，它实现了 99.46% 的平均水印提取准确率，在自适应攻击下实现了 89.29% 的平均水印提取准确率。此外，SuperMark 在数据集、SR 模型、嵌入方法和分辨率之间表现出很强的可移植性。
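The embed-then-invert idea in the SuperMark abstract can be illustrated with a toy round trip. The sketch below is not the authors' pipeline: it assumes a simple sign-based scheme (each watermark bit fixes the sign of one Gaussian sample), and it replaces the SR-model denoising plus DDIM inversion with a hypothetical stand-in, `simulated_round_trip`, that just adds small random error to the noise. All function names are illustrative.

```python
import random

def embed_watermark(bits, rng):
    """Sample |z| from a standard normal and set each sign from a bit (1 -> +, 0 -> -)."""
    noise = []
    for bit in bits:
        magnitude = abs(rng.gauss(0.0, 1.0))
        noise.append(magnitude if bit == 1 else -magnitude)
    return noise

def extract_watermark(noise):
    """Read the bits back from the signs of the (recovered) noise."""
    return [1 if value > 0 else 0 for value in noise]

def bit_accuracy(original_bits, recovered_bits):
    """Fraction of bits recovered correctly."""
    matches = sum(1 for a, b in zip(original_bits, recovered_bits) if a == b)
    return matches / len(original_bits)

def simulated_round_trip(noise, rng, sigma):
    """Stand-in for 'generate image, distort, invert': adds Gaussian error to the noise."""
    return [value + rng.gauss(0.0, sigma) for value in noise]

rng = random.Random(0)  # fixed seed so the run is repeatable
bits = [rng.randint(0, 1) for _ in range(256)]
noise = embed_watermark(bits, rng)

clean = bit_accuracy(bits, extract_watermark(noise))
perturbed = bit_accuracy(bits, extract_watermark(simulated_round_trip(noise, rng, sigma=0.1)))
print(clean, perturbed)
```

Only samples whose magnitude is close to zero can flip sign under a small perturbation, which is why the recovered bits stay mostly intact; the accuracy figures in the abstract measure the same quantity on real images.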

Title: Quaffure: Real-Time Quasi-Static Neural Hair Simulation

Authors: Tuur Stuyck, Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Aljaz Bozic, Nikolaos Sarafianos, Doug Roble
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.10061
Pdf URL: https://arxiv.org/pdf/2412.10061
Copy Paste: [[2412.10061]] Quaffure: Real-Time Quasi-Static Neural Hair Simulation(https://arxiv.org/abs/2412.10061)
Keywords: generation
Abstract: Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hairstyles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method's effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications with an inference time of only a few milliseconds on consumer hardware and its ability to scale to predicting the drape of 1000 grooms in 0.3 seconds.
摘要：逼真的头发运动对于高质量的虚拟形象至关重要，但它通常受到实时应用可用的计算资源的限制。为了应对这一挑战，我们提出了一种新颖的神经方法来预测物理上合理的头发变形，这种方法可以推广到各种身体姿势、形状和发型。我们的模型使用自监督损失进行训练，无需昂贵的数据生成和存储。我们通过大量结果证明了我们方法的有效性，这些结果涵盖了广泛的姿势和形状变化，展示了其强大的泛化能力和时间平滑的结果。我们的方法非常适合实时应用，在消费级硬件上的推理时间仅为几毫秒，并且能够扩展到在 0.3 秒内预测 1000 个发型造型 (groom) 的垂坠形态。

Title: Guidance Not Obstruction: A Conjugate Consistent Enhanced Strategy for Domain Generalization

Authors: Meng Cao, Songcan Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10089
Pdf URL: https://arxiv.org/pdf/2412.10089
Copy Paste: [[2412.10089]] Guidance Not Obstruction: A Conjugate Consistent Enhanced Strategy for Domain Generalization(https://arxiv.org/abs/2412.10089)
Keywords: generation
Abstract: Domain generalization addresses domain shift in real-world applications. Most approaches adopt a domain angle, seeking invariant representation across domains by aligning their marginal distributions, irrespective of individual classes, naturally leading to insufficient exploration of discriminative information. Switching to a class angle, we find that multiple domain-related peaks or clusters within the same individual classes must emerge due to distribution shift. In other words, marginal alignment does not guarantee conditional alignment, leading to suboptimal generalization. Therefore, we argue that acquiring discriminative generalization between classes within domains is crucial. In contrast to seeking distribution alignment, we endeavor to safeguard domain-related between-class discrimination. To this end, we devise a novel Conjugate Consistent Enhanced Module, namely Con2EM, based on a distribution over domains, i.e., a meta-distribution. Specifically, we employ a novel distribution-level Universum strategy to generate supplementary diverse domain-related class-conditional distributions, thereby enhancing generalization. This allows us to resample from these generated distributions to provide feedback to the primordial instance-level classifier, further improving its adaptability to the target-agnostic. To ensure generation accuracy, we establish an additional distribution-level classifier to regularize these conditional distributions. Extensive experiments have been conducted to demonstrate its effectiveness and low computational cost compared to SOTAs.
摘要：领域泛化解决了实际应用中的领域转移问题。大多数方法采用领域角度，通过对齐其边缘分布来寻求跨领域的不变表示，而不考虑单个类别，这自然会导致对判别信息的探索不足。切换到类角度，我们发现由于分布转移，同一个单个类别中必须出现多个与领域相关的峰值或聚类。换句话说，边缘对齐不能保证条件对齐，导致泛化不理想。因此，我们认为在领域内获得类别之间的判别泛化至关重要。与寻求分布对齐相反，我们努力保障与领域相关的类间判别。为此，我们设计了一个基于领域分布（即元分布）的新型共轭一致性增强模块，即 Con2EM。具体而言，我们采用一种新型分布级 Universum 策略来生成补充的多样化领域相关类条件分布，从而增强泛化。这样我们就可以从这些生成的分布中重新采样，以向原始实例级分类器提供反馈，从而进一步提高其对目标无关的适应性。为了确保生成准确性，我们建立了一个额外的分布级分类器来规范这些条件分布。已经进行了大量实验来证明其有效性和与 SOTA 相比较低的计算成本。

Title: Feature Selection for Latent Factor Models

Authors: Rittwika Kansabanik, Adrian Barbu
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2412.10128
Pdf URL: https://arxiv.org/pdf/2412.10128
Copy Paste: [[2412.10128]] Feature Selection for Latent Factor Models(https://arxiv.org/abs/2412.10128)
Keywords: generative
Abstract: Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the 'curse of dimensionality,' and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.
摘要：特征选择对于在高维数据集中精确定位相关特征、缓解“维数灾难”以及提高机器学习性能至关重要。传统的分类特征选择方法使用来自所有类别的数据来为每个类别选择特征。本文探讨了特征选择方法，该方法使用基于低秩生成方法的类模型并引入信噪比 (SNR) 特征选择标准，为每个类别分别选择特征。这种新方法在某些假设下具有理论上真正的特征恢复保证，并且已被证明在标准分类数据集上优于一些现有的特征选择方法。

Title: VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation

Authors: Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.10151
Pdf URL: https://arxiv.org/pdf/2412.10151
Copy Paste: [[2412.10151]] VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation(https://arxiv.org/abs/2412.10151)
Keywords: generation
Abstract: We propose the VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.
摘要：我们提出了 VLR-Bench，这是一个用于评估基于检索增强生成 (RAG) 的视觉语言模型 (VLM) 的视觉问答 (VQA) 基准。与现有的基于外部知识的 VQA 评估数据集不同，提出的 VLR-Bench 包含五个输入段落。这允许测试确定哪个段落对于回答给定查询有用的能力，这是以前的研究所缺乏的能力。在此背景下，我们构建了一个包含 32,000 个自动生成的指令遵循示例的数据集，我们将其表示为 VLR-IF。该数据集专门用于增强 VLM 的 RAG 功能，使它们能够学习如何根据输入段落生成适当的答案。我们评估了所提出的基准和训练数据的有效性，并使用最先进的基于 Llama3 的 VLM，即 Llava-Llama-3 模型验证了其性能。提出的 VLR-Bench 和 VLR-IF 数据集可在线公开获取。

Title: Simple Guidance Mechanisms for Discrete Diffusion Models

Authors: Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, Volodymyr Kuleshov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10193
Pdf URL: https://arxiv.org/pdf/2412.10193
Copy Paste: [[2412.10193]] Simple Guidance Mechanisms for Discrete Diffusion Models(https://arxiv.org/abs/2412.10193)
Keywords: generation
Abstract: Diffusion models for continuous data gained widespread adoption owing to their high-quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.
摘要：连续数据的扩散模型因其高质量的生成和控制机制而得到广泛采用。然而，由于连续指导方法不直接适用于离散扩散，离散数据上的可控扩散面临挑战。在这里，我们提供了离散扩散的无分类器和基于分类器的指导的直接推导，以及一类新的扩散模型，这些模型利用均匀噪声，并且由于它们可以连续编辑其输出而更具指导性。我们通过一种新颖的连续时间变分下限来提高这些模型的质量，从而产生最先进的性能，尤其是在涉及指导或快速生成的环境中。从经验上讲，我们证明我们的指导机制与均匀噪声扩散相结合，相对于自回归和扩散基线，在几个离散数据域（包括基因组序列、小分子设计和离散化图像生成）上改善了可控生成。
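
Illustration (not from the paper): the abstract mentions classifier-free guidance for discrete diffusion without giving the formula. A common way to write classifier-free guidance over a categorical distribution is to combine conditional and unconditional token probabilities in log space, p_guided ∝ p_cond^γ · p_uncond^(1-γ), and renormalize over the vocabulary. The sketch below assumes that form; the function name `guided_distribution`, the guidance weight `gamma`, and the example probabilities are hypothetical and may differ from the authors' derivation.

```python
import math
from typing import List


def guided_distribution(cond: List[float], uncond: List[float], gamma: float) -> List[float]:
    """Combine per-token probabilities as cond**gamma * uncond**(1 - gamma), renormalized.

    Assumed formulation of classifier-free guidance for one categorical
    variable; gamma = 1 recovers the conditional distribution and
    gamma = 0 recovers the unconditional one.
    """
    eps = 1e-12  # floor to avoid log(0)
    logits = [
        gamma * math.log(max(pc, eps)) + (1.0 - gamma) * math.log(max(pu, eps))
        for pc, pu in zip(cond, uncond)
    ]
    # Numerically stable softmax over the vocabulary.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy three-token vocabulary (made-up numbers).
cond_probs = [0.6, 0.3, 0.1]
uncond_probs = [0.4, 0.4, 0.2]
sharpened = guided_distribution(cond_probs, uncond_probs, gamma=2.0)
```

With `gamma` above 1, probability mass moves toward tokens that the conditional model favors more strongly than the unconditional one.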

Title: Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10208
Pdf URL: https://arxiv.org/pdf/2412.10208
Copy Paste: [[2412.10208]] Efficient Generative Modeling with Residual Vector Quantization-Based Tokens(https://arxiv.org/abs/2412.10208)
Keywords: generation, generative
Abstract: We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at this https URL
摘要：我们探索在矢量量化生成模型中使用残差矢量量化 (RVQ) 进行高保真生成。这种量化技术通过使用更深入的标记来保持更高的数据保真度。然而，增加生成模型中的标记数量会导致推理速度变慢。为此，我们引入了 ResGen，这是一种基于 RVQ 的高效离散扩散模型，可在不影响采样速度的情况下生成高保真样本。我们的关键思想是直接预测集体标记而不是单个标记的矢量嵌入。此外，我们证明了我们提出的标记掩码和多标记预测方法可以在原则性概率框架内使用离散扩散过程和变分推理来制定。我们在两个跨不同模态的具有挑战性的任务上验证了所提出方法的有效性和通用性：ImageNet 256x256 上的条件图像生成和零样本文本到语音合成。实验结果表明，ResGen 在两个任务中的表现都优于自回归模型，在不影响采样速度的情况下提供了卓越的性能。此外，随着我们扩展 RVQ 的深度，与类似大小的基线模型相比，我们的生成模型表现出更高的生成保真度或更快的采样速度。项目页面可在此 https URL 中找到
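
Illustration (not from the paper): a minimal sketch of plain residual vector quantization, the technique the abstract builds on. Each depth quantizes whatever the previous depths left unexplained, and decoding sums the selected code vectors across depths. This is standard RVQ only, not ResGen itself; the helper names and the tiny hand-written codebooks are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the code vector closest to x in squared Euclidean distance."""
    def sq_dist(code: Vector) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Return one token per depth plus the residual left after the last depth."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code; the next depth quantizes what remains.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct by summing the selected code vectors over depths."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two depths: a coarse codebook followed by a finer one (made-up values).
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
x = [1.2, 0.3]
tokens, residual = rvq_encode(x, codebooks)  # tokens == [1, 3]
recon = rvq_decode(tokens, codebooks)        # recon == [1.25, 0.25]
```

Adding depth shrinks the reconstruction error but produces more tokens per vector, which is the fidelity versus speed trade-off the abstract describes.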

Title: Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication

Authors: Alireza Furutanpey, Pantelis A. Frangoudis, Patrik Szabo, Schahram Dustdar
Subjects: cs.LG, cs.DC, cs.NI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.10265
Pdf URL: https://arxiv.org/pdf/2412.10265
Copy Paste: [[2412.10265]] Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication(https://arxiv.org/abs/2412.10265)
Keywords: generation, generative
Abstract: This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.
摘要：本文使用面向任务的通信系统的信息瓶颈 (IB) 目标研究了深度神经网络 (DNN) 的对抗鲁棒性。我们通过经验证明，虽然基于 IB 的方法提供了针对下游任务的攻击的基本弹性，但对面向任务的通信生成模型的依赖带来了新的漏洞。通过对多个数据集进行大量实验，我们分析了瓶颈深度和任务复杂性如何影响对抗鲁棒性。我们的主要发现表明，与深度变分信息瓶颈 (DVIB) 方法相比，浅变分瓶颈注入 (SVBI) 提供的对抗鲁棒性较低，并且对于更复杂的任务，差距会扩大。此外，我们发现基于 IB 的目标对专注于高强度显著像素的攻击表现出比干扰许多低强度像素的攻击更强的鲁棒性。最后，我们证明依赖生成模型来提取和恢复显著信息的任务导向通信系统具有更大的攻击面。结果强调了利用神经网络进行面向目标压缩的下一代通信系统的重要安全考虑。

Title: Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry

Authors: Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10273
Pdf URL: https://arxiv.org/pdf/2412.10273
Copy Paste: [[2412.10273]] Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry(https://arxiv.org/abs/2412.10273)
Keywords: generation
Abstract: We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" models the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") beats SoTA baselines such as CAT3D and One-2-3-45 on held-out objects from ObjaverseXL, as well as real-world objects ranging from Google Scanned Objects, Amazon Berkeley Objects, to the Digital Twin Catalog.
摘要：我们引入了一种分层概率方法，从 2D 图像转换为多视图 3D：扩散“先验”对看不见的 3D 几何图形进行建模，然后调节扩散“解码器”以生成主体的新视图。我们使用基于点图的多视图图像格式的几何表示来协调同时生成多个目标视图。我们通过假设相对于源相机的固定目标相机姿势并为每个目标构建可预测的几何特征分布来促进视图之间的对应关系。我们的模块化、几何驱动的新视图合成方法（称为“unPIC”）在 ObjaverseXL 的保留对象以及从 Google 扫描对象、Amazon Berkeley 对象到数字孪生目录等现实世界对象上击败了 SoTA 基线（例如 CAT3D 和 One-2-3-45）。

Title: TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Authors: Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10275
Pdf URL: https://arxiv.org/pdf/2412.10275
Copy Paste: [[2412.10275]] TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation(https://arxiv.org/abs/2412.10275)
Keywords: generation
Abstract: Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure consistency between the movement trajectory and the textual description, and (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffusion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.
摘要：文本驱动的图像到视频生成 (TI2V) 旨在根据第一帧和相应的文本描述生成可控视频。这项任务的主要挑战在于两部分：(i) 如何识别目标对象并确保运动轨迹和文本描述之间的一致性。(ii) 如何提高生成的视频的主观质量。为了应对上述挑战，我们提出了一种新的基于扩散的 TI2V 框架，称为 TIV-Diffusion，通过以对象为中心的文本-视觉对齐，旨在实现基于不同对象的文本描述运动的精确控制和高质量视频生成。具体而言，我们通过尺度偏移调制结合融合的文本和视觉知识，使我们的 TIV-Diffusion 模型能够感知文本描述的对象及其运动轨迹。此外，为了缓解物体消失和物体与运动错位的问题，我们引入了以物体为中心的文本-视觉对齐模块，通过分离参考图像中的物体并将文本特征与每个物体单独对齐，降低了物体/运动错位的风险。基于上述创新，与现有的 TI2V 方法相比，我们的 TIV-Diffusion 实现了最先进的高质量视频生成。

Title: Coherent 3D Scene Diffusion From a Single RGB Image

Authors: Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10294
Pdf URL: https://arxiv.org/pdf/2412.10294
Copy Paste: [[2412.10294]] Coherent 3D Scene Diffusion From a Single RGB Image(https://arxiv.org/abs/2412.10294)
Keywords: generative
Abstract: We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.
摘要：我们提出了一种基于扩散的新型方法，用于从单个 RGB 图像进行连贯的 3D 场景重建。我们的方法利用图像调节的 3D 场景扩散模型来同时对场景内所有对象的 3D 姿势和几何形状进行去噪。受任务的不适定性质的启发，为了获得一致的场景重建结果，我们通过同时调节所有场景对象来捕捉场景背景，并允许模型在整个扩散过程中学习对象间关系，从而学习生成场景先验。我们还提出了一种有效的表面对齐损失，即使在没有完整的地面实况注释的情况下也能促进训练，这在公开可用的数据集中很常见。这种损失利用了一种富有表现力的形状表示，可以从中间形状预测中直接进行点采样。通过将单个 RGB 图像 3D 场景重建任务构建为条件扩散过程，我们的方法超越了当前最先进的方法，在 SUN RGB-D 上实现了 AP3D 的 12.04% 的提升，在 Pix3D 上实现了 F 分数的 13.43% 的提升。

Title: Generative AI in Medicine

Authors: Divya Shanmugam, Monica Agrawal, Rajiv Movva, Irene Y. Chen, Marzyeh Ghassemi, Emma Pierson
Subjects: cs.LG, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2412.10337
Pdf URL: https://arxiv.org/pdf/2412.10337
Copy Paste: [[2412.10337]] Generative AI in Medicine(https://arxiv.org/abs/2412.10337)
Keywords: generative
Abstract: The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges -- including maintaining privacy and security, improving transparency and interpretability, upholding equity, and rigorously evaluating models -- which must be overcome to realize this potential, and the open research directions they give rise to.
摘要：生成式人工智能能力的增强极大地扩展了其在医学领域的潜在用例。我们为临床医生、患者、临床试验组织者、研究人员和受训人员提供了生成式人工智能用例的全面概述。然后，我们讨论了实现这一潜力必须克服的诸多挑战（包括维护隐私和安全、提高透明度和可解释性、维护公平性以及严格评估模型），以及它们引发的开放研究方向。

Title: XYScanNet: An Interpretable State Space Model for Perceptual Image Deblurring

Authors: Hanzhou Liu, Chengkai Liu, Jiacong Xu, Peng Jiang, Mi Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10338
Pdf URL: https://arxiv.org/pdf/2412.10338
Copy Paste: [[2412.10338]] XYScanNet: An Interpretable State Space Model for Perceptual Image Deblurring(https://arxiv.org/abs/2412.10338)
Keywords: restoration
Abstract: Deep state-space models (SSMs), like recent Mamba architectures, are emerging as a promising alternative to CNN and Transformer networks. Existing Mamba-based restoration methods process the visual data by leveraging a flatten-and-scan strategy that converts image patches into a 1D sequence before scanning. However, this scanning paradigm ignores local pixel dependencies and introduces spatial misalignment by positioning distant pixels incorrectly adjacent, which reduces local noise-awareness and degrades image sharpness in low-level vision tasks. To overcome these issues, we propose a novel slice-and-scan strategy that alternates scanning along intra- and inter-slices. We further design a new Vision State Space Module (VSSM) for image deblurring, and tackle the inefficiency challenges of the current Mamba-based vision module. Building upon this, we develop XYScanNet, an SSM architecture integrated with a lightweight feature fusion module for enhanced image deblurring. XYScanNet maintains competitive distortion metrics and significantly improves perceptual performance. Experimental results show that XYScanNet enhances KID by $17\%$ compared to the nearest competitor. Our code will be released soon.
摘要：深度状态空间模型 (SSM)，如最近的 Mamba 架构，正在成为 CNN 和 Transformer 网络的有前途的替代方案。现有的基于 Mamba 的恢复方法通过利用展平和扫描策略来处理视觉数据，该策略在扫描之前将图像块转换为 1D 序列。然而，这种扫描范式忽略了局部像素依赖性，并通过将远处的像素错误地相邻放置而引入了空间错位，从而降低了局部噪声感知能力并降低了低级视觉任务中的图像清晰度。为了克服这些问题，我们提出了一种新颖的切片和扫描策略，该策略交替沿切片内和切片间扫描。我们进一步设计了一个新的视觉状态空间模块 (VSSM) 用于图像去模糊，并解决了当前基于 Mamba 的视觉模块的低效率挑战。在此基础上，我们开发了 XYScanNet，这是一种集成了轻量级特征融合模块的 SSM 架构，用于增强图像去模糊。XYScanNet 保持了有竞争力的失真指标并显着提高了感知性能。实验结果表明，与最接近的竞争对手相比，XYScanNet 将 KID 提高了 $17\%$。我们的代码即将发布。

Title: OP-LoRA: The Blessing of Dimensionality

Authors: Piotr Teterwak, Kate Saenko, Bryan A. Plummer, Ser-Nam Lim
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.10362
Pdf URL: https://arxiv.org/pdf/2412.10362
Copy Paste: [[2412.10362]] OP-LoRA: The Blessing of Dimensionality(https://arxiv.org/abs/2412.10362)
Keywords: generation
Abstract: Low-rank adapters enable fine-tuning of large models with only a small number of parameters, thus reducing storage costs and minimizing the risk of catastrophic forgetting. However, they often pose optimization challenges, with poor convergence. To overcome these challenges, we introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter. To study the effect of MLP overparameterization on a small yet difficult proxy task, we implement it for matrix factorization, and find it achieves faster convergence and lower final loss. Extending this approach to larger-scale tasks, we observe consistent performance gains across domains. We achieve improvements in vision-language tasks and especially notable increases in image generation, with CMMD scores improving by up to 15 points.
摘要：低秩适配器仅使用少量参数即可对大型模型进行微调，从而降低存储成本并最大限度地降低灾难性遗忘的风险。然而，它们通常会带来优化挑战，收敛性较差。为了克服这些挑战，我们引入了一种过度参数化的方法，可以在不增加推理成本的情况下加速训练。该方法通过为每一层采用单独的 MLP 和学习到的嵌入来重新参数化低秩自适应。学习到的嵌入被输入到 MLP，MLP 生成适配器参数。这种过度参数化已被证明可以隐式地充当自适应学习率和动量，从而加速优化。在推理时，可以丢弃 MLP，留下一个标准的低秩适配器。为了研究 MLP 过度参数化对小而难的代理任务的影响，我们将其用于矩阵分解，并发现它可以实现更快的收敛和更低的最终损失。将这种方法扩展到更大规模的任务，我们观察到跨领域的一致性能提升。我们在视觉语言任务方面取得了进步，尤其是在图像生成方面取得了显着的提升，CMMD 分数提高了 15 分。
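
Illustration (not from the paper): a minimal sketch of the reparameterization the abstract describes, in which a per-layer learned embedding is fed to an MLP that emits the low-rank factors A and B, and the MLP is dropped once the factors are exported. The class name `OverParamAdapter`, the single tanh hidden layer, all sizes, and the random initialization are assumptions for illustration; training is omitted entirely.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class OverParamAdapter:
    """Hypothetical per-layer generator: learned embedding -> MLP -> (A, B)."""

    def __init__(self, d_in: int, d_out: int, rank: int,
                 embed_dim: int = 4, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.embedding = [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
        self.w1 = rand_matrix(hidden, embed_dim, rng)
        # The output layer emits every entry of A (rank x d_in) and B (d_out x rank).
        self.w2 = rand_matrix(rank * d_in + d_out * rank, hidden, rng)

    def generate(self) -> Tuple[Matrix, Matrix]:
        h = [math.tanh(v) for v in matvec(self.w1, self.embedding)]
        flat = matvec(self.w2, h)
        split = self.rank * self.d_in
        a = [flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        rest = flat[split:]
        b = [rest[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b


def adapted_forward(w0: Matrix, a: Matrix, b: Matrix, x: List[float]) -> List[float]:
    """Standard low-rank adapter forward pass: W0 x + B (A x)."""
    base = matvec(w0, x)
    delta = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, delta)]


w0 = rand_matrix(3, 5, random.Random(1))       # stand-in for a frozen base weight
generator = OverParamAdapter(d_in=5, d_out=3, rank=2)
a, b = generator.generate()                    # export the factors once
del generator                                  # the MLP is not needed at inference
y = adapted_forward(w0, a, b, [1.0, -0.5, 0.25, 0.0, 2.0])
```

After export, inference touches only `w0`, `a`, and `b`, so its cost matches an ordinary low-rank adapter regardless of how large the generating MLP was.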