2025-01-24

Title: Graph Representation Learning with Diffusion Generative Models

Authors: Daniel Wesego
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13133
Pdf URL: https://arxiv.org/pdf/2501.13133
Copy Paste: [[2501.13133]] Graph Representation Learning with Diffusion Generative Models(https://arxiv.org/abs/2501.13133)
Keywords: generation, generative
Abstract: Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. Only the encoder is needed at the end to extract representations. Our approach demonstrates the potential of discrete diffusion models for graph representation learning.
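The abstract does not say which discrete noise process is used, so the following is only a minimal sketch of one common choice: a forward step that flips edges of a binary, undirected adjacency matrix with a fixed probability. The function name `noise_adjacency` and the bit-flip kernel are assumptions made for illustration, not the paper's method.

```python
import numpy as np

def noise_adjacency(adj, flip_prob, rng):
    """Flip each undirected edge slot independently with probability flip_prob.

    Assumption: a symmetric bit-flip kernel on a binary adjacency matrix
    without self-loops. The paper may use a different discrete process.
    """
    n = adj.shape[0]
    flips = rng.random((n, n)) < flip_prob
    flips = np.triu(flips, k=1)   # decide once per node pair, skip the diagonal
    flips = flips | flips.T       # mirror so the graph stays undirected
    return np.where(flips, 1 - adj, adj)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
noisy = noise_adjacency(adj, flip_prob=0.3, rng=rng)
```

Because the noise acts on discrete edge states rather than adding Gaussian noise, the corrupted graph is still a valid graph at every step, which is the property the abstract points to when it contrasts discrete and continuous diffusion.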

Title: Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.13223
Pdf URL: https://arxiv.org/pdf/2501.13223
Copy Paste: [[2501.13223]] Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias(https://arxiv.org/abs/2501.13223)
Keywords: generation
Abstract: As large-scale vision-language models (VLMs) become increasingly central to modern AI applications, understanding and mitigating social biases in these systems has never been more important. We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open-source variants. In particular, we systematically evaluate models trained on varying dataset scales and architectures, as well as multilingual versions encompassing English along with Persian, Turkish, and Finnish, languages with minimal gender marking. To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms rooted in the psychological constructs of communion and agency, and demographic labeling bias using both the FairFace and PATA datasets. Our findings reveal three key insights. First, while larger training datasets can mitigate some biases, they may also introduce or amplify others when the data composition is imbalanced. Second, although increasing model size generally improves performance, it does not consistently reduce bias and can, in certain cases, exacerbate it. Finally, while multilingual training broadens linguistic coverage, it does not inherently neutralize bias and can transfer or intensify inequities across languages. Taken together, these results highlight the necessity of inclusive, carefully curated training data to foster fairness rather than relying solely on model scaling or language expansion. We provide a systematic evaluation of vision-language bias across diverse demographics, underscoring the urgent need for intentional bias mitigation strategies in next-generation AI systems.

Title: AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback

Authors: Joshua Park, Yongfeng Zhang
Subjects: cs.LG, cs.AI, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2501.13333
Pdf URL: https://arxiv.org/pdf/2501.13333
Copy Paste: [[2501.13333]] AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback(https://arxiv.org/abs/2501.13333)
Keywords: generation
Abstract: Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at this https URL.
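The nearest-neighbor step described in the abstract can be sketched as follows. This is only an illustration with hand-made 3-dimensional vectors standing in for SBERT sentence embeddings. The agent names and the helper `recommend_agent` are hypothetical and not taken from the AgentRec code.

```python
import numpy as np

def cosine_similarity_matrix(queries, corpus):
    """Cosine similarity between every query row and every corpus row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T

def recommend_agent(prompt_embedding, corpus_embeddings, corpus_agents):
    """Return the agent label of the most similar stored prompt and its score."""
    sims = cosine_similarity_matrix(prompt_embedding[None, :], corpus_embeddings)[0]
    best = int(np.argmax(sims))
    return corpus_agents[best], float(sims[best])

# Toy stand-ins for embeddings of labeled prompts (assumed values).
corpus_embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.1, 1.0],
])
corpus_agents = ["code_agent", "code_agent", "travel_agent", "math_agent"]

query = np.array([0.05, 0.9, 0.3])
agent, score = recommend_agent(query, corpus_embeddings, corpus_agents)
```

Supporting a new agent in this scheme only requires appending its labeled prompt embeddings to the corpus, which is consistent with the abstract's claim that the architecture adapts to new classes.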

Title: Gradient-Free Adversarial Purification with Diffusion Models

Authors: Xuelong Dai, Dong Wang, Duan Mingxing, Bin Xiao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.13336
Pdf URL: https://arxiv.org/pdf/2501.13336
Copy Paste: [[2501.13336]] Gradient-Free Adversarial Purification with Diffusion Models(https://arxiv.org/abs/2501.13336)
Keywords: super-resolution
Abstract: Adversarial training and adversarial purification are two effective and practical defense methods to enhance a model's robustness against adversarial attacks. However, adversarial training necessitates additional training, while adversarial purification suffers from low time efficiency. More critically, current defenses are designed under the perturbation-based adversarial threat model, which is ineffective against the recently proposed unrestricted adversarial attacks. In this paper, we propose an effective and efficient adversarial defense method that counters both perturbation-based and unrestricted adversarial attacks. Our defense is inspired by the observation that adversarial attacks are typically located near the decision boundary and are sensitive to pixel changes. To address this, we introduce adversarial anti-aliasing to mitigate adversarial modifications. Additionally, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly recover images. These approaches do not require additional training and are computationally efficient without calculating gradients. Extensive experiments against both perturbation-based and unrestricted adversarial attacks demonstrate that our defense method outperforms state-of-the-art adversarial purification methods.

Title: Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models

Authors: Hao Fang, Xiaohang Sui, Hongyao Yu, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13340
Pdf URL: https://arxiv.org/pdf/2501.13340
Copy Paste: [[2501.13340]] Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models(https://arxiv.org/abs/2501.13340)
Keywords: generation, generative
Abstract: Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.

Title: One Fits All: General Mobility Trajectory Modeling via Masked Conditional Diffusion

Authors: Qingyue Long, Can Rong, Huandong Wang, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13347
Pdf URL: https://arxiv.org/pdf/2501.13347
Copy Paste: [[2501.13347]] One Fits All: General Mobility Trajectory Modeling via Masked Conditional Diffusion(https://arxiv.org/abs/2501.13347)
Keywords: generation
Abstract: Trajectory data play a crucial role in many applications, ranging from network optimization to urban planning. Existing studies on trajectory data are task-specific, and their applicability is limited to the specific tasks on which they have been trained, such as generation, recovery, or prediction. However, the potential of a unified model has not yet been fully explored in trajectory modeling. Although various trajectory tasks differ in inputs, outputs, objectives, and conditions, they share common mobility patterns. Based on these common patterns, we can construct a general framework that enables a single model to address different tasks. However, building a trajectory task-general framework faces two critical challenges: 1) the diversity in the formats of different tasks and 2) the complexity of the conditions imposed on different tasks. In this work, we propose a general trajectory modeling framework via masked conditional diffusion (named GenMove). Specifically, we utilize mask conditions to unify diverse formats. To adapt to complex conditions associated with different tasks, we utilize historical trajectory data to obtain contextual trajectory embeddings, which include rich contexts such as spatiotemporal characteristics and user preferences. Integrating the contextual trajectory embedding into diffusion models through a classifier-free guidance approach allows the model to flexibly adjust its outputs based on different conditions. Extensive experiments on mainstream tasks demonstrate that our model significantly outperforms state-of-the-art baselines, with the highest performance improvement exceeding 13% in generation tasks.
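Classifier-free guidance, which the abstract names as the conditioning mechanism, is commonly written as a blend of a conditional and an unconditional noise prediction. The sketch below uses that common form. The function names, the guidance-weight convention, and the condition-dropping helper are assumptions for illustration and are not confirmed to match GenMove.

```python
import numpy as np

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training-time step: randomly swap the condition for a 'null' embedding
    so a single network learns both conditional and unconditional prediction."""
    return null_cond if rng.random() < drop_prob else cond

def guided_prediction(eps_cond, eps_uncond, guidance_weight):
    """Sampling-time step: move from the unconditional prediction toward
    (or past) the conditional one. Weight 0 ignores the condition and
    weight 1 recovers the purely conditional prediction."""
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0])    # toy prediction given a trajectory context
eps_uncond = np.array([0.0, 0.0])  # toy prediction given the null context
eps = guided_prediction(eps_cond, eps_uncond, guidance_weight=2.0)
```

In a setting like the one described, the condition would be the contextual trajectory embedding, and the guidance weight controls how strongly the generated trajectory follows it.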

Title: MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize

Authors: Haohang Xu, Longyu Chen, Shuangrui Ding, Yilin Gao, Dongsheng Jiang, Yin Li, Shugong Xu, Junqing Yu, Wei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13349
Pdf URL: https://arxiv.org/pdf/2501.13349
Copy Paste: [[2501.13349]] MSF: Efficient Diffusion Model Via Multi-Scale Latent Factorize(https://arxiv.org/abs/2501.13349)
Keywords: generation, generative
Abstract: Diffusion-based generative models have achieved remarkable progress in visual content generation. However, traditional diffusion models directly denoise the entire image from noisy inputs, disregarding the hierarchical structure present in visual signals. This method is computationally intensive, especially for high-resolution image generation. Signal processing often leverages hierarchical decompositions; for instance, Fourier analysis decomposes signals by frequency, while wavelet analysis captures localized frequency components, reflecting both spatial and frequency information simultaneously. Inspired by these principles, we propose a multiscale diffusion framework that generates hierarchical visual representations, which are subsequently integrated to form the final output. The diffusion model target, whether raw RGB pixels or latent features from a Variational Autoencoder, is divided into multiple components that each capture distinct spatial levels. The low-resolution component contains the primary informative signal, while higher-resolution components add high-frequency details, such as texture. This approach divides image generation into two stages: producing a low-resolution base signal, followed by a high-resolution residual signal. Both stages can be effectively modeled using simpler, lightweight transformer architectures compared to full-resolution generation. This decomposition is conceptually similar to wavelet decomposition but offers a more streamlined and intuitive design. Our method, termed MSF (short for Multi-Scale Factorization), achieves an FID of 2.2 and an IS of 255.4 on the ImageNet 256x256 benchmark, reducing computational costs by 50% compared to baseline methods.
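The base-plus-residual split described in the abstract can be illustrated on a single 2D array. This is a minimal sketch that assumes average pooling for downsampling and nearest-neighbor upsampling. The abstract does not state which operators MSF uses, and the function names are hypothetical.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a 2D array by an integer factor (assumed operator)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor):
    """Nearest-neighbor upsampling by an integer factor (assumed operator)."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def decompose(x, factor=2):
    """Split x into a low-resolution base and a full-resolution residual."""
    base = downsample(x, factor)
    residual = x - upsample(base, factor)
    return base, residual

def reconstruct(base, residual, factor=2):
    """Invert decompose: upsampled base plus residual."""
    return upsample(base, factor) + residual

x = np.arange(64, dtype=float).reshape(8, 8)
base, residual = decompose(x)
restored = reconstruct(base, residual)
```

With these operators the residual averages to zero within every pooling block, so it carries only the detail that the low-resolution base cannot represent. A generator for the first stage would target `base`, and one for the second stage would target `residual`.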

Title: Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision

Authors: Aman Urumbekov, Zheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13353
Pdf URL: https://arxiv.org/pdf/2501.13353
Copy Paste: [[2501.13353]] Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision(https://arxiv.org/abs/2501.13353)
Keywords: super-resolution
Abstract: Transformers have become increasingly popular for image super-resolution (SR) tasks due to their strong global context modeling capabilities. However, their quadratic computational complexity necessitates the use of window-based attention mechanisms, which restricts the receptive field and limits effective context expansion. Recently, the Mamba architecture has emerged as a promising alternative with linear computational complexity, allowing it to avoid window mechanisms and maintain a large receptive field. Nevertheless, Mamba faces challenges in handling long-context dependencies when high pixel-level precision is required, as in SR tasks. This is due to its hidden state mechanism, which can compress and store a substantial amount of context but only in an approximate manner, leading to inaccuracies that transformers do not suffer from. In this paper, we propose Contrast, a hybrid SR model that combines Convolutional (Con), Transformer (Tra), and State Space (St) components, effectively blending the strengths of transformers and Mamba to address their individual limitations. By integrating transformer and state space mechanisms, Contrast compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy. We demonstrate that combining these two architectures allows us to mitigate the problems inherent in each, resulting in improved performance on image super-resolution tasks.

Title: From Images to Point Clouds: An Efficient Solution for Cross-media Blind Quality Assessment without Annotated Training

Authors: Yipeng Liu, Qi Yang, Yujie Zhang, Yiling Xu, Le Yang, Zhu Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.13387
Pdf URL: https://arxiv.org/pdf/2501.13387
Copy Paste: [[2501.13387]] From Images to Point Clouds: An Efficient Solution for Cross-media Blind Quality Assessment without Annotated Training(https://arxiv.org/abs/2501.13387)
Keywords: quality assessment
Abstract: We present a novel quality assessment method which can predict the perceptual quality of point clouds from new scenes without available annotations by leveraging the rich prior knowledge in images, called the Distribution-Weighted Image-Transferred Point Cloud Quality Assessment (DWIT-PCQA). Recognizing the human visual system (HVS) as the decision-maker in quality assessment regardless of media types, we can emulate the evaluation criteria for human perception via neural networks and further transfer the capability of quality prediction from images to point clouds by leveraging the prior knowledge in the images. Specifically, domain adaptation (DA) can be leveraged to bridge the images and point clouds by aligning feature distributions of the two media in the same feature space. However, the different manifestations of distortions in images and point clouds make feature alignment a difficult task. To reduce the alignment difficulty and consider the different distortion distribution during alignment, we have derived formulas to decompose the optimization objective of the conventional DA into two suboptimization functions with distortion as a transition. Specifically, through network implementation, we propose the distortion-guided biased feature alignment which integrates existing/estimated distortion distribution into the adversarial DA framework, emphasizing common distortion patterns during feature alignment. Besides, we propose the quality-aware feature disentanglement to mitigate the destruction of the mapping from features to quality during alignment with biased distortions. Experimental results demonstrate that our proposed method exhibits reliable performance compared to general blind PCQA methods without needing point cloud annotations.

Title: Towards Intelligent Design: A Self-driven Framework for Collocated Clothing Synthesis Leveraging Fashion Styles and Textures

Authors: Minglong Dong, Dongliang Zhou, Jianghong Ma, Haijun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13396
Pdf URL: https://arxiv.org/pdf/2501.13396
Copy Paste: [[2501.13396]] Towards Intelligent Design: A Self-driven Framework for Collocated Clothing Synthesis Leveraging Fashion Styles and Textures(https://arxiv.org/abs/2501.13396)
Keywords: generation, generative
Abstract: Collocated clothing synthesis (CCS) has emerged as a pivotal topic in fashion technology, primarily concerned with the generation of a clothing item that harmoniously matches a given item. However, previous investigations have relied on using paired outfits, such as a pair of matching upper and lower clothing, to train a generative model for achieving this task. This reliance on the expertise of fashion professionals in the construction of such paired outfits has engendered a laborious and time-intensive process. In this paper, we introduce a new self-driven framework, named style- and texture-guided generative network (ST-Net), to synthesize collocated clothing without the necessity for paired outfits, leveraging self-supervised learning. ST-Net is designed to extrapolate fashion compatibility rules from the style and texture attributes of clothing, using a generative adversarial network. To facilitate the training and evaluation of our model, we have constructed a large-scale dataset specifically tailored for unsupervised CCS. Extensive experiments substantiate that our proposed method outperforms the state-of-the-art baselines in terms of both visual authenticity and fashion compatibility.

Title: Auto-Prompting SAM for Weakly Supervised Landslide Extraction

Authors: Jian Wang, Xiaokang Zhang, Xianping Ma, Weikang Yu, Pedram Ghamisi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13426
Pdf URL: https://arxiv.org/pdf/2501.13426
Copy Paste: [[2501.13426]] Auto-Prompting SAM for Weakly Supervised Landslide Extraction(https://arxiv.org/abs/2501.13426)
Keywords: generation
Abstract: Weakly supervised landslide extraction aims to identify landslide regions from remote sensing data using models trained with weak labels, particularly image-level labels. However, it is often challenged by the imprecise boundaries of the extracted objects due to the lack of pixel-wise supervision and the properties of landslide objects. To tackle these issues, we propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM), i.e., APSAM. Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Specifically, it adaptively generates hybrid prompts from the CAMs obtained by an object localization network. To provide sufficient information for SAM prompting, an adaptive prompt generation (APG) algorithm is designed to fully leverage the visual patterns of CAMs, enabling the efficient generation of pseudo-masks for landslide extraction. These informative prompts are able to identify the extent of landslide areas (box prompts) and denote the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method, achieving improvements of at least 3.0% in F1 score and 3.69% in IoU compared to other state-of-the-art methods. The source codes and datasets will be available at this https URL.
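As a rough illustration of turning a CAM into box and point prompts, the sketch below thresholds a normalized activation map, takes the bounding box of the activated region as the box prompt, and uses the peak activation as the point prompt. The fixed threshold, the single-region handling, and the function name `prompts_from_cam` are assumptions. The paper's APG algorithm is adaptive and is not reproduced here.

```python
import numpy as np

def prompts_from_cam(cam, threshold=0.5):
    """Derive one box prompt and one point prompt from a 2D activation map.

    Returns (x_min, y_min, x_max, y_max) and (x, y) in pixel coordinates.
    Assumption: a fixed threshold and a single activated region.
    """
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    ys, xs = np.nonzero(cam >= threshold)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    peak_y, peak_x = np.unravel_index(np.argmax(cam), cam.shape)
    point = (int(peak_x), int(peak_y))
    return box, point

# Toy activation map: one bright block with a single peak inside it.
cam = np.zeros((6, 8))
cam[2:5, 3:6] = 0.8
cam[3, 4] = 1.0
box, point = prompts_from_cam(cam)
```

The resulting box and point would then be passed to a promptable segmenter such as SAM to obtain a pseudo-mask. That call is omitted here because it depends on the specific SAM interface in use.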

Title: GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection

Authors: Jiaxin Chen, Miao Hu, Dengyong Zhang, Jingyang Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13435
Pdf URL: https://arxiv.org/pdf/2501.13435
Copy Paste: [[2501.13435]] GC-ConsFlow: Leveraging Optical Flow Residuals and Global Context for Robust Deepfake Detection(https://arxiv.org/abs/2501.13435)
Keywords: generation
Abstract: The rapid development of Deepfake technology has enabled the generation of highly realistic manipulated videos, posing severe social and ethical challenges. Existing Deepfake detection methods primarily focus on either spatial or temporal inconsistencies, often neglecting the interplay between the two or suffering from interference caused by natural facial motions. To address these challenges, we propose the global context consistency flow (GC-ConsFlow), a novel dual-stream framework that effectively integrates spatial and temporal features for robust Deepfake detection. The global grouped context aggregation module (GGCA), integrated into the global context-aware frame flow stream (GCAF), enhances spatial feature extraction by aggregating grouped global context information, enabling the detection of subtle spatial artifacts within frames. Rather than directly modeling the residuals, the flow-gradient temporal consistency stream (FGTC) uses optical flow residuals and gradient-based features to make temporal feature extraction more robust against the inconsistency introduced by unnatural facial motion. By combining these two streams, GC-ConsFlow captures complementary spatiotemporal forgery traces effectively and robustly. Extensive experiments show that GC-ConsFlow outperforms existing state-of-the-art methods in detecting Deepfake videos under various compression scenarios.

Title: EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

Authors: Jiangchuan Wei, Shiyue Yan, Wenfeng Lin, Boyuan Liu, Renjie Chen, Mingyu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13452
Pdf URL: https://arxiv.org/pdf/2501.13452
Copy Paste: [[2501.13452]] EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion(https://arxiv.org/abs/2501.13452)
Keywords: generation
Abstract: Recent advancements in video generation have significantly impacted various downstream applications, particularly in identity-preserving video generation (IPT2V). However, existing methods struggle with "copy-paste" artifacts and low similarity issues, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts reflecting irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid the introduction of artifacts; (2) a two-stage training strategy, incorporating a stochastic method in the second phase to randomly utilize shallow facial information. The objective is to retain the fidelity gains provided by shallow features while mitigating excessive reliance on them. This strategy encourages the model to utilize high-level features during training, ultimately fostering a more robust representation of facial identities. EchoVideo effectively preserves facial identities and maintains full-body integrity. Extensive experiments demonstrate that it achieves excellent results in generating high-quality, controllable, high-fidelity videos.

Title: LDR-Net: A Novel Framework for AI-generated Image Detection via Localized Discrepancy Representation

Authors: JiaXin Chen, Miao Hu, DengYong Zhang, Yun Song, Xin Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13475
Pdf URL: https://arxiv.org/pdf/2501.13475
Copy Paste: [[2501.13475]] LDR-Net: A Novel Framework for AI-generated Image Detection via Localized Discrepancy Representation(https://arxiv.org/abs/2501.13475)
Keywords: generative
Abstract: With the rapid advancement of generative models, the visual quality of generated images has become nearly indistinguishable from the real ones, posing challenges to content authenticity verification. Existing methods for detecting AI-generated images primarily focus on specific forgery clues, which are often tailored to particular generative models like GANs or diffusion models. These approaches struggle to generalize across architectures. Building on the observation that generated images often exhibit local anomalies, such as excessive smoothness, blurred textures, and unnatural pixel variations in small regions, we propose the localized discrepancy representation network (LDR-Net), a novel approach for detecting AI-generated images. LDR-Net captures smoothing artifacts and texture irregularities, which are common but often overlooked. It integrates two complementary modules: local gradient autocorrelation (LGA), which models local smoothing anomalies, and local variation pattern (LVP), which captures unnatural regularities by modeling the complexity of image patterns. Merging LGA and LVP features provides a comprehensive representation of localized discrepancies. Extensive experiments demonstrate that our LDR-Net achieves state-of-the-art performance in detecting generated images and exhibits satisfactory generalization across unseen generative models. The code will be released upon acceptance of this paper.

Title: ReasVQA: Advancing VideoQA with Imperfect Reasoning Process

Authors: Jianxin Liang, Xiaojun Meng, Huishuai Zhang, Yueqian Wang, Jiansheng Wei, Dongyan Zhao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.13536
Pdf URL: https://arxiv.org/pdf/2501.13536
Copy Paste: [[2501.13536]] ReasVQA: Advancing VideoQA with Imperfect Reasoning Process(https://arxiv.org/abs/2501.13536)
Keywords: generation
Abstract: Video Question Answering (VideoQA) is a challenging task that requires understanding complex visual and temporal relationships within videos to answer questions accurately. In this work, we introduce ReasVQA (Reasoning-enhanced Video Question Answering), a novel approach that leverages reasoning processes generated by Multimodal Large Language Models (MLLMs) to improve the performance of VideoQA models. Our approach consists of three phases: reasoning generation, reasoning refinement, and learning from reasoning. First, we generate detailed reasoning processes using additional MLLMs. Second, we refine them via a filtering step to ensure data quality. Finally, we use the reasoning data, which might be in an imperfect form, to guide the VideoQA model via multi-task learning on how to interpret and answer questions based on a given video. We evaluate ReasVQA on three popular benchmarks, and our results establish new state-of-the-art performance with significant improvements of +2.9 on NExT-QA, +7.3 on STAR, and +5.9 on IntentQA. Our findings demonstrate the supervisory benefits of integrating reasoning processes into VideoQA. Further studies validate each component of our method, including with different backbones and MLLMs, and again highlight the advantages of this simple but effective method. We offer a new perspective on enhancing VideoQA performance by utilizing advanced reasoning techniques, setting a new benchmark in this research field.

Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt

Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13554
Pdf URL: https://arxiv.org/pdf/2501.13554
Copy Paste: [[2501.13554]] One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt(https://arxiv.org/abs/2501.13554)
Keywords: generation
Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to generate the consistent, identity-preserving images that storytelling requires. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at this https URL.

Title: EventVL: Understand Event Streams via Multimodal Large Language Model

Authors: Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13707
Pdf URL: https://arxiv.org/pdf/2501.13707
Copy Paste: [[2501.13707]] EventVL: Understand Event Streams via Multimodal Large Language Model(https://arxiv.org/abs/2501.13707)
Keywords: generation, generative
Abstract: Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works merely use CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding the semantics and context of event streams. To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality pairs of data, which enables effective learning across various scenes, e.g., driving scenes or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and segmenting the event stream. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic spaces of events. Extensive experiments show that EventVL significantly surpasses existing MLLM baselines in event captioning and scene description generation tasks. We hope our research can contribute to the development of the event vision community.

Title: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation

Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13718
Pdf URL: https://arxiv.org/pdf/2501.13718
Copy Paste: [[2501.13718]] A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation(https://arxiv.org/abs/2501.13718)
Keywords: generation, generative
Abstract: In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics and latent variable utilization remain only empirically observed. In this work, we propose a novel framework to systematically quantify the impact of each latent variable in MLVGMs, using Mutual Information (MI) as a guiding metric. Our analysis reveals underutilized variables and can guide the use of MLVGMs in downstream applications. With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, and guided by the previous analysis, we apply tailored latent perturbations to produce diverse views for SSCRL, without relying on real data at all. Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that views generated by MLVGMs are on par with or even surpass views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning.

Title: Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling

Authors: Tanya Rodchenko, Natasha Noy, Nino Scherrer, Jennifer Prendki
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13779
Pdf URL: https://arxiv.org/pdf/2501.13779
Copy Paste: [[2501.13779]] Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling(https://arxiv.org/abs/2501.13779)
Keywords: generation
Abstract: While Large Language Models require more and more data to train and scale, rather than looking for any data to acquire, we should consider what types of tasks are more likely to benefit from data scaling. We should be intentional in our data acquisition. We argue that the topology of data itself informs which tasks to prioritize in data scaling, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.

Title: Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes

Authors: Shiling Deng, Serge Belongie, Peter Ebert Christensen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.13851
Pdf URL: https://arxiv.org/pdf/2501.13851
Copy Paste: [[2501.13851]] Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes(https://arxiv.org/abs/2501.13851)
Keywords: generation
Abstract: Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline that leverages large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.

Title: Generating Realistic Forehead-Creases for User Verification via Conditioned Piecewise Polynomial Curves

Authors: Abhishek Tandon, Geetanjali Sharma, Gaurav Jaswal, Aditya Nigam, Raghavendra Ramachandra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13889
Pdf URL: https://arxiv.org/pdf/2501.13889
Copy Paste: [[2501.13889]] Generating Realistic Forehead-Creases for User Verification via Conditioned Piecewise Polynomial Curves(https://arxiv.org/abs/2501.13889)
Keywords: generation
Abstract: We propose a trait-specific image generation method that models forehead creases geometrically using B-spline and Bézier curves. This approach ensures the realistic generation of both principal creases and non-prominent crease patterns, effectively constructing detailed and authentic forehead-crease images. These geometrically rendered images serve as visual prompts for a diffusion-based Edge-to-Image translation model, which generates corresponding mated samples. The resulting novel synthetic identities are then used to train a forehead-crease verification network. To enhance intra-subject diversity in the generated samples, we employ two strategies: (a) perturbing the control points of B-splines under defined constraints to maintain label consistency, and (b) applying image-level augmentations to the geometric visual prompts, such as dropout and elastic transformations, specifically tailored to crease patterns. By integrating the proposed synthetic dataset with real-world data, our method significantly improves the performance of forehead-crease verification systems under a cross-database verification protocol.
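The geometric step described in this abstract (curves defined by control points, then perturbed within constraints) can be illustrated with a minimal sketch. It is not the authors' code: it uses a plain Bézier curve evaluated with de Casteljau's algorithm rather than B-splines, and the names `bezier_point`, `perturb`, and `max_shift` are hypothetical. The bound `max_shift` stands in for the "defined constraints" the abstract mentions without specifying.

```python
import random

def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Linearly interpolate between each pair of consecutive points.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def perturb(control_points, max_shift, rng):
    """Shift every coordinate by a uniform offset in [-max_shift, max_shift].

    Assumption: a simple box constraint; the paper's actual constraints
    are not given in the abstract.
    """
    return [tuple(c + rng.uniform(-max_shift, max_shift) for c in p)
            for p in control_points]

# One hypothetical crease as a quadratic curve, plus a perturbed variant.
crease = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
variant = perturb(crease, max_shift=0.1, rng=random.Random(0))
polyline = [bezier_point(variant, i / 10) for i in range(11)]
```

Sampling several perturbed variants of the same control points would give distinct renderings of one synthetic identity, which is the intra-subject diversity the abstract refers to.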

Title: Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

Authors: Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13893
Pdf URL: https://arxiv.org/pdf/2501.13893
Copy Paste: [[2501.13893]] Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning(https://arxiv.org/abs/2501.13893)
Keywords: generation
Abstract: We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains in CIDEr +1.4%, ROUGE +0.4%, and SPICE +0.5% on Visual Genome dataset, and strengthens its region understanding ability on the ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy +11.2% and language generation quality +22.2%.

Title: PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection

Authors: Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.13898
Pdf URL: https://arxiv.org/pdf/2501.13898
Copy Paste: [[2501.13898]] PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection(https://arxiv.org/abs/2501.13898)
Keywords: generation
Abstract: With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model's ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at this https URL.

Title: Binary Diffusion Probabilistic Model

Authors: Vitaliy Kinakh, Slava Voloshynovskiy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13915
Pdf URL: https://arxiv.org/pdf/2501.13915
Copy Paste: [[2501.13915]] Binary Diffusion Probabilistic Model(https://arxiv.org/abs/2501.13915)
Keywords: restoration, super-resolution, generative
Abstract: We introduce the Binary Diffusion Probabilistic Model (BDPM), a novel generative model optimized for binary data representations. While denoising diffusion probabilistic models (DDPMs) have demonstrated notable success in tasks like image synthesis and restoration, traditional DDPMs rely on continuous data representations and mean squared error (MSE) loss for training, applying Gaussian noise models that may not be optimal for discrete or binary data structures. BDPM addresses this by decomposing images into bitplanes and employing XOR-based noise transformations, with a denoising model trained using binary cross-entropy loss. This approach enables precise noise control and computationally efficient inference, significantly lowering computational costs and improving model convergence. When evaluated on image restoration tasks such as image super-resolution, inpainting, and blind image restoration, BDPM outperforms state-of-the-art methods on the FFHQ, CelebA, and CelebA-HQ datasets. Notably, BDPM requires fewer inference steps than traditional DDPM models to reach optimal results, showcasing enhanced inference efficiency.
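The three ingredients named in this abstract (bitplane decomposition, XOR-based noise, binary cross-entropy) can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the BDPM implementation: the function names are hypothetical, the noise is an independent Bernoulli bit-flip mask with a single `flip_prob`, and the abstract does not say whether the denoiser predicts the clean bits or the flip mask.

```python
import math
import random

def to_bitplanes(pixels, bits=8):
    """Split integers in [0, 2**bits) into `bits` binary planes, MSB first."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits - 1, -1, -1)]

def from_bitplanes(planes):
    """Inverse of to_bitplanes: reassemble integers from MSB-first planes."""
    bits = len(planes)
    out = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        for j, bit in enumerate(plane):
            out[j] |= bit << (bits - 1 - i)
    return out

def xor_noise(planes, flip_prob, rng):
    """XOR each bit with a Bernoulli(flip_prob) mask; return (noisy, masks)."""
    masks = [[1 if rng.random() < flip_prob else 0 for _ in plane]
             for plane in planes]
    noisy = [[b ^ m for b, m in zip(plane, mask)]
             for plane, mask in zip(planes, masks)]
    return noisy, masks

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean BCE between binary targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)

pixels = [0, 37, 128, 255]
planes = to_bitplanes(pixels)
noisy, masks = xor_noise(planes, flip_prob=0.25, rng=random.Random(0))
# XOR is its own inverse, so applying the same mask again restores the bits.
restored = [[n ^ m for n, m in zip(row, mask)] for row, mask in zip(noisy, masks)]
```

Because XOR is self-inverse, a model that predicts the flip mask exactly recovers the clean bitplanes exactly, which is one plausible reading of the "precise noise control" claim.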

Title: Improving Video Generation with Human Feedback

Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13918
Pdf URL: https://arxiv.org/pdf/2501.13918
Copy Paste: [[2501.13918]] Improving Video Generation with Human Feedback(https://arxiv.org/abs/2501.13918)
Keywords: generation
Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like non-smooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: this https URL.

Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.13920
Pdf URL: https://arxiv.org/pdf/2501.13920
Copy Paste: [[2501.13920]] IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models(https://arxiv.org/abs/2501.13920)
Keywords: generation
Abstract: With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at this https URL.

Title: GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.13925
Pdf URL: https://arxiv.org/pdf/2501.13925
Copy Paste: [[2501.13925]] GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing(https://arxiv.org/abs/2501.13925)
Keywords: generation
Abstract: Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.

Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.13926
Pdf URL: https://arxiv.org/pdf/2501.13926
Copy Paste: [[2501.13926]] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step(https://arxiv.org/abs/2501.13926)
Keywords: generation
Abstract: Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at this https URL.
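Two of the techniques listed in this abstract have well-known generic forms, sketched below for orientation. The first is best-of-N selection, a common way to spend extra test-time compute with a verifier; the second is the standard DPO loss for one preference pair. Neither is taken from the paper's code: `best_of_n`, `dpo_loss`, the `score` callable, and the default `beta=0.1` are illustrative assumptions, and PARM itself is not reproduced here.

```python
import math

def best_of_n(candidates, score):
    """Return the candidate with the highest verifier score.

    `score` is a hypothetical stand-in for a reward model that rates a sample.
    """
    return max(candidates, key=score)

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Toy usage: string length plays the role of a reward score.
picked = best_of_n(["a", "bbb", "cc"], score=len)
loss = dpo_loss(-1.0, -3.0, ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

The loss falls as the policy raises the chosen sample's log-probability relative to the reference and lowers the rejected one's, with `beta` controlling how strongly the policy may drift from the reference.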