2025-01-03

Title: Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

Authors: Dibakar Gope, David Mansell, Danny Loh, Ian Bratt
Subjects: cs.LG, cs.AI, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00032
Pdf URL: https://arxiv.org/pdf/2501.00032
Copy Paste: [[2501.00032]] Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs(https://arxiv.org/abs/2501.00032)
Keywords: generation
Abstract: Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, which makes these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. Together with an optimized interleaved group data layout for weights and decompression path optimizations that reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, this significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to this http URL-based solution. The optimized kernels are available at this https URL.
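To make the two quantization ideas in this abstract concrete, the following is a minimal sketch and not the paper's kernels or format. It contrasts symmetric uniform group quantization (one shared scale per group) with a groupwise non-uniform codebook lookup. The function names, the max-abs scaling rule, and the toy codebook are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy codebook, denser near zero than a uniform grid.
TOY_CODEBOOK = [-1.0, -0.5, -0.2, 0.0, 0.2, 0.5, 1.0]


def quantize_group_uniform(weights: Sequence[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of one weight group with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group_uniform(codes: Sequence[int], scale: float) -> List[float]:
    """Recover approximate weights: one multiply per weight."""
    return [c * scale for c in codes]


def quantize_group_codebook(weights: Sequence[float], codebook: Sequence[float]) -> Tuple[List[int], float]:
    """Store, per weight, the index of the nearest codebook entry (after scaling)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs if max_abs > 0 else 1.0  # assumption: normalize group to [-1, 1]
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in weights
    ]
    return indices, scale


def dequantize_group_codebook(indices: Sequence[int], scale: float, codebook: Sequence[float]) -> List[float]:
    """Recover approximate weights: one table lookup and one multiply per weight."""
    return [codebook[i] * scale for i in indices]
```

In this sketch the codebook path replaces the integer-to-float conversion with a table lookup, which is one way a non-uniform grid can follow a weight distribution concentrated near zero.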

Title: DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework

Authors: Yu-Zheng Lin, Qinxuan Shi, Zhanglong Yang, Banafsheh Saber Latibari, Sicong Shao, Soheil Salehi, Pratik Satam
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2501.00051
Pdf URL: https://arxiv.org/pdf/2501.00051
Copy Paste: [[2501.00051]] DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework(https://arxiv.org/abs/2501.00051)
Keywords: generative
Abstract: Digital twin (DT) technology has emerged as a transformative approach to simulate, predict, and optimize the behavior of physical systems, with applications that span manufacturing, healthcare, climate science, and more. However, the development of DT models often faces challenges such as high data requirements, integration complexity, and limited adaptability to dynamic changes in physical systems. This paper presents a new method inspired by dynamic data-driven applications systems (DDDAS), called the dynamic data-driven generative digital twin framework (DDD-GenDT), which couples the physical system with an LLM, allowing the LLM to act as a DT that interacts with the physical system's operating status and generates the corresponding physical behaviors. We apply DDD-GenDT to the computer numerical control (CNC) machining process, and we use the spindle current measurement data in the NASA milling wear data set as an example to enable LLMs to forecast the physical behavior from historical data and interact with current observations. Experimental results show that in the zero-shot prediction setting, the LLM-based DT can adapt to the change in the system, and the average RMSE of the GPT-4 prediction is 0.479 A, which is 4.79% of the maximum spindle motor current measurement of 10 A, with little training data and instructions required. Furthermore, we analyze the performance of DDD-GenDT in this specific application and its potential to construct digital twins. We also discuss the limitations and challenges that may arise in practical implementations.
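The relative error quoted in this abstract follows from dividing the RMSE by the full-scale current. The sketch below shows that arithmetic; the helper names are hypothetical, and only the 0.479 A and 10 A figures come from the abstract.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equally long series."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def percent_of_full_scale(error: float, full_scale: float) -> float:
    """Express an error as a percentage of the maximum measured value."""
    return 100.0 * error / full_scale


# Figures reported in the abstract: RMSE 0.479 A against a 10 A maximum.
reported = percent_of_full_scale(0.479, 10.0)  # about 4.79 percent
```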

Title: Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market

Authors: Guangming Che
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00063
Pdf URL: https://arxiv.org/pdf/2501.00063
Copy Paste: [[2501.00063]] Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market(https://arxiv.org/abs/2501.00063)
Keywords: generative
Abstract: The financial industry is increasingly seeking robust methods to address the challenges posed by data scarcity and low signal-to-noise ratios, which limit the application of deep learning techniques in stock market analysis. This paper presents two innovative generative model-based approaches to synthesize stock data, specifically tailored for different scenarios within the A-share market in China. The first method, a sector-based synthesis approach, enhances the signal-to-noise ratio of stock data by classifying the characteristics of stocks from various sectors in China's A-share market. This method employs an Approximate Non-Local Total Variation algorithm to smooth the generated data, a bandpass filtering method based on Fourier Transform to eliminate noise, and Denoising Diffusion Implicit Models to accelerate sampling speed. The second method, a recursive stock data synthesis approach based on pattern recognition, is designed to synthesize data for stocks with short listing periods and limited comparable companies. It leverages pattern recognition techniques and Markov models to learn and generate variable-length stock sequences, while introducing a sub-time-level data augmentation method to alleviate data scarcity. We validate the effectiveness of these methods through extensive experiments on various datasets, including those from the main board, STAR Market, Growth Enterprise Market Board, Beijing Stock Exchange, NASDAQ, NYSE, and AMEX. The results demonstrate that our synthesized data not only improve the performance of predictive models but also enhance the signal-to-noise ratio of individual stock signals in price trading strategies. Furthermore, the introduction of sub-time-level data significantly improves the quality of synthesized data.
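As a rough illustration of the Fourier-based bandpass filtering step named in this abstract, the sketch below zeroes all frequency bins outside a chosen band and transforms back. It uses a naive O(N^2) DFT from the standard library for self-containment. The function names and the bin-index band limits are assumptions, and nothing here reflects the paper's actual filter design.

```python
import cmath
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform (fine for short illustrative series)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def bandpass(signal: Sequence[float], low_bin: int, high_bin: int) -> List[float]:
    """Keep only components whose frequency bin lies in [low_bin, high_bin]."""
    n = len(signal)
    spectrum = dft(signal)
    kept = [
        # min(k, n - k) folds the mirrored negative-frequency bins onto positive ones.
        value if low_bin <= min(k, n - k) <= high_bin else 0j
        for k, value in enumerate(spectrum)
    ]
    return idft_real(kept)
```

For a price series, the low cutoff would remove slow drift and the high cutoff would remove fast fluctuations treated as noise; choosing those cutoffs is the modeling decision this sketch leaves open.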

Title: A Novel Framework for Learning Stochastic Representations for Sequence Generation and Recognition

Authors: Jungsik Hwang, Ahmadreza Ahmadi
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00076
Pdf URL: https://arxiv.org/pdf/2501.00076
Copy Paste: [[2501.00076]] A Novel Framework for Learning Stochastic Representations for Sequence Generation and Recognition(https://arxiv.org/abs/2501.00076)
Keywords: generation
Abstract: The ability to generate and recognize sequential data is fundamental for autonomous systems operating in dynamic environments. Inspired by two key principles of the brain (predictive coding and the Bayesian brain), we propose a novel stochastic Recurrent Neural Network with Parametric Biases (RNNPB). The proposed model incorporates stochasticity into the latent space using the reparameterization trick used in variational autoencoders. This approach enables the model to learn probabilistic representations of multidimensional sequences, capturing uncertainty and enhancing robustness against overfitting. We tested the proposed model on a robotic motion dataset to assess its performance in generating and recognizing temporal patterns. The experimental results showed that the stochastic RNNPB model outperformed its deterministic counterpart in generating and recognizing motion sequences. The results highlighted the proposed model's capability to quantify and adjust uncertainty during both learning and inference. The stochasticity resulted in a continuous latent space representation, facilitating stable motion generation and enhanced generalization when recognizing novel sequences. Our approach provides a biologically inspired framework for modeling temporal patterns and advances the development of robust and adaptable systems in artificial intelligence and robotics.
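The reparameterization trick mentioned in this abstract can be sketched in a few lines: a latent sample is written as a deterministic function of a mean, a log-variance, and independent standard normal noise, so that gradients can flow through the mean and variance. The function name and the log-variance parameterization are assumptions for illustration, and this is plain Python without any automatic differentiation.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float], rng: random.Random) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Example: a 3-dimensional parametric-bias vector sampled around its mean.
sample = reparameterize([0.0, 1.0, -1.0], [0.0, -2.0, -4.0], random.Random(0))
```

A smaller log-variance shrinks the sample toward the mean, which is how a model of this kind could express low uncertainty for a well-learned sequence.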

Title: LTX-Video: Realtime Video Latent Diffusion

Authors: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00103
Pdf URL: https://arxiv.org/pdf/2501.00103
Copy Paste: [[2501.00103]] LTX-Video: Realtime Video Latent Diffusion(https://arxiv.org/abs/2501.00103)
Keywords: generation
Abstract: We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
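The figures in this abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 1:192 ratio counts RGB channel values (three per pixel) and that a 5-second clip at 24 fps is treated as exactly 120 frames. Both are assumptions, so the derived latent width and token count are illustrative rather than values confirmed by the abstract.

```python
# Values stated in the abstract.
PATCH_W, PATCH_H, PATCH_T = 32, 32, 8   # pixels per token along width, height, time
COMPRESSION_RATIO = 192                 # 1:192
WIDTH, HEIGHT = 768, 512
SECONDS, FPS = 5, 24
GENERATION_SECONDS = 2

# Assumption: 3 colour channels per pixel enter the compression ratio.
RGB_CHANNELS = 3

pixels_per_token = PATCH_W * PATCH_H * PATCH_T                 # 8192
values_per_token = pixels_per_token * RGB_CHANNELS             # 24576
latent_values_per_token = values_per_token // COMPRESSION_RATIO  # 128

# Assumption: exactly SECONDS * FPS frames, ignoring any extra leading frame.
frames = SECONDS * FPS                                         # 120
tokens = (WIDTH // PATCH_W) * (HEIGHT // PATCH_H) * (frames // PATCH_T)  # 24 * 16 * 15

# Ratio of video duration to generation time on the reported hardware.
realtime_factor = SECONDS / GENERATION_SECONDS                 # 2.5
```

Under these assumptions, full spatiotemporal self-attention runs over a few thousand tokens rather than tens of millions of pixels, which is the efficiency argument the abstract makes.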

Title: PQD: Post-training Quantization for Efficient Diffusion Models

Authors: Jiaojiao Ye, Zhen Wang, Linnan Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00124
Pdf URL: https://arxiv.org/pdf/2501.00124
Copy Paste: [[2501.00124]] PQD: Post-training Quantization for Efficient Diffusion Models(https://arxiv.org/abs/2501.00124)
Keywords: generation, generative
Abstract: Diffusion models (DMs) have demonstrated remarkable achievements in synthesizing images of high fidelity and diversity. However, the extensive computational requirements and slow generative speed of diffusion models have limited their widespread adoption. In this paper, we propose a novel post-training quantization for diffusion models (PQD), which is a time-aware optimization framework for diffusion models based on post-training quantization. The proposed framework optimizes the inference process by selecting representative samples and conducting time-aware calibration. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, with only a small change in FID on ImageNet for unconditional image generation. Our approach demonstrates compatibility and can also be applied to 512x512 text-guided image generation for the first time.

Title: TrajLearn: Trajectory Prediction Learning using Deep Generative Models

Authors: Amirhossein Nadiri, Jing Li, Ali Faraji, Ghadeer Abuoda, Manos Papagelis
Subjects: cs.LG, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2501.00184
Pdf URL: https://arxiv.org/pdf/2501.00184
Copy Paste: [[2501.00184]] TrajLearn: Trajectory Prediction Learning using Deep Generative Models(https://arxiv.org/abs/2501.00184)
Keywords: generative
Abstract: Trajectory prediction aims to estimate an entity's future path using its current position and historical movement data, benefiting fields like autonomous navigation, robotics, and human movement analytics. Deep learning approaches have become key in this area, utilizing large-scale trajectory datasets to model movement patterns, but face challenges in managing complex spatial dependencies and adapting to dynamic environments. To address these challenges, we introduce TrajLearn, a novel model for trajectory prediction that leverages generative modeling of higher-order mobility flows based on hexagonal spatial representation. TrajLearn predicts the next $k$ steps by integrating a customized beam search for exploring multiple potential paths while maintaining spatial continuity. We conducted a rigorous evaluation of TrajLearn, benchmarking it against leading state-of-the-art approaches and meaningful baselines. The results indicate that TrajLearn achieves significant performance gains, with improvements of up to ~40% across multiple real-world trajectory datasets. In addition, we evaluated different prediction horizons (i.e., various values of $k$), conducted resolution sensitivity analysis, and performed ablation studies to assess the impact of key model components. Furthermore, we developed a novel algorithm to generate mixed-resolution maps by hierarchically subdividing hexagonal regions into finer segments within a specified observation area. This approach supports selective detailing, applying finer resolution to areas of interest or high activity (e.g., urban centers) while using coarser resolution for less significant regions (e.g., rural areas), effectively reducing data storage requirements and computational overhead. We promote reproducibility and adaptability by offering complete code, data, and detailed documentation with flexible configuration options for various applications.
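The combination of a hexagonal grid and beam search described in this abstract can be sketched as follows. This is not TrajLearn's model: the scoring function `toy_next_cell_log_probs` is a hypothetical stand-in for a learned generative model, and cells use axial hex coordinates as an assumed representation. Spatial continuity is enforced here simply by proposing only the six adjacent cells at each step.

```python
import math
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # axial (q, r) coordinates on a hexagonal grid

# The six neighbours of a hexagon in axial coordinates.
HEX_DIRECTIONS: List[Cell] = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def toy_next_cell_log_probs(history: List[Cell]) -> Dict[Cell, float]:
    """Hypothetical stand-in for a learned model: favours keeping the current heading."""
    current = history[-1]
    heading = None
    if len(history) >= 2:
        heading = (current[0] - history[-2][0], current[1] - history[-2][1])
    weights = {
        (current[0] + dq, current[1] + dr): (4.0 if (dq, dr) == heading else 1.0)
        for dq, dr in HEX_DIRECTIONS
    }
    total = sum(weights.values())
    return {cell: math.log(w / total) for cell, w in weights.items()}


def beam_search(
    history: List[Cell],
    k: int,
    beam_width: int,
    model: Callable[[List[Cell]], Dict[Cell, float]] = toy_next_cell_log_probs,
) -> List[Tuple[float, List[Cell]]]:
    """Return up to beam_width candidate continuations of length k with their log-scores."""
    beams: List[Tuple[float, List[Cell]]] = [(0.0, list(history))]
    for _ in range(k):
        candidates = []
        for score, path in beams:
            for cell, log_prob in model(path).items():
                candidates.append((score + log_prob, path + [cell]))
        # Keep only the highest-scoring partial paths.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return [(score, path[len(history):]) for score, path in beams]
```

Calling `beam_search([(0, 0), (1, 0)], k=3, beam_width=2)` returns the two highest-scoring three-step continuations; with the toy model, the top one keeps moving in the `(1, 0)` direction.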

Title: MLLM-as-a-Judge for Image Safety without Human Labeling

Authors: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain
Subjects: cs.CV, cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00192
Pdf URL: https://arxiv.org/pdf/2501.00192
Copy Paste: [[2501.00192]] MLLM-as-a-Judge for Image Safety without Human Labeling(https://arxiv.org/abs/2501.00192)
Keywords: generation
Abstract: Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.

Title: DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion

Authors: Zixuan Yin, Han Sun, Ningzhong Liu, Huiyu Zhou, Jiaquan Shen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00220
Pdf URL: https://arxiv.org/pdf/2501.00220
Copy Paste: [[2501.00220]] DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion(https://arxiv.org/abs/2501.00220)
Keywords: generation
Abstract: Lidars and cameras play essential roles in autonomous driving, offering complementary information for 3D detection. The state-of-the-art fusion methods integrate them at the feature level, but they mostly rely on the learned soft association between point clouds and images, which lacks interpretability and neglects the hard association between them. In this paper, we combine feature-level fusion with point-level fusion, using hard association established by the calibration matrices to guide the generation of object queries. Specifically, in the early fusion stage, we use the 2D CNN features of images to decorate the point cloud data, and employ two independent sparse convolutions to extract the decorated point cloud features. In the mid-level fusion stage, we initialize the queries with a center heatmap and embed the predicted class labels as auxiliary information into the queries, making the initial positions closer to the actual centers of the targets. Extensive experiments conducted on two popular datasets, KITTI and Waymo, demonstrate the superiority of DecoratingFusion.
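The "hard association" via calibration matrices in this abstract amounts to projecting each LiDAR point into the image and reading the image feature at that pixel. The sketch below illustrates that idea under stated assumptions: a single 3x4 projection matrix `P` that already combines extrinsics and intrinsics, nearest-pixel lookup by flooring, and zero features for points that fall outside the image. None of these choices are taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def project_point(P: Matrix, point: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a 3x4 matrix; return pixel (u, v) or None if behind the camera."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    u_h, v_h, depth = (sum(p * h for p, h in zip(row, homogeneous)) for row in P)
    if depth <= 0:
        return None
    return u_h / depth, v_h / depth


def decorate_points(
    points: Sequence[Sequence[float]],
    P: Matrix,
    feature_map: Sequence[Sequence[Sequence[float]]],  # indexed as [row][col][channel]
) -> List[List[float]]:
    """Append the image feature at each point's projected pixel to the point's coordinates."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    decorated = []
    for point in points:
        feature = [0.0] * channels  # default for points with no valid pixel
        pixel = project_point(P, point)
        if pixel is not None:
            col, row = math.floor(pixel[0]), math.floor(pixel[1])
            if 0 <= col < width and 0 <= row < height:
                feature = list(feature_map[row][col])
        decorated.append(list(point) + feature)
    return decorated
```

The decorated points would then feed a point-cloud backbone; the abstract mentions sparse convolutions for that stage, which this sketch does not cover.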

Title: ReFormer: Generating Radio Fakes for Data Augmentation

Authors: Yagna Kaasaragadda, Silvija Kokalj-Filipovic
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.00282
Pdf URL: https://arxiv.org/pdf/2501.00282
Copy Paste: [[2501.00282]] ReFormer: Generating Radio Fakes for Data Augmentation(https://arxiv.org/abs/2501.00282)
Keywords: generation, generative
Abstract: We present ReFormer, a generative AI (GAI) model that can efficiently generate synthetic radio-frequency (RF) data, or RF fakes, statistically similar to the data it was trained on, or with modified statistics, in order to augment datasets collected in real-world experiments. For applications like this, adaptability and scalability are important issues. This is why ReFormer leverages transformer-based autoregressive generation, trained on learned discrete representations of RF signals. By using prompts, such GAI can be made to generate the data which complies with specific constraints or conditions, particularly useful for training channel estimation and modeling. It may also leverage the data from a source system to generate training data for a target system. We show how different transformer architectures and other design choices affect the quality of generated RF fakes, evaluated using metrics such as precision and recall, classification accuracy and signal constellation diagrams.

Title: Dual Diffusion for Unified Image Generation and Understanding

Authors: Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00289
Pdf URL: https://arxiv.org/pdf/2501.00289
Copy Paste: [[2501.00289]] Dual Diffusion for Unified Image Generation and Understanding(https://arxiv.org/abs/2501.00289)
Keywords: generation
Abstract: Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attained competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.

Title: Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

Authors: Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00375
Pdf URL: https://arxiv.org/pdf/2501.00375
Copy Paste: [[2501.00375]] Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free(https://arxiv.org/abs/2501.00375)
Keywords: generation, generative
Abstract: Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamic tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9$\times$ speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7$\times$ acceleration coupled with a notable FID reduction of 2.17.
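One way to picture pruning tokens "with lower dynamics" is to score each token by how much its feature changed between two adjacent timesteps and keep only the highest-scoring fraction. The sketch below does exactly that. The L2 distance as the dynamics measure, the `keep_ratio` parameter, and the function names are assumptions, since the abstract does not specify how DaTo measures dynamics.

```python
import math
from typing import List, Sequence


def token_dynamics(prev: Sequence[Sequence[float]], curr: Sequence[Sequence[float]]) -> List[float]:
    """Per-token L2 distance between features at two adjacent timesteps (assumed measure)."""
    return [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(p_tok, c_tok)))
        for p_tok, c_tok in zip(prev, curr)
    ]


def select_high_dynamic_tokens(
    prev: Sequence[Sequence[float]],
    curr: Sequence[Sequence[float]],
    keep_ratio: float,
) -> List[int]:
    """Return sorted indices of the tokens that changed the most; the rest would reuse cached features."""
    scores = token_dynamics(prev, curr)
    n_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

Only the returned indices would take part in self-attention at the current step, so attention cost falls roughly with the square of `keep_ratio` in this simplified picture.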

Title: Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Authors: Martin Pawelczyk, Lillian Sun, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00418
Pdf URL: https://arxiv.org/pdf/2501.00418
Copy Paste: [[2501.00418]] Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models(https://arxiv.org/abs/2501.00418)
Keywords: generative
Abstract: The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining if a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model's outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of weak-to-strong generalization.

Title: Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Authors: Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Subjects: cs.CV, cs.CL, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2501.00437
Pdf URL: https://arxiv.org/pdf/2501.00437
Copy Paste: [[2501.00437]] Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning(https://arxiv.org/abs/2501.00437)
Keywords: generation
Abstract: Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, yielding our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLM-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at this https URL.
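The idea of a CLIP-weighted cross-entropy loss can be illustrated by scaling each sample's loss with its image-text similarity score, so that well-aligned synthetic pairs contribute more. The sketch below is one plausible form and not the paper's formula: normalizing by the sum of weights, treating scores as non-negative, and the function names are all assumptions.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])


def clip_weighted_cross_entropy(
    batch_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    clip_scores: Sequence[float],
) -> float:
    """Weighted mean of per-sample losses, with non-negative similarity scores as weights."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    total_weight = sum(clip_scores)
    return sum(w * loss for w, loss in zip(clip_scores, losses)) / total_weight
```

With equal scores this reduces to the ordinary mean cross-entropy, and a pair with a score near zero is effectively ignored.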
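The CLIP-weighted cross-entropy idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the simplest reading: each caption's cross-entropy is scaled by a non-negative CLIP image-text similarity and the result is normalized by the sum of the weights. The function name and the example scores are hypothetical, not taken from PCM-Net.

```python
import math


def clip_weighted_cross_entropy(token_log_probs, clip_scores):
    """Weighted average of per-caption cross-entropy.

    token_log_probs: one list per caption, holding the log-probabilities the
        decoder assigned to the ground-truth tokens of that caption.
    clip_scores: one non-negative image-text similarity per caption
        (assumed weighting; the paper's exact scheme may differ).
    """
    if len(token_log_probs) != len(clip_scores):
        raise ValueError("need exactly one CLIP score per caption")
    total_weight = sum(clip_scores)
    if total_weight <= 0:
        raise ValueError("CLIP scores must sum to a positive value")

    loss = 0.0
    for log_probs, weight in zip(token_log_probs, clip_scores):
        # Plain cross-entropy for one caption: mean negative log-likelihood.
        caption_ce = -sum(log_probs) / len(log_probs)
        # Well-aligned pairs (high CLIP score) contribute more to the loss.
        loss += weight * caption_ce
    return loss / total_weight


# Hypothetical batch: the first pair is well aligned, the second is not.
batch = [[math.log(0.5), math.log(0.5)], [math.log(0.25)]]
weighted = clip_weighted_cross_entropy(batch, [3.0, 1.0])
uniform = clip_weighted_cross_entropy(batch, [1.0, 1.0])
```

With uniform weights the function reduces to the ordinary mean cross-entropy over captions, which makes the effect of the weighting easy to isolate.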

Title: SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training

Authors: Lu Zhang, Liang Zeng
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2501.00463
Pdf URL: https://arxiv.org/pdf/2501.00463
Copy Paste: [[2501.00463]] SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training(https://arxiv.org/abs/2501.00463)
Keywords: generation
Abstract: The proliferation of AI-generated images necessitates effective watermarking to protect intellectual property and identify fake content. While existing training-based watermarking methods show promise, they often struggle with generalization across diverse prompts and tend to produce noticeable artifacts. To this end, we introduce a provably generalizable image watermarking method for Latent Diffusion Models with Self-Augmented Training (SAT-LDM), which aligns the training and testing phases by a free generation distribution to bolster the watermarking module's generalization capabilities. We theoretically consolidate our method by proving that the free generation distribution contributes to its tight generalization bound without the need to collect new data. Extensive experimental results show that SAT-LDM achieves robust watermarking while significantly improving the quality of watermarked images across diverse prompts. Furthermore, we conduct experimental analyses to demonstrate the strong generalization abilities of SAT-LDM. We hope our method offers a practical and convenient solution for securing high-fidelity AI-generated content.

Title: Dementia Detection using Multi-modal Methods on Audio Data

Authors: Saugat Kannojia, Anirudh Praveen, Danish Vasdev, Saket Nandedkar, Divyansh Mittal, Sarthak Kalankar, Shaurya Johari, Vipul Arora
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00465
Pdf URL: https://arxiv.org/pdf/2501.00465
Copy Paste: [[2501.00465]] Dementia Detection using Multi-modal Methods on Audio Data(https://arxiv.org/abs/2501.00465)
Keywords: generative
Abstract: Dementia is a common neurodegenerative disease that causes gradual cognitive impairment and is the subject of extensive research each year into its prevention and cure. It severely impairs the patient's ability to remember events and communicate clearly. Most forms of dementia have no known cure, but early detection can help alleviate symptoms before they worsen. One of the main symptoms of dementia is difficulty in expressing ideas through speech. This paper describes a model developed to predict the onset of the disease using audio recordings from patients. An ASR-based model was developed that generates transcripts from the audio files using the Whisper model and then applies a RoBERTa regression model to generate an MMSE score for the patient. This score can be used to predict the extent to which the cognitive ability of a patient has been affected. We use the PROCESS_V1 dataset for this task, which is introduced through the PROCESS Grand Challenge 2025. The model achieved an RMSE of 2.6911, which is around 10 percent lower than the described baseline.
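For readers unfamiliar with the metric, the following sketch shows how an RMSE over predicted MMSE scores is computed, and what baseline value the "around 10 percent lower" statement would imply. The baseline figure is not stated in the abstract; the value derived here is only a rough back-calculation under the assumption that the reduction is exactly 10 percent. The sample scores are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def implied_baseline(model_rmse, relative_reduction):
    """Baseline RMSE implied by a model RMSE and a relative reduction."""
    return model_rmse / (1.0 - relative_reduction)


# Made-up MMSE predictions and reference scores, for illustration only.
example_error = rmse([27.0, 22.0, 18.0], [29.0, 21.0, 18.0])

# Assumes the reduction is exactly 10 percent; the abstract says "around".
approx_baseline = implied_baseline(2.6911, 0.10)
print(round(approx_baseline, 2))  # 2.99
```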

Title: Probing Visual Language Priors in VLMs

Authors: Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00569
Pdf URL: https://arxiv.org/pdf/2501.00569
Copy Paste: [[2501.00569]] Probing Visual Language Priors in VLMs(https://arxiv.org/abs/2501.00569)
Keywords: generative
Abstract: Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual language priors present in their training data rather than true visual reasoning. To examine the situation, we introduce ViLP, a visual question answering (VQA) benchmark that pairs each question with three potential answers and three corresponding images: one image whose answer can be inferred from text alone, and two images that demand visual reasoning. By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts, making our benchmark images distinctly out-of-distribution. While humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on actual visual inputs and have demonstrated their effectiveness in enhancing the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.

Title: Unbiased GNN Learning via Fairness-Aware Subgraph Diffusion

Authors: Abdullah Alchihabi, Yuhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00595
Pdf URL: https://arxiv.org/pdf/2501.00595
Copy Paste: [[2501.00595]] Unbiased GNN Learning via Fairness-Aware Subgraph Diffusion(https://arxiv.org/abs/2501.00595)
Keywords: generative
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable efficacy in tackling a wide array of graph-related tasks across diverse domains. However, a significant challenge lies in their propensity to generate biased predictions, particularly with respect to sensitive node attributes such as age and gender. These biases, inherent in many machine learning models, are amplified in GNNs due to the message-passing mechanism, which allows nodes to influence each other, rendering the task of making fair predictions notably challenging. This issue is particularly pertinent in critical domains where model fairness holds paramount importance. In this paper, we propose a novel generative Fairness-Aware Subgraph Diffusion (FASD) method for unbiased GNN learning. The method begins by strategically sampling small subgraphs from the original large input graph, and then conducts subgraph debiasing via generative fairness-aware graph diffusion processes based on stochastic differential equations (SDEs). To effectively diffuse unfairness in the input data, we introduce additional adversary bias perturbations to the subgraphs during the forward diffusion process, and train score-based models to predict these applied perturbations, enabling them to learn the underlying dynamics of the biases present in the data. Subsequently, the trained score-based models are utilized to further debias the original subgraph samples through the reverse diffusion process. Finally, FASD induces fair node predictions on the input graph by performing standard GNN learning on the debiased subgraphs. Experimental results demonstrate the superior performance of the proposed method over state-of-the-art Fair GNN baselines across multiple benchmark datasets.

Title: DreamDrive: Generative 4D Scene Modeling from Street View Images

Authors: Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2501.00601
Pdf URL: https://arxiv.org/pdf/2501.00601
Copy Paste: [[2501.00601]] DreamDrive: Generative 4D Scene Modeling from Street View Images(https://arxiv.org/abs/2501.00601)
Keywords: generation, generative
Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.

Title: DiC: Rethinking Conv3x3 Designs in Diffusion Models

Authors: Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00603
Pdf URL: https://arxiv.org/pdf/2501.00603
Copy Paste: [[2501.00603]] DiC: Rethinking Conv3x3 Designs in Diffusion Models(https://arxiv.org/abs/2501.00603)
Keywords: generation
Abstract: Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still underperforms our expectations. To further improve the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: this https URL

Title: SoundBrush: Sound as a Brush for Visual Scene Editing

Authors: Kim Sung-Bin, Kim Jun-Seong, Junseok Ko, Yewon Kim, Tae-Hyun Oh
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2501.00645
Pdf URL: https://arxiv.org/pdf/2501.00645
Copy Paste: [[2501.00645]] SoundBrush: Sound as a Brush for Visual Scene Editing(https://arxiv.org/abs/2501.00645)
Keywords: generative
Abstract: We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at this https URL.

Title: Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models

Authors: Suttisak Wizadwongsa, Jinfan Zhou, Edward Li, Jeong Joon Park
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00651
Pdf URL: https://arxiv.org/pdf/2501.00651
Copy Paste: [[2501.00651]] Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models(https://arxiv.org/abs/2501.00651)
Keywords: generation, generative
Abstract: Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.

Title: Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection

Authors: Hao Wang, Cheng Deng, Zhidong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00700
Pdf URL: https://arxiv.org/pdf/2501.00700
Copy Paste: [[2501.00700]] Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection(https://arxiv.org/abs/2501.00700)
Keywords: generative
Abstract: Recent generative models demonstrate impressive performance in synthesizing photographic images, making it hard for humans to distinguish them from pristine ones, especially for realistic-looking synthetic facial images. Previous works mostly focus on mining discriminative artifacts from vast amounts of visual data. However, they usually lack the exploration of prior knowledge and rarely pay attention to the domain shift between training categories (e.g., natural and indoor objects) and testing ones (e.g., fine-grained human facial images), resulting in unsatisfactory detection performance. To address these issues, we propose a novel knowledge-guided prompt learning method for deepfake facial image detection. Specifically, we retrieve forgery-related prompts from large language models as expert knowledge to guide the optimization of learnable prompts. In addition, we carefully design test-time prompt tuning to alleviate the domain shift, achieving significant performance improvement and boosting the application in real-world scenarios. Extensive experiments on the DeepFakeFaceForensics dataset show that our proposed approach notably outperforms state-of-the-art methods.

Title: RORem: Training a Robust Object Remover with Human-in-the-Loop

Authors: Ruibin Li, Tao Yang, Song Guo, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00740
Pdf URL: https://arxiv.org/pdf/2501.00740
Copy Paste: [[2501.00740]] RORem: Training a Robust Object Remover with Human-in-the-Loop(https://arxiv.org/abs/2501.00740)
Keywords: generation
Abstract: Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code and trained model are available at this https URL.

Title: Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Authors: Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00752
Pdf URL: https://arxiv.org/pdf/2501.00752
Copy Paste: [[2501.00752]] Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation(https://arxiv.org/abs/2501.00752)
Keywords: generation
Abstract: We propose Foreground-Covering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for FSS. Our official code is available at this https URL

Title: Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation

Authors: Mingjia Li, Shuang Li, Tongrui Su, Longhui Yuan, Jian Liang, Wei Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00873
Pdf URL: https://arxiv.org/pdf/2501.00873
Copy Paste: [[2501.00873]] Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation(https://arxiv.org/abs/2501.00873)
Keywords: generative
Abstract: Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at this https URL.

Title: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

Authors: Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00880
Pdf URL: https://arxiv.org/pdf/2501.00880
Copy Paste: [[2501.00880]] Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction(https://arxiv.org/abs/2501.00880)
Keywords: generation
Abstract: Employing LLMs for visual generation has recently become a research focus. However, the existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation.
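The cluster-oriented loss described above can be sketched as follows. The abstract does not state the formula, so this assumes one natural reading: the probability of a cluster is the sum of the softmax probabilities of its member tokens, and the loss is the negative log of the probability of the target token's cluster. The function names and the toy codebook layout are hypothetical, and how IAR combines this term with the usual token-level loss is not specified in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits, target_token):
    """Standard cross-entropy on the exact token index."""
    return -math.log(softmax(logits)[target_token])


def cluster_cross_entropy(logits, token_to_cluster, target_token):
    """Cross-entropy on the cluster that contains the target token.

    token_to_cluster[i] is the cluster id of codebook entry i (assumed to
    come from a prior clustering step such as balanced k-means).
    """
    probs = softmax(logits)
    target_cluster = token_to_cluster[target_token]
    # Probability mass the model places anywhere inside the right cluster.
    cluster_prob = sum(
        p for p, c in zip(probs, token_to_cluster) if c == target_cluster
    )
    return -math.log(cluster_prob)


# Toy codebook of four tokens split into two clusters of two.
logits = [0.0, 0.0, 0.0, 0.0]
clusters = [0, 0, 1, 1]
cluster_loss = cluster_cross_entropy(logits, clusters, target_token=0)
token_loss = token_cross_entropy(logits, target_token=0)
```

Because the cluster probability is never smaller than the probability of the single target token, the cluster term is never larger than the token-level loss. A prediction that lands on a neighbouring token in the right cluster is therefore penalized less than one that lands in a different cluster.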

Title: Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts

Authors: Zhuohua Li, Maoli Liu, Xiangxiang Dai, John C.S. Lui
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2501.00891
Pdf URL: https://arxiv.org/pdf/2501.00891
Copy Paste: [[2501.00891]] Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts(https://arxiv.org/abs/2501.00891)
Keywords: generation
Abstract: The contextual multi-armed bandit (MAB) problem is crucial in sequential decision-making. A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters, utilizing shared features to improve learning efficiency. However, existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters. As a result, their theoretical analyses require several strong assumptions about the "diversity" of contexts generated by the environment, leading to impractical settings, complicated analyses, and poor practical performance. Removing these assumptions has been a long-standing open problem in the clustering of bandits literature. In this paper, we provide two solutions to this open problem. First, following the i.i.d. context generation setting in existing studies, we propose two novel algorithms, UniCLUB and PhaseUniCLUB, which incorporate enhanced exploration mechanisms to accelerate cluster identification. Remarkably, our algorithms require substantially weaker assumptions while achieving regret bounds comparable to prior work. Second, inspired by the smoothed analysis framework, we propose a more practical setting that eliminates the requirement for i.i.d. context generation used in previous studies, thus enhancing the performance of existing algorithms for online clustering of bandits. Our technique can be applied to both graph-based and set-based clustering of bandits frameworks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our proposed algorithms consistently outperform existing approaches.

Title: Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model

Authors: Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00895
Pdf URL: https://arxiv.org/pdf/2501.00895
Copy Paste: [[2501.00895]] Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model(https://arxiv.org/abs/2501.00895)
Keywords: generation, generative
Abstract: Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types. On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA. The project page is at this https URL.

Title: Population Aware Diffusion for Time Series Generation

Authors: Yang Li, Han Meng, Zhenyu Bi, Ingolv T. Urnes, Haipeng Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.00910
Pdf URL: https://arxiv.org/pdf/2501.00910
Copy Paste: [[2501.00910]] Population Aware Diffusion for Time Series Generation(https://arxiv.org/abs/2501.00910)
Keywords: generation
Abstract: Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.
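The population-level property described above can be illustrated with a minimal sketch. It assumes that the cross-correlation (CC) between two dimensions is measured as one Pearson correlation per time series; the abstract does not define the paper's actual CC shift score, so the function names `pearson` and `cc_values` and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def cc_values(dataset, dim_a, dim_b):
    """Return one correlation value per series.

    Each series is a list of time steps; each step is a tuple of features.
    """
    return [
        pearson([step[dim_a] for step in series],
                [step[dim_b] for step in series])
        for series in dataset
    ]

# Toy "dataset" of two series with two dimensions each (illustrative only).
real = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],  # dimensions move together
    [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)],  # dimensions move oppositely
]
print(cc_values(real, 0, 1))
```

Comparing the collection of such values for real data against the collection for generated data is one way to check whether the population-level CC distribution has shifted.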

Title: AutoPresent: Designing Structured Visuals from Scratch

Authors: Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.00912
Pdf URL: https://arxiv.org/pdf/2501.00912
Copy Paste: [[2501.00912]] AutoPresent: Designing Structured Visuals from Scratch(https://arxiv.org/abs/2501.00912)
Keywords: generation
Abstract: Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based to measure similarity to a target slide, and (ii) reference-free to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Built on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k pairs of instructions and slide-generation code, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked to refine its own output, and find that this process improves the slide's quality. We hope that our work will provide a basis for future work on generating structured visuals.

Title: Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

Authors: Emily Johnson, Noah Wilson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.00917
Pdf URL: https://arxiv.org/pdf/2501.00917
Copy Paste: [[2501.00917]] Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models(https://arxiv.org/abs/2501.00917)
Keywords: generation, generative
Abstract: Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments conducted on MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in terms of image quality, semantic alignment, and text rendering accuracy. Human evaluations further validate the superior performance of VLAD, making it a promising approach for text-to-image generation in complex scenarios.

Title: A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset

Authors: Junhuan Yang, Yuzhou Zhang, Yi Sheng, Youzuo Lin, Lei Yang
Subjects: cs.LG, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2501.00941
Pdf URL: https://arxiv.org/pdf/2501.00941
Copy Paste: [[2501.00941]] A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset(https://arxiv.org/abs/2501.00941)
Keywords: generation, generative
Abstract: Recently, the advent of generative AI technologies has made transformational impacts on our daily lives, yet its use in scientific applications remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is needed instead of single-modal data generation, which is usually used in natural images (e.g., faces, scenery). Moreover, in real-world applications, the available data are commonly imbalanced across modalities; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveform data are largely lacking. While the most recent efforts enable the powerful diffusion model to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present "UB-Diff", a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which can ensure pairwise data is obtained from a co-latent representation. Then, the co-latent representation will be used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.

Title: Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion

Authors: Hao Wang, Xiwen Chen, Ashish Bastola, Jiayou Qin, Abolfazl Razi
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.00944
Pdf URL: https://arxiv.org/pdf/2501.00944
Copy Paste: [[2501.00944]] Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion(https://arxiv.org/abs/2501.00944)
Keywords: generative
Abstract: The emergence of generative AI and controllable diffusion has made image-to-image synthesis increasingly practical and efficient. However, when input images are sparse and exhibit low entropy, the inherent characteristics of diffusion models often result in limited diversity. This constraint significantly interferes with data augmentation. To address this, we propose Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. We find that a small amount of artificial noise significantly assists the image-denoising process. To demonstrate this novel mask-to-image concept, we use nano-dendritic patterns as an example to show the merit of our method compared to existing controllable diffusion models. Furthermore, we extend the proposed framework to other biological patterns, highlighting its potential applications across various fields.

Title: OASIS Uncovers: High-Quality T2I Models, Same Old Stereotypes

Authors: Sepehr Dehdashtian, Gautam Sreekumar, Vishnu Naresh Boddeti
Subjects: cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2501.00962
Pdf URL: https://arxiv.org/pdf/2501.00962
Copy Paste: [[2501.00962]] OASIS Uncovers: High-Quality T2I Models, Same Old Stereotypes(https://arxiv.org/abs/2501.00962)
Keywords: generation
Abstract: Images generated by text-to-image (T2I) models often exhibit visual biases and stereotypes of concepts such as culture and profession. Existing quantitative measures of stereotypes are based on statistical parity that does not align with the sociological definition of stereotypes and, therefore, incorrectly categorizes biases as stereotypes. Instead of oversimplifying stereotypes as biases, we propose a quantitative measure of stereotypes that aligns with its sociological definition. We then propose OASIS to measure the stereotypes in a generated dataset and understand their origins within the T2I model. OASIS includes two scores to measure stereotypes from a generated image dataset: (M1) Stereotype Score to measure the distributional violation of stereotypical attributes, and (M2) WALS to measure spectral variance in the images along a stereotypical attribute. OASIS also includes two methods to understand the origins of stereotypes in T2I models: (U1) StOP to discover attributes that the T2I model internally associates with a given concept, and (U2) SPI to quantify the emergence of stereotypical attributes in the latent space of the T2I model during image generation. Despite the considerable progress in image fidelity, using OASIS, we conclude that newer T2I models such as FLUX.1 and SDv3 contain strong stereotypical predispositions about concepts and still generate images with widespread stereotypical attributes. Additionally, the quantity of stereotypes worsens for nationalities with lower Internet footprints.

Title: Optimizing Noise Schedules of Generative Models in High Dimensions

Authors: Santiago Aranguri, Giulio Biroli, Marc Mezard, Eric Vanden-Eijnden
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.00988
Pdf URL: https://arxiv.org/pdf/2501.00988
Copy Paste: [[2501.00988]] Optimizing Noise Schedules of Generative Models in High Dimensions(https://arxiv.org/abs/2501.00988)
Keywords: generative
Abstract: Recent works have shown that diffusion models can undergo phase transitions, the resolution of which is needed for accurately generating samples. This has motivated the use of different noise schedules, the two most common choices being referred to as variance preserving (VP) and variance exploding (VE). Here we revisit these schedules within the framework of stochastic interpolants. Using the Gaussian Mixture (GM) and Curie-Weiss (CW) data distributions as test case models, we first investigate the effect of the variance of the initial noise distribution and show that VP recovers the low-level feature (the distribution of each mode) but misses the high-level feature (the asymmetry between modes), whereas VE performs oppositely. We also show that this dichotomy, which happens when denoising by a constant amount in each step, can be avoided by using noise schedules specific to VP and VE that allow for the recovery of both high- and low-level features. Finally we show that these schedules yield generative models for the GM and CW model whose probability flow ODE can be discretized using $\Theta_d(1)$ steps in dimension $d$ instead of the $\Theta_d(\sqrt{d})$ steps required by constant denoising.
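For readers unfamiliar with the two schedule families named above, the following minimal sketch shows where the names come from. It uses the common textbook forward-noising forms, $x_t = \alpha x_0 + \sqrt{1-\alpha^2}\,\varepsilon$ for VP and $x_t = x_0 + \sigma\,\varepsilon$ for VE, which are an assumption here and not necessarily the stochastic-interpolant parameterization used in the paper; the function names are hypothetical.

```python
import math

def vp_marginal_variance(data_var, alpha):
    """Variance of x_t = alpha * x_0 + sqrt(1 - alpha^2) * eps, with eps ~ N(0, 1).

    For unit-variance data the result stays at 1 for every alpha in [0, 1].
    """
    return alpha ** 2 * data_var + (1.0 - alpha ** 2)

def ve_marginal_variance(data_var, sigma):
    """Variance of x_t = x_0 + sigma * eps, with eps ~ N(0, 1).

    The result grows without bound as sigma increases.
    """
    return data_var + sigma ** 2

for alpha in (1.0, 0.7, 0.3, 0.0):
    print("VP", alpha, vp_marginal_variance(1.0, alpha))
for sigma in (0.0, 1.0, 3.0, 10.0):
    print("VE", sigma, ve_marginal_variance(1.0, sigma))
```

Under these assumed forms, VP keeps the total variance fixed while trading signal for noise, whereas VE leaves the signal untouched and lets the noise variance grow.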

Title: State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects

Authors: Harshika Goyal, Mohammad Saif Wajid, Mohd Anas Wajid, Akib Mohi Ud Din Khanday, Mehdi Neshat, Amir Gandomi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01029
Pdf URL: https://arxiv.org/pdf/2501.01029
Copy Paste: [[2501.01029]] State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects(https://arxiv.org/abs/2501.01029)
Keywords: generation, generative
Abstract: The rapid advancement of deepfake technologies, specifically designed to create incredibly lifelike facial imagery and video content, has ignited a remarkable level of interest and curiosity across many fields, including forensic analysis, cybersecurity and the innovative creation of digital characters. By harnessing the latest breakthroughs in deep learning methods, such as Generative Adversarial Networks, Variational Autoencoders, Few-Shot Learning Strategies, and Transformers, the outcomes achieved in generating deepfakes have been nothing short of astounding and transformative. Also, the ongoing evolution of detection technologies is being developed to counteract the potential for misuse associated with deepfakes, effectively addressing critical concerns that range from political manipulation to the dissemination of fake news and the ever-growing issue of cyberbullying. This comprehensive review paper meticulously investigates the most recent developments in deepfake generation and detection, including around 400 publications, providing an in-depth analysis of the cutting-edge innovations shaping this rapidly evolving landscape. Starting with a thorough examination of systematic literature review methodologies, we embark on a journey that delves into the complex technical intricacies inherent in the various techniques used for deepfake generation, comprehensively addressing the challenges faced, potential solutions available, and the nuanced details surrounding manipulation formulations. Subsequently, the paper is dedicated to accurately benchmarking leading approaches against prominent datasets, offering thorough assessments of the contributions that have significantly impacted these vital domains. Ultimately, we engage in a thoughtful discussion of the existing challenges, paving the way for continuous advancements in this critical and ever-dynamic study area.

Title: Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras

Authors: Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01040
Pdf URL: https://arxiv.org/pdf/2501.01040
Copy Paste: [[2501.01040]] Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras(https://arxiv.org/abs/2501.01040)
Keywords: generation
Abstract: Dynamic vision sensors (DVS) are bio-inspired devices that capture visual information in the form of asynchronous events, which encode changes in pixel intensity with high temporal resolution and low latency. These events provide rich motion cues that can be exploited for various computer vision tasks, such as action recognition. However, most existing DVS-based action recognition methods lose temporal information during data transformation or suffer from noise and outliers caused by sensor imperfections or environmental factors. To address these challenges, we propose a novel framework that preserves and exploits the spatiotemporal structure of event data for action recognition. Our framework consists of two main components: 1) a point-wise event masked autoencoder (MAE) that learns a compact and discriminative representation of event patches by reconstructing them from masked raw event camera points data; 2) an improved event points patch generation algorithm that leverages an event data inlier model and point-wise data augmentation techniques to enhance the quality and diversity of event points patches. To the best of our knowledge, our approach introduces the pre-train method into event camera raw points data for the first time, and we propose a novel event points patch embedding to utilize transformer-based models on event cameras.

Title: Enhancing Precision of Automated Teller Machines Network Quality Assessment: Machine Learning and Multi Classifier Fusion Approaches

Authors: Alireza Safarzadeh, Mohammad Reza Jamali, Behzad Moshiri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01067
Pdf URL: https://arxiv.org/pdf/2501.01067
Copy Paste: [[2501.01067]] Enhancing Precision of Automated Teller Machines Network Quality Assessment: Machine Learning and Multi Classifier Fusion Approaches(https://arxiv.org/abs/2501.01067)
Keywords: quality assessment
Abstract: Ensuring reliable ATM services is essential for modern banking, directly impacting customer satisfaction and the operational efficiency of financial institutions. This study introduces a data fusion approach that utilizes multi-classifier fusion techniques, with a special focus on the Stacking Classifier, to enhance the reliability of ATM networks. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied, enabling balanced learning for both frequent and rare events. The proposed framework integrates diverse classification models - Random Forest, LightGBM, and CatBoost - within a Stacking Classifier, achieving a dramatic reduction in false alarms from 3.56 percent to just 0.71 percent, along with an outstanding overall accuracy of 99.29 percent. This multi-classifier fusion method synthesizes the strengths of individual models, leading to significant cost savings and improved operational decision-making. By demonstrating the power of machine learning and data fusion in optimizing ATM status detection, this research provides practical and scalable solutions for financial institutions aiming to enhance their ATM network performance and customer satisfaction.
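The oversampling step mentioned above can be sketched as follows. This is a simplified illustration of the core SMOTE idea (a synthetic minority sample is placed at a random point on the segment between a minority sample and a nearby minority neighbor), assuming a single nearest neighbor instead of the usual k nearest neighbors; it is not the study's pipeline, and the helper names are hypothetical.

```python
import random

def interpolate(x, neighbor, rng):
    """Return a point on the segment from x to neighbor at a random fraction u in [0, 1)."""
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(x, neighbor)]

def nearest_neighbor(x, candidates):
    """Nearest candidate to x by squared Euclidean distance."""
    return min(candidates, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))

def oversample(minority, n_new, rng):
    """Create n_new synthetic minority samples (needs at least two minority points)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        synthetic.append(interpolate(x, nearest_neighbor(x, others), rng))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[0.0, 0.0], [2.0, 4.0], [10.0, 10.0]]
new_points = oversample(minority, 5, rng)
print(len(new_points))
```

The synthetic points are added to the rare class only in the training split, so the classifiers see a more balanced label distribution during fitting.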

Title: Graph Generative Pre-trained Transformer

Authors: Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01073
Pdf URL: https://arxiv.org/pdf/2501.01073
Copy Paste: [[2501.01073]] Graph Generative Pre-trained Transformer(https://arxiv.org/abs/2501.01073)
Keywords: generation, generative
Abstract: Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node set and edge set. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.
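The idea of representing a graph as a sequence of its node set followed by its edge set can be sketched as below. The token vocabulary (`<nodes>`, `<edges>`, `<eos>`, and `n<id>` node tokens) is a hypothetical choice made for illustration; the abstract does not specify the actual G2PT tokenization.

```python
def graph_to_tokens(nodes, edges):
    """Serialize a graph as: node set first, then edge set.

    nodes: list of (node_id, label) pairs; edges: list of (src_id, dst_id) pairs.
    """
    tokens = ["<nodes>"]
    for node_id, label in nodes:
        tokens += [f"n{node_id}", label]
    tokens.append("<edges>")
    for src, dst in edges:
        tokens += [f"n{src}", f"n{dst}"]
    tokens.append("<eos>")
    return tokens

def tokens_to_graph(tokens):
    """Inverse of graph_to_tokens; assumes labels never equal the marker tokens."""
    split = tokens.index("<edges>")
    node_part = tokens[1:split]
    edge_part = tokens[split + 1:-1]
    nodes = [(int(node_part[i][1:]), node_part[i + 1])
             for i in range(0, len(node_part), 2)]
    edges = [(int(edge_part[i][1:]), int(edge_part[i + 1][1:]))
             for i in range(0, len(edge_part), 2)]
    return nodes, edges

# A three-atom toy molecule: C - O - C (illustrative only).
nodes = [(0, "C"), (1, "O"), (2, "C")]
edges = [(0, 1), (1, 2)]
tokens = graph_to_tokens(nodes, edges)
print(tokens)
```

A sequence of this kind has length proportional to the number of nodes plus edges, rather than quadratic in the number of nodes as with an adjacency matrix, and it can be fed to an autoregressive model trained with next-token prediction.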

Title: EliGen: Entity-Level Controlled Image Generation with Regional Attention

Authors: Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01097
Pdf URL: https://arxiv.org/pdf/2501.01097
Copy Paste: [[2501.01097]] EliGen: Entity-Level Controlled Image Generation with Regional Attention(https://arxiv.org/abs/2501.01097)
Keywords: generation
Abstract: Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.

Title: AIM: Additional Image Guided Generation of Transferable Adversarial Attacks

Authors: Teng Li, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01106
Pdf URL: https://arxiv.org/pdf/2501.01106
Copy Paste: [[2501.01106]] AIM: Additional Image Guided Generation of Transferable Adversarial Attacks(https://arxiv.org/abs/2501.01106)
Keywords: generation, generative
Abstract: Transferable adversarial examples highlight the vulnerability of deep neural networks (DNNs) to imperceptible perturbations across various real-world applications. While there have been notable advancements in untargeted transferable attacks, targeted transferable attacks remain a significant challenge. In this work, we focus on generative approaches for targeted transferable attacks. Current generative attacks focus on reducing overfitting to surrogate models and the source data domain, but they often overlook the importance of enhancing transferability through additional semantics. To address this issue, we introduce a novel plug-and-play module into the general generator architecture to enhance adversarial transferability. Specifically, we propose a Semantic Injection Module (SIM) that utilizes the semantics contained in an additional guiding image to improve transferability. The guiding image provides a simple yet effective method to incorporate target semantics from the target class to create targeted and highly transferable attacks. Additionally, we propose new loss formulations that can integrate the semantic injection module more effectively for both targeted and untargeted attacks. We conduct comprehensive experiments under both targeted and untargeted attack settings to demonstrate the efficacy of our proposed approach.

Title: BatStyler: Advancing Multi-category Style Generation for Source-free Domain Generalization

Authors: Xiusheng Xu, Lei Qi, Jingyang Zhou, Xin Geng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01109
Pdf URL: https://arxiv.org/pdf/2501.01109
Copy Paste: [[2501.01109]] BatStyler: Advancing Multi-category Style Generation for Source-free Domain Generalization(https://arxiv.org/abs/2501.01109)
Keywords: generation
Abstract: Source-Free Domain Generalization (SFDG) aims to develop a model that performs on unseen domains without relying on any source domains. However, the implementation remains constrained due to the unavailability of training data. Research on SFDG focuses on knowledge transfer of multi-modal models and style synthesis based on the joint space of multiple modalities, thus eliminating the dependency on source domain images. However, existing works primarily target multi-domain and less-category configurations, while their performance on multi-domain and multi-category configurations is relatively poor. In addition, the efficiency of style synthesis also deteriorates in multi-category scenarios. How to efficiently synthesize sufficiently diverse data and apply it to multi-category configurations is a direction with greater practical value. In this paper, we propose a method called BatStyler, which is utilized to improve the capability of style synthesis in multi-category scenarios. BatStyler consists of two modules: Coarse Semantic Generation and Uniform Style Generation. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the compression of space for style diversity learning in multi-category configurations, while the Uniform Style Generation module provides a template of styles that are uniformly distributed in space and implements parallel training. Extensive experiments demonstrate that our method exhibits comparable performance on less-category datasets, while surpassing state-of-the-art methods on multi-category datasets.

Title: HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment

Authors: Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2501.01116
Pdf URL: https://arxiv.org/pdf/2501.01116
Copy Paste: [[2501.01116]] HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment(https://arxiv.org/abs/2501.01116)
Keywords: quality assessment
Abstract: Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor color or light inconsistency. To address the issue and facilitate the advancement of IHAs, we introduce the first Image Quality Assessment Database for image Harmony evaluation (HarmonyIQAD), which consists of 1,350 harmonized images generated by 9 different IHAs, and the corresponding human visual preference scores. Based on this database, we propose a Harmony Image Quality Assessment (HarmonyIQA), to predict human visual preference for harmonized images. Extensive experiments show that HarmonyIQA achieves state-of-the-art performance on human visual preference evaluation for harmonized images, and also achieves competing results on traditional IQA tasks. Furthermore, cross-dataset evaluation also shows that HarmonyIQA exhibits better generalization ability than self-supervised learning-based IQA methods. Both HarmonyIQAD and HarmonyIQA will be made publicly available upon paper publication.

Title: DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

Authors: Feng Han, Kai Chen, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01125
Pdf URL: https://arxiv.org/pdf/2501.01125
Copy Paste: [[2501.01125]] DuMo: Dual Encoder Modulation Network for Precise Concept Erasure(https://arxiv.org/abs/2501.01125)
Keywords: generation, generative
Abstract: The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model's ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module's outputs across different layers and timesteps, automatically balancing the erasure effects and model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal and Artistic Style Erasure, clearly outperforming alternative methods. Code is available at this https URL

Title: TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions

Authors: Vriksha Srihari, R. Bhavya, Shruti Jayaraman, V. Mary Anita Rajam
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01156
Pdf URL: https://arxiv.org/pdf/2501.01156
Copy Paste: [[2501.01156]] TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions(https://arxiv.org/abs/2501.01156)
Keywords: generative
Abstract: While generative models such as text-to-image, large language models and text-to-video have seen significant progress, the extension to text-to-virtual-reality remains largely unexplored, due to a deficit in training data and the complexity of achieving realistic depth and motion in virtual environments. This paper proposes an approach to coalesce existing generative systems to form a stereoscopic virtual reality video from text. Carried out in three main stages, we start with a base text-to-image model that captures context from an input text. We then employ Stable Diffusion on the rudimentary image produced, to generate frames with enhanced realism and overall quality. These frames are processed with depth estimation algorithms to create left-eye and right-eye views, which are stitched side-by-side to create an immersive viewing experience. Such systems would be highly beneficial in virtual reality production, since filming and scene building often require extensive hours of work and post-production effort. We utilize image evaluation techniques, specifically Fréchet Inception Distance and CLIP Score, to assess the visual quality of frames produced for the video. These quantitative measures establish the proficiency of the proposed method. Our work highlights the exciting possibilities of using natural language-driven graphics in fields like virtual reality simulations.
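The final stage described above (left-eye and right-eye views derived from depth, then stitched side-by-side) can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function names `shift_row` and `stereo_pair` are hypothetical, and it assumes a per-pixel integer disparity map has already been computed from the estimated depth (larger values for nearer pixels).

```python
def shift_row(row, disparity_row, direction):
    """Shift each pixel horizontally by direction * disparity (hypothetical helper)."""
    width = len(row)
    out = [None] * width
    for x in range(width):
        nx = x + direction * disparity_row[x]
        if 0 <= nx < width:
            # Later pixels overwrite earlier ones; a real renderer would
            # resolve such collisions by depth instead.
            out[nx] = row[x]
    # Fill holes left by the shift with the pixel to the left, or with the
    # original pixel when the hole is in the first column.
    for x in range(width):
        if out[x] is None:
            out[x] = out[x - 1] if x > 0 else row[x]
    return out


def stereo_pair(image, disparity):
    """Build left/right views and stitch them side-by-side, row by row.

    image and disparity are lists of rows of equal shape (an assumption).
    """
    left = [shift_row(r, d, +1) for r, d in zip(image, disparity)]
    right = [shift_row(r, d, -1) for r, d in zip(image, disparity)]
    return [l + r for l, r in zip(left, right)]
```

With zero disparity everywhere, both halves of the stitched frame equal the input image; non-zero disparity moves pixels in opposite directions in the two views, which is what produces the depth cue.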

Title: LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge

Authors: Kyoungkook Kang, Gyujin Sim, Geonung Kim, Donguk Kim, Seungho Nam, Sunghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01197
Pdf URL: https://arxiv.org/pdf/2501.01197
Copy Paste: [[2501.01197]] LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge(https://arxiv.org/abs/2501.01197)
Keywords: generation, generative
Abstract: Layers have become indispensable tools for professional artists, allowing them to build a hierarchical structure that enables independent control over individual visual elements. In this paper, we propose LayeringDiff, a novel pipeline for the synthesis of layered images, which begins by generating a composite image using an off-the-shelf image generative model, followed by disassembling the image into its constituent foreground and background layers. By extracting layers from a composite image, rather than generating them from scratch, LayeringDiff bypasses the need for large-scale training to develop generative capabilities for individual layers. Furthermore, by utilizing a pretrained off-the-shelf generative model, our method can produce diverse contents and object scales in synthesized layers. For effective layer decomposition, we adapt a large-scale pretrained generative prior to estimate foreground and background layers. We also propose high-frequency alignment modules to refine the fine-details of the estimated layers. Our comprehensive experiments demonstrate that our approach effectively synthesizes layered images and supports various practical applications.

Title: TabTreeFormer: Tree Augmented Tabular Data Generation using Transformers

Authors: Jiayu Li, Bingyin Zhao, Zilong Zhao, Kevin Yee, Uzair Javaid, Yingjie Lao, Biplab Sikdar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.01216
Pdf URL: https://arxiv.org/pdf/2501.01216
Copy Paste: [[2501.01216]] TabTreeFormer: Tree Augmented Tabular Data Generation using Transformers(https://arxiv.org/abs/2501.01216)
Keywords: generation, generative
Abstract: Transformers have achieved remarkable success in tabular data generation. However, they lack domain-specific inductive biases which are critical to preserving the intrinsic characteristics of tabular data. Meanwhile, they suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that incorporates a tree-based model that retains tabular-specific inductive biases of non-smooth and potentially low-correlated patterns due to its discreteness and non-rotational invariance, and hence enhances the fidelity and utility of synthetic data. In addition, we devise a dual-quantization tokenizer to capture the multimodal continuous distribution and further facilitate the learning of numerical value distribution. Moreover, our proposed tokenizer reduces the vocabulary size and sequence length due to the limited dimension-wise semantic meaning and training set size of tabular data, rendering a significant model size shrink without sacrificing the capability of the transformer model. We evaluate TabTreeFormer on 10 datasets against multiple generative models on various metrics; our experimental results show that TabTreeFormer achieves superior fidelity, utility, privacy, and efficiency. Our best model yields a 40% utility improvement with 1/16 of the baseline model size.

Title: Conditional Consistency Guided Image Translation and Enhancement

Authors: A. V. Subramanyam, Amil Bhagat, Milind Jain
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01223
Pdf URL: https://arxiv.org/pdf/2501.01223
Copy Paste: [[2501.01223]] Conditional Consistency Guided Image Translation and Enhancement(https://arxiv.org/abs/2501.01223)
Keywords: generation, generative
Abstract: Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at this https URL.

Title: SVFR: A Unified Framework for Generalized Video Face Restoration

Authors: Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.01235
Pdf URL: https://arxiv.org/pdf/2501.01235
Copy Paste: [[2501.01235]] SVFR: A Unified Framework for Generalized Video Face Restoration(https://arxiv.org/abs/2501.01235)
Keywords: restoration, generative
Abstract: Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed as stable video face restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage the shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce the facial prior learning and the self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration.

Title: SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Authors: Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01320
Pdf URL: https://arxiv.org/pdf/2501.01320
Copy Paste: [[2501.01320]] SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration(https://arxiv.org/abs/2501.01320)
Keywords: restoration, generation
Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
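As a rough illustration of shifted windows with variable-sized windows near the boundary, the sketch below partitions one axis (spatial or temporal) into attention windows. The abstract does not specify the exact layout, so the function `window_bounds` and its shift convention are assumptions made purely for illustration: windows at the edges are simply allowed to be smaller than the nominal size instead of being padded.

```python
def window_bounds(length, window, shift=0):
    """Return [start, end) ranges that tile range(length).

    Interior windows have size `window`; windows touching the boundary may
    be smaller. `shift` offsets the grid, as in shifted-window attention.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    shift %= window
    bounds = []
    start = 0
    # With a non-zero shift, the first window is a partial one of size `shift`.
    end = shift if shift else window
    while start < length:
        end = min(end, length)  # clip the last window at the boundary
        bounds.append((start, end))
        start = end
        end = start + window
    return bounds
```

For a length of 10 and a window of 4, the unshifted grid gives a short final window, while a shift of 2 gives a short first window, so consecutive layers see different token groupings without any fixed-resolution constraint.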

Title: Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement

Authors: Z. Zhang, B. Liu, J. Bao, L. Chen, S. Zhu, J. Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01368
Pdf URL: https://arxiv.org/pdf/2501.01368
Copy Paste: [[2501.01368]] Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement(https://arxiv.org/abs/2501.01368)
Keywords: generation
Abstract: Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that aims at natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce their consistency during the image-generation process individually. As for the former, we target bridging the gap between the semantic condition and text prompts, as well as the gap between such condition and the attention map from diffusion models. To achieve this, we propose to first complete the prompt w.r.t. the semantic condition, and then remove the negative impact of distracting prompt words by measuring their statistics in attention maps as well as distances in word space w.r.t. this condition. To further cope with the complex geometric conditions, we introduce a geometric transform module, in which Regions of Interest (RoIs) are identified in attention maps and further used to translate category-wise latents w.r.t. the geometric condition. More importantly, we propose a diffusion-based latents-refill method to explicitly remove the impact of latents at the RoI, reducing the artifacts on generated images. Experiments on the COCO-Stuff dataset show a 30% relative boost over SOTA training-free methods on layout consistency evaluation metrics.

Title: On Unifying Video Generation and Camera Pose Estimation

Authors: Chun-Hao Paul Huang, Jae Shin Yoon, Hyeonho Jeong, Niloy Mitra, Duygu Ceylan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01409
Pdf URL: https://arxiv.org/pdf/2501.01409
Copy Paste: [[2501.01409]] On Unifying Video Generation and Camera Pose Estimation(https://arxiv.org/abs/2501.01409)
Keywords: generation
Abstract: Inspired by the emergent 3D capabilities in image generators, we explore whether video generators similarly exhibit 3D awareness. Using structure-from-motion (SfM) as a benchmark for 3D tasks, we investigate if intermediate features from OpenSora, a video generation model, can support camera pose estimation. We first examine native 3D awareness in video generation features by routing raw intermediate outputs to SfM-prediction modules like DUSt3R. Then, we explore the impact of fine-tuning on camera pose estimation to enhance 3D awareness. Results indicate that while video generator features have limited inherent 3D awareness, task-specific supervision significantly boosts their accuracy for camera pose estimation, resulting in competitive performance. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality without degrading video generation quality.

Title: Multi-Modal Video Feature Extraction for Popularity Prediction

Authors: Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01422
Pdf URL: https://arxiv.org/pdf/2501.01422
Copy Paste: [[2501.01422]] Multi-Modal Video Feature Extraction for Popularity Prediction(https://arxiv.org/abs/2501.01422)
Keywords: generation
Abstract: This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate detailed text-based video content understanding. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering based on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
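The last step, averaging the neural-network and XGBoost predictions, might look like the sketch below. It is a minimal illustration under stated assumptions: the function name `average_predictions`, the dictionary layout, and the metric keys are hypothetical, and an unweighted mean is assumed because the abstract only says the predictions are averaged.

```python
# Hypothetical keys for the four engagement metrics named in the abstract.
METRICS = ("views", "likes", "comments", "shares")


def average_predictions(nn_preds, xgb_preds):
    """Average two models' predictions per video and per metric.

    Each argument is a list with one dict per video, keyed by METRICS.
    """
    if len(nn_preds) != len(xgb_preds):
        raise ValueError("both models must score the same videos")
    return [
        {m: (a[m] + b[m]) / 2.0 for m in METRICS}
        for a, b in zip(nn_preds, xgb_preds)
    ]
```

A simple mean like this tends to reduce the variance of either model on its own; weighted blends are a common alternative, but nothing in the abstract indicates that one was used.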

Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Authors: Jingfeng Yao, Xinggang Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.01423
Pdf URL: https://arxiv.org/pdf/2501.01423
Copy Paste: [[2501.01423]] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models(https://arxiv.org/abs/2501.01423)
Keywords: generation
Abstract: Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: this https URL.

Title: Object-level Visual Prompts for Compositional Image Generation

Authors: Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2501.01424
Pdf URL: https://arxiv.org/pdf/2501.01424
Copy Paste: [[2501.01424]] Object-level Visual Prompts for Compositional Image Generation(https://arxiv.org/abs/2501.01424)
Keywords: generation
Abstract: We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
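The KV-mixed idea, with keys and values taken from two different encodings of the same visual prompt, can be sketched with plain scaled dot-product attention. This is an illustrative simplification, not the authors' implementation: the function name `kv_mixed_cross_attention` is hypothetical, learned projection matrices are omitted, and the inputs stand in for encoder outputs (keys from the small-bottleneck encoder, values from the large-bottleneck encoder).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_cross_attention(queries, keys, values):
    """Attend with keys and values that come from different encoders.

    queries: image-token vectors of dimension d.
    keys:    one d-dim vector per prompt token (layout-oriented encoding).
    values:  one vector per prompt token (appearance-oriented encoding);
             its dimension may differ from d.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must describe the same tokens")
    d = len(keys[0])
    out = []
    for q in queries:
        # Attention weights depend only on the key encoding (where to look).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The content that is read out comes from the value encoding.
        dim_v = len(values[0])
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return out
```

Because the attention weights are computed from the keys alone, the coarse encoding decides where each prompt token lands, while the output features carry the fine-grained detail stored in the values.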

Title: Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions

Authors: Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01425
Pdf URL: https://arxiv.org/pdf/2501.01425
Copy Paste: [[2501.01425]] Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions(https://arxiv.org/abs/2501.01425)
Keywords: generation
Abstract: Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive motion annotations, existing algorithms cannot simultaneously control the motions of both camera and objects, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse objects and environments and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video. To validate the effectiveness and generalization of SynFMC, we further propose a method, Free-Form Motion Control (FMC). FMC enables independent or simultaneous control of object and camera movements, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.01427
Pdf URL: https://arxiv.org/pdf/2501.01427
Copy Paste: [[2501.01427]] VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control(https://arxiv.org/abs/2501.01427)
Keywords: generation
Abstract: Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object while accurately modeling coherent motions. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance while supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweight reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.