2024-12-03

Title: LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting

Authors: Lingzheng Zhang, Lifeng Shen, Yimin Zheng, Shiyuan Piao, Ziyue Li, Fugee Tsung
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00053
Pdf URL: https://arxiv.org/pdf/2412.00053
Copy Paste: [[2412.00053]] LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting(https://arxiv.org/abs/2412.00053)
Keywords: generation
Abstract: Recent research has shown that large language models (LLMs) can be effectively used for real-world time series forecasting due to their strong natural language understanding capabilities. However, aligning time series into semantic spaces of LLMs comes with high computational costs and inference complexity, particularly for long-range time series generation. Building on recent advancements in using linear models for time series, this paper introduces an LLM-enhanced mixture of linear experts for precise and efficient time series forecasting. This approach involves developing a mixture of linear experts with multiple lookback lengths and a new multimodal fusion mechanism. The use of a mixture of linear experts is efficient due to its simplicity, while the multimodal fusion mechanism adaptively combines multiple linear experts based on the learned features of the text modality from pre-trained large language models. In experiments, we rethink the need to align time series to LLMs by existing time-series large language models and further discuss their efficiency and effectiveness in time series forecasting. Our experimental results show that the proposed LeMoLE model presents lower prediction errors and higher computational efficiency than existing LLM models.
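A minimal sketch of the idea in the abstract above, assuming one-step forecasts and hand-set weights. The names `linear_expert`, `mixture_forecast`, and `gate_scores` are illustrative and not taken from the paper. In particular, `gate_scores` only stands in for scores derived from LLM text features, whose computation the abstract does not specify.

```python
import math


def linear_expert(history, lookback, weights, bias):
    """Forecast one step from the last `lookback` observations with a linear map."""
    window = history[-lookback:]
    return sum(w * x for w, x in zip(weights, window)) + bias


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_forecast(history, experts, gate_scores):
    """Combine expert forecasts using softmax weights over `gate_scores`.

    Assumption: `gate_scores` is a placeholder for text-conditioned scores.
    """
    gate = softmax(gate_scores)
    forecasts = [
        linear_expert(history, e["lookback"], e["weights"], e["bias"])
        for e in experts
    ]
    return sum(g * f for g, f in zip(gate, forecasts))


history = [1.0, 2.0, 3.0, 4.0]
experts = [
    {"lookback": 2, "weights": [0.5, 0.5], "bias": 0.0},  # mean of last 2 values
    {"lookback": 4, "weights": [0.25] * 4, "bias": 0.0},  # mean of last 4 values
]
prediction = mixture_forecast(history, experts, gate_scores=[0.0, 0.0])
```

With equal gate scores the two experts are averaged; raising one score shifts the forecast toward that expert's lookback length.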

Title: Deep Learning-Based Electricity Price Forecast for Virtual Bidding in Wholesale Electricity Market

Authors: Xuesong Wang, Sharaf K. Magableh, Oraib Dawaghreh, Caisheng Wang, Jiaxuan Gong, Zhongyang Zhao, Michael H. Liao
Subjects: cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2412.00062
Pdf URL: https://arxiv.org/pdf/2412.00062
Copy Paste: [[2412.00062]] Deep Learning-Based Electricity Price Forecast for Virtual Bidding in Wholesale Electricity Market(https://arxiv.org/abs/2412.00062)
Keywords: generation
Abstract: Virtual bidding plays an important role in two-settlement electric power markets, as it can reduce discrepancies between day-ahead and real-time markets. Renewable energy penetration increases volatility in electricity prices, making accurate forecasting critical for virtual bidders who want to reduce uncertainty and maximize profits. This study presents a Transformer-based deep learning model to forecast the price spread between real-time and day-ahead electricity prices in the ERCOT (Electric Reliability Council of Texas) market. The proposed model leverages various time-series features, including load forecasts, solar and wind generation forecasts, and temporal attributes. The model is trained under realistic constraints and validated using a walk-forward approach in which the model is updated every week. Based on the price spread prediction results, several trading strategies are proposed, and the most effective strategy for maximizing cumulative profit under realistic market conditions is identified through backtesting. The results show that the strategy of trading only at the peak hour with a precision score of over 50% produces nearly consistent profit over the test period. The proposed method underscores the importance of an accurate electricity price forecasting model and introduces a new method of evaluating the price forecast model from a virtual bidder's perspective, providing valuable insights for future research.
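The walk-forward validation and precision score mentioned in the abstract above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: it assumes hourly samples and an expanding training window (the abstract does not say whether the window expands or rolls), and the function names are hypothetical.

```python
def walk_forward_splits(n_samples, initial_train, step):
    """Yield (train_indices, test_indices) for expanding-window walk-forward validation."""
    start = initial_train
    while start < n_samples:
        stop = min(start + step, n_samples)
        # Train on everything seen so far, test on the next block.
        yield range(0, start), range(start, stop)
        start = stop


def precision(predicted_positive, actual_positive):
    """Fraction of predicted-positive hours whose actual outcome was positive."""
    hits = sum(1 for p, a in zip(predicted_positive, actual_positive) if p and a)
    flagged = sum(1 for p in predicted_positive if p)
    return hits / flagged if flagged else 0.0


HOURS_PER_WEEK = 24 * 7
splits = list(
    walk_forward_splits(
        n_samples=4 * HOURS_PER_WEEK,
        initial_train=2 * HOURS_PER_WEEK,
        step=HOURS_PER_WEEK,
    )
)
```

Each split retrains on all earlier hours and evaluates on the following week, so no test hour ever precedes its training data.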

Title: DiffGuard: Text-Based Safety Checker for Diffusion Models

Authors: Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00064
Pdf URL: https://arxiv.org/pdf/2412.00064
Copy Paste: [[2412.00064]] DiffGuard: Text-Based Safety Checker for Diffusion Models(https://arxiv.org/abs/2412.00064)
Keywords: generation
Abstract: Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.

Title: Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions

Authors: Justin Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00073
Pdf URL: https://arxiv.org/pdf/2412.00073
Copy Paste: [[2412.00073]] Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions(https://arxiv.org/abs/2412.00073)
Keywords: generative
Abstract: The rise of advanced AI models like Generative Adversarial Networks (GANs) and diffusion models such as Stable Diffusion has made the creation of highly realistic images accessible, posing risks of misuse in misinformation and manipulation. This study evaluates the effectiveness of convolutional neural networks (CNNs), as well as DenseNet architectures, for detecting AI-generated images. Using variations of the CIFAKE dataset, including images generated by different versions of Stable Diffusion, we analyze the impact of updates and modifications such as Gaussian blurring, prompt text changes, and Low-Rank Adaptation (LoRA) on detection accuracy. The findings highlight vulnerabilities in current detection methods and propose strategies to enhance the robustness and reliability of AI-image detection systems.

Title: Graph Canvas for Controllable 3D Scene Generation

Authors: Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00091
Pdf URL: https://arxiv.org/pdf/2412.00091
Copy Paste: [[2412.00091]] Graph Canvas for Controllable 3D Scene Generation(https://arxiv.org/abs/2412.00091)
Keywords: generation
Abstract: Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets, and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce GraphCanvas3D, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D enables dynamic adaptability without the need for retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website (linked from the abstract page).

Title: Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

Authors: Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2412.00099
Pdf URL: https://arxiv.org/pdf/2412.00099
Copy Paste: [[2412.00099]] Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference(https://arxiv.org/abs/2412.00099)
Keywords: generation
Abstract: Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
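To make the cache-aware routing idea in the abstract above concrete, here is a toy sketch. It assumes one simple way to favor reuse, namely adding a fixed bonus to the router logits of experts already held in memory before the top-k selection. The bonus rule, the LRU policy, and every name below are assumptions for illustration, since the abstract does not describe the actual routing rule.

```python
from collections import OrderedDict


def cache_aware_top_k(router_logits, cached_experts, k, cache_bonus):
    """Pick k experts after adding `cache_bonus` to logits of experts already cached."""
    adjusted = [
        logit + (cache_bonus if idx in cached_experts else 0.0)
        for idx, logit in enumerate(router_logits)
    ]
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    return ranked[:k]


class ExpertCache:
    """LRU cache of expert ids, standing in for expert weights held in DRAM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def touch(self, expert_id):
        """Mark an expert as used, loading it (a miss) and evicting the oldest if needed."""
        if expert_id in self.entries:
            self.entries.move_to_end(expert_id)
            return
        self.misses += 1
        self.entries[expert_id] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


cache = ExpertCache(capacity=2)
for expert_id in (0, 1):
    cache.touch(expert_id)

logits = [1.0, 0.9, 1.05, 0.2]
plain = cache_aware_top_k(logits, set(), k=2, cache_bonus=0.0)
biased = cache_aware_top_k(logits, set(cache.entries), k=2, cache_bonus=0.2)
```

In this toy case plain routing would load expert 2 from slower storage, while the biased routing reuses the two cached experts. The bonus size trades routing fidelity against cache hits.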

Title: Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

Authors: Maitreya Patel, Song Wen, Dimitris N. Metaxas, Yezhou Yang
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00100
Pdf URL: https://arxiv.org/pdf/2412.00100
Copy Paste: [[2412.00100]] Steering Rectified Flow Models in the Vector Field for Controlled Image Generation(https://arxiv.org/abs/2412.00100)
Keywords: generation
Abstract: Diffusion models (DMs) excel in photorealism, image editing, and solving inverse problems, aided by classifier-free guidance and image inversion techniques. However, rectified flow models (RFMs) remain underexplored for these tasks. Existing DM-based methods often require additional training, lack generalization to pretrained latent models, underperform, and demand significant computational resources due to extensive backpropagation through ODE solvers and inversion processes. In this work, we first develop a theoretical and empirical understanding of the vector field dynamics of RFMs in efficiently guiding the denoising trajectory. Our findings reveal that we can navigate the vector field in a deterministic and gradient-free manner. Utilizing this property, we propose FlowChef, which leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. FlowChef is a unified framework for controlled image generation that, for the first time, simultaneously addresses classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. Finally, we perform extensive evaluations and show that FlowChef significantly outperforms baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. A project page is linked from the abstract page.

Title: BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Authors: Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00112
Pdf URL: https://arxiv.org/pdf/2412.00112
Copy Paste: [[2412.00112]] BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis(https://arxiv.org/abs/2412.00112)
Keywords: generation
Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
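A rough sketch of what probabilistic occlusion of body-part information during training could look like. It assumes each part's features are independently replaced by zeros with a fixed probability. The part names, the zero-masking choice, and the function name are hypothetical, as the abstract does not give these details.

```python
import random


def partially_occlude(part_features, occlusion_prob, rng):
    """Independently zero out each body part's features with probability `occlusion_prob`."""
    occluded = {}
    for part, features in part_features.items():
        if rng.random() < occlusion_prob:
            occluded[part] = [0.0] * len(features)  # hide this part's information
        else:
            occluded[part] = list(features)         # keep this part visible
    return occluded


# Hypothetical per-part feature vectors for one training sample.
sample = {
    "left_arm": [0.2, 0.7],
    "right_arm": [0.1, 0.4],
    "torso": [0.9, 0.3],
}
rng = random.Random(0)  # seeded so training-time masking is reproducible
masked = partially_occlude(sample, occlusion_prob=0.5, rng=rng)
```

Applying the mask only at training time means each part's generator cannot always rely on the other parts, which is one way to loosen the interdependency the abstract describes.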

Title: OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00115
Pdf URL: https://arxiv.org/pdf/2412.00115
Copy Paste: [[2412.00115]] OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation(https://arxiv.org/abs/2412.00115)
Keywords: generation
Abstract: Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. The source code and the dataset are available via the link on the abstract page.

Title: Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Authors: Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00122
Pdf URL: https://arxiv.org/pdf/2412.00122
Copy Paste: [[2412.00122]] Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback(https://arxiv.org/abs/2412.00122)
Keywords: generation
Abstract: Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, because the feedback content lacks focus, especially regarding object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, consisting of three stages. First, the images generated by the diffusion model are analyzed by a detector to obtain the object categories and quantities. Meanwhile, the confidence of category and quantity can be derived from the detection results and the given prompts. Next, we define a novel matching score, based on the above confidence, to measure text-image alignment. It can guide the model for feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward function gradients to generate semantically related images. Unlike previous feedback methods that focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. In addition, we construct a text-to-image dataset for studying compositional generation, comprising 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. Our model can also serve as a metric for evaluating text-image alignment in other models. All code and the dataset are available via the link on the abstract page.

Title: Auto-Encoded Supervision for Perceptual Image Super-Resolution

Authors: MinKyu Lee, Sangeek Hyun, Woojin Jun, Jae-Pil Heo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00124
Pdf URL: https://arxiv.org/pdf/2412.00124
Copy Paste: [[2412.00124]] Auto-Encoded Supervision for Perceptual Image Super-Resolution(https://arxiv.org/abs/2412.00124)
Keywords: super-resolution
Abstract: This work tackles the fidelity objective in perceptual super-resolution (SR). Specifically, we address the shortcomings of pixel-level $L_\text{p}$ loss ($L_\text{pix}$) in the GAN-based SR framework. Since $L_\text{pix}$ is known to have a trade-off relationship against perceptual quality, prior methods often multiply a small scale factor or utilize low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of $L_\text{pix}$ that contributes to blurring, and 2) only guiding based on the factor that is free from this trade-off relationship. We show that they can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with $L_\text{pix}$. Accordingly, we propose the Auto-Encoded Supervision for Optimal Penalization loss ($L_\text{AESOP}$), a novel loss function that measures distance in the AE space, instead of the raw pixel space. Note that the AE space indicates the space after the decoder, not the bottleneck. By simply substituting $L_\text{pix}$ with $L_\text{AESOP}$, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify that AESOP can lead to favorable results in the perceptual SR task.
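The difference between a pixel-space loss and a loss measured after the auto-encoder's decoder can be sketched on 1D signals. Everything here is a stand-in: `toy_autoencoder` is a simple 3-tap moving average, not a pretrained AE, and the L1 distance is an assumed choice of norm. The sketch shows only where the distance is measured, not how the paper's AE behaves.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def toy_autoencoder(signal):
    """Hypothetical stand-in for a pretrained AE: 3-tap moving average with edge padding."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [
        (padded[i] + padded[i + 1] + padded[i + 2]) / 3
        for i in range(len(signal))
    ]


def pixel_loss(sr, hr):
    """Distance measured directly in the raw pixel space."""
    return l1_distance(sr, hr)


def ae_space_loss(sr, hr, autoencoder):
    """Distance between decoder outputs AE(sr) and AE(hr), not between bottleneck codes."""
    return l1_distance(autoencoder(sr), autoencoder(hr))


hr = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # reference signal
sr = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # same texture, shifted by one sample
loss_pix = pixel_loss(sr, hr)
loss_ae = ae_space_loss(sr, hr, toy_autoencoder)
```

Swapping `pixel_loss` for `ae_space_loss` changes only the space in which the two images are compared, which matches the drop-in substitution the abstract describes.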

Title: Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Authors: Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00127
Pdf URL: https://arxiv.org/pdf/2412.00127
Copy Paste: [[2412.00127]] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads(https://arxiv.org/abs/2412.00127)
Keywords: generation
Abstract: We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved content. Unlike prior art on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.

Title: Open-Sora Plan: Open-Source Large Video Generation Model

Authors: Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00131
Pdf URL: https://arxiv.org/pdf/2412.00131
Copy Paste: [[2412.00131]] Open-Sora Plan: Open-Source Large Video Generation Model(https://arxiv.org/abs/2412.00131)
Keywords: generation
Abstract: We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs. Our project comprises multiple components for the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. Moreover, many auxiliary strategies for efficient training and inference are designed, and a multi-dimensional data curation pipeline is proposed for obtaining desired high-quality data. Benefiting from these efficient designs, our Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations. We hope our careful design and practical experience can inspire the video generation research community. All our code and model weights are publicly available via the link on the abstract page.

Title: Event-based Tracking of Any Point with Motion-Robust Correlation Features

Authors: Friedhelm Hamann, Daniel Gehrig, Filbert Febryanto, Kostas Daniilidis, Guillermo Gallego
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00133
Pdf URL: https://arxiv.org/pdf/2412.00133
Copy Paste: [[2412.00133]] Event-based Tracking of Any Point with Motion-Robust Correlation Features(https://arxiv.org/abs/2412.00133)
Keywords: generation
Abstract: Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, addressing scenarios with difficult lighting conditions and high-speed motions remains out of reach due to the limitations of the sensor. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion - thereby addressing an open challenge in purely event-based tracking - with a novel feature alignment loss which ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. Our method shows strong cross-dataset generalization and performs 135% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 19% improvement over the previous best event-only method and even surpasses the previous best events-and-frames method by 3.7%.

Title: Differentiable Topology Estimating from Curvatures for 3D Shapes

Authors: Yihao Luo
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00140
Pdf URL: https://arxiv.org/pdf/2412.00140
Copy Paste: [[2412.00140]] Differentiable Topology Estimating from Curvatures for 3D Shapes(https://arxiv.org/abs/2412.00140)
Keywords: generation
Abstract: In the field of data-driven 3D shape analysis and generation, the estimation of global topological features from localized representations such as point clouds, voxels, and neural implicit fields is a longstanding challenge. This paper introduces a novel, differentiable algorithm tailored to accurately estimate the global topology of 3D shapes, overcoming the limitations of traditional methods rooted in mesh reconstruction and topological data analysis. The proposed method ensures high accuracy, efficiency, and instant computation with GPU compatibility. It begins with an efficient calculation of the self-adjoint Weingarten map for point clouds and its adaptations for other modalities. The curvatures are then extracted, and their integration over tangent differentiable Voronoi elements is utilized to estimate key topological invariants, including the Euler number and Genus. Additionally, an auto-optimization mechanism is implemented to refine the local moving frames and area elements based on the integrity of topological invariants. Experimental results demonstrate the method's superior performance across various datasets. The robustness and differentiability of the algorithm ensure its seamless integration into deep learning frameworks, offering vast potential for downstream tasks in 3D shape analysis.
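The link between integrated curvature and topological invariants that the abstract above relies on is the Gauss-Bonnet theorem: for a closed orientable surface, the Euler number is the total Gaussian curvature divided by 2π, and the genus is (2 - χ) / 2. The sketch below checks this on a torus and a sphere using analytic curvatures and area elements. It does not reproduce the paper's Weingarten-map or Voronoi-element estimation, and all function names are illustrative.

```python
import math


def euler_characteristic(curvatures, areas):
    """Discrete Gauss-Bonnet: chi = (1 / 2*pi) * sum(K_i * A_i) for a closed surface."""
    total = sum(k * a for k, a in zip(curvatures, areas))
    return total / (2.0 * math.pi)


def genus_from_euler(chi):
    """Genus of a closed orientable surface: g = (2 - chi) / 2."""
    return (2.0 - chi) / 2.0


def torus_samples(major_radius, minor_radius, n):
    """Sample an n x n parameter grid on a torus with analytic curvature and area elements."""
    curvatures, areas = [], []
    du = dv = 2.0 * math.pi / n
    for _ in range(n):          # angle u around the main axis (values do not depend on u)
        for j in range(n):      # angle v around the tube
            v = (j + 0.5) * dv
            stretch = major_radius + minor_radius * math.cos(v)
            curvatures.append(math.cos(v) / (minor_radius * stretch))
            areas.append(minor_radius * stretch * du * dv)
    return curvatures, areas


def sphere_samples(radius, n):
    """Split a sphere into n equal-area patches with constant curvature 1 / r^2."""
    patch_area = 4.0 * math.pi * radius ** 2 / n
    return [1.0 / radius ** 2] * n, [patch_area] * n


chi_torus = euler_characteristic(*torus_samples(2.0, 0.5, n=64))
chi_sphere = euler_characteristic(*sphere_samples(1.5, n=100))
genus_torus = genus_from_euler(chi_torus)
genus_sphere = genus_from_euler(chi_sphere)
```

On real point clouds the per-point curvatures and area elements must be estimated, which is where the method's accuracy is decided; the sum itself stays this simple.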
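Summary: In data-driven 3D shape analysis and generation, estimating global topological features from localized representations such as point clouds, voxels, and neural implicit fields is a longstanding challenge. This paper introduces a novel differentiable algorithm designed to accurately estimate the global topology of 3D shapes, overcoming the limitations of traditional methods based on mesh reconstruction and topological data analysis. The proposed method offers high accuracy, efficiency, and instant computation with GPU compatibility. It begins with an efficient calculation of the self-adjoint Weingarten map for point clouds and adapts it to other modalities. Curvatures are then extracted, and their integrals over tangent differentiable Voronoi elements are used to estimate key topological invariants, including the Euler number and the genus. In addition, an auto-optimization mechanism refines the local moving frames and area elements based on the integrity of the topological invariants. Experimental results show that the method performs well across a variety of datasets. The robustness and differentiability of the algorithm allow it to be integrated seamlessly into deep learning frameworks, offering considerable potential for downstream tasks in 3D shape analysis.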
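Illustrative sketch (not the paper's algorithm): the relation between integrated curvature and the Euler number that the abstract relies on is the Gauss-Bonnet theorem. For a closed orientable surface it gives chi = (1 / 2*pi) * integral of K dA, and chi = 2 - 2g for genus g. The Python below assumes that per-point Gaussian curvatures and area elements are already available; the function names and the sphere test data are hypothetical and chosen only for illustration.

```python
import math

def euler_characteristic(curvatures, areas):
    """Discrete Gauss-Bonnet: chi ~ (1 / 2*pi) * sum_i K_i * A_i."""
    total_curvature = sum(k * a for k, a in zip(curvatures, areas))
    return total_curvature / (2.0 * math.pi)

def genus_from_euler(chi):
    """Closed orientable surface: chi = 2 - 2g, so g = (2 - chi) / 2."""
    return round((2.0 - chi) / 2.0)

# Hypothetical input: a sphere of radius r split into n equal-area patches.
# Its Gaussian curvature is 1 / r^2 everywhere and its area is 4 * pi * r^2.
r, n = 2.0, 1000
sphere_K = [1.0 / r**2] * n
sphere_A = [4.0 * math.pi * r**2 / n] * n

chi = euler_characteristic(sphere_K, sphere_A)
genus = genus_from_euler(chi)
print(chi, genus)
```

On real point clouds the curvature and area estimates are noisy, so the sum is only approximately an integer; rounding to the nearest integer is the simplest way to recover the invariant in this sketch.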

Title: Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Authors: Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.00142
Pdf URL: https://arxiv.org/pdf/2412.00142
Copy Paste: [[2412.00142]] Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers(https://arxiv.org/abs/2412.00142)
Keywords: generative
Abstract: Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs), a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
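Summary: Generative large multimodal models (LMMs) such as LLaVA and Qwen-VL excel at a wide range of vision-language (VL) tasks, such as image captioning and visual question answering. Despite this strong performance, LMMs are not directly suited to foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. A key challenge in using LMMs for discriminative tasks is extracting useful features from generative models. To address this, we propose an approach for finding features in the model's latent space so that LMMs can be leveraged more effectively for discriminative tasks. To this end, we present Sparse Attention Vectors (SAVs), a finetuning-free method that uses sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs achieve state-of-the-art performance compared with a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also suggest that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as effective and robust multimodal feature representations.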

Title: Motion Modes: What Could Happen Next?

Authors: Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, Paul Guerrero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00148
Pdf URL: https://arxiv.org/pdf/2412.00148
Copy Paste: [[2412.00148]] Motion Modes: What Could Happen Next?(https://arxiv.org/abs/2412.00148)
Keywords: generation
Abstract: Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator's latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and even human predictions regarding plausibility and diversity. Project Webpage: this https URL
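Summary: Predicting diverse object motions from a single static image remains challenging, because current video generation models often entangle object movement with camera motion and other scene changes. Although recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, which limits their application to complex scenes. We introduce Motion Modes, a training-free approach that explores the latent distribution of a pre-trained image-to-video generator to discover a variety of distinct and plausible motions focused on selected objects in static images. We achieve this with a flow generator guided by energy functions designed to disentangle object motion from camera motion. In addition, we use an energy inspired by particle guidance to diversify the generated motions without requiring explicit training data. Experimental results show that Motion Modes generates realistic and varied object animations, surpassing previous methods and even human predictions in plausibility and diversity. Project webpage: this https URL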

Title: ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Authors: Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00153
Pdf URL: https://arxiv.org/pdf/2412.00153
Copy Paste: [[2412.00153]] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model(https://arxiv.org/abs/2412.00153)
Keywords: generation
Abstract: Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, integrating the prediction result from previous step with textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
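Summary: Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, which limits free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region-of-interest candidate, allowing the model to predict dense and sparse masks simultaneously. In addition, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction that is independent of closed-set constraints or predefined categories. To further improve mask detail and category precision, we introduce a conversation-based refinement paradigm that combines the prediction result from the previous step with a textual prompt for revision. Extensive experiments show that ROSE achieves competitive performance across a variety of segmentation tasks within a unified framework. Code will be released.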

Title: VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Authors: Taesung Kwon, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00156
Pdf URL: https://arxiv.org/pdf/2412.00156
Copy Paste: [[2412.00156]] VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models(https://arxiv.org/abs/2412.00156)
Keywords: super-resolution
Abstract: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: this https URL.
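Summary: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advances in spatio-temporal optimization for video inverse problems with image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy that allows efficient operation on a single GPU. In addition, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: this https URL.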

Title: AerialGo: Walking-through City View Generation from Aerial Perspectives

Authors: Fuqiang Zhao, Yijing Guo, Siyuan Yang, Xi Chen, Luo Wang, Lan Xu, Yingliang Zhang, Yujiao Shi, Jingyi Yu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00157
Pdf URL: https://arxiv.org/pdf/2412.00157
Copy Paste: [[2412.00157]] AerialGo: Walking-through City View Generation from Aerial Perspectives(https://arxiv.org/abs/2412.00157)
Keywords: generation, generative
Abstract: High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is both labor-intensive and raises significant privacy concerns related to sensitive information, such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstructions without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support the model training, we introduce AerialGo dataset, a large-scale dataset containing diverse aerial and ground-view images, paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.
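Summary: High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is labor-intensive and raises significant privacy concerns related to sensitive information such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstruction without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support model training, we introduce the AerialGo dataset, a large-scale dataset of diverse aerial and ground-view images paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.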

Title: Origin-Destination Demand Prediction: An Urban Radiation and Attraction Perspective

Authors: Xuan Ma, Zepeng Bao, Ming Zhong, Yuanyuan Zhu, Chenliang Li, Jiawei Jiang, Qing Li, Tieyun Qian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00167
Pdf URL: https://arxiv.org/pdf/2412.00167
Copy Paste: [[2412.00167]] Origin-Destination Demand Prediction: An Urban Radiation and Attraction Perspective(https://arxiv.org/abs/2412.00167)
Keywords: generation
Abstract: In recent years, origin-destination (OD) demand prediction has gained significant attention for its profound implications in urban development. Existing data-driven deep learning methods primarily focus on the spatial or temporal dependency between regions, yet neglect regions' fundamental functional differences. Though knowledge-driven physical methods have characterised regions' functions by their radiation and attraction capacities, these functions are defined on numerical factors like population without considering regions' intrinsic nominal attributes, e.g., whether a region is a residential or industrial district. Moreover, the complicated relationships between the two types of capacities, e.g., the radiation capacity of a residential district in the morning being transformed into the attraction capacity in the evening, are entirely missing from physical methods. In this paper, we not only generalize the physical radiation and attraction capacities into the deep learning framework with the extended capability to fulfil regions' functions, but also present a new model that captures the relationships between the two types of capacities. Specifically, we first model regions' radiation and attraction capacities using a bilateral branch network, each equipped with regions' attribute representations. We then describe the transformation relationship of different capacities of the same region using a hypergraph-based parameter generation method. We finally unveil the competition relationship of different regions with the same attraction capacity through cluster-based adversarial learning. Extensive experiments on two datasets demonstrate the consistent improvements of our method over the state-of-the-art baselines, as well as the good explainability of regions' functions using their nominal attributes.
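Summary: In recent years, origin-destination (OD) demand prediction has attracted wide attention because of its profound implications for urban development. Existing data-driven deep learning methods mainly focus on the spatial or temporal dependency between regions and neglect regions' fundamental functional differences. Although knowledge-driven physical methods have characterised regions' functions by their radiation and attraction capacities, these functions are defined on numerical factors such as population and do not consider regions' intrinsic nominal attributes, e.g., whether a region is a residential or industrial district. Moreover, physical methods entirely miss the complicated relationships between the two types of capacities, e.g., the radiation capacity of a residential district in the morning is transformed into attraction capacity in the evening. In this paper, we not only generalize the physical radiation and attraction capacities into a deep learning framework with the extended capability to fulfil regions' functions, but also present a new model that captures the relationships between the two types of capacities. Specifically, we first model regions' radiation and attraction capacities with a bilateral branch network, each branch equipped with regions' attribute representations. We then describe the transformation relationship between different capacities of the same region using a hypergraph-based parameter generation method. Finally, we reveal the competition relationship between different regions with the same attraction capacity through cluster-based adversarial learning. Extensive experiments on two datasets show consistent improvements of our method over state-of-the-art baselines, as well as good explainability of regions' functions through their nominal attributes.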

Title: Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Authors: Hui Ren, Joanna Materzynska, Rohit Gandikota, David Bau, Antonio Torralba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00176
Pdf URL: https://arxiv.org/pdf/2412.00176
Copy Paste: [[2412.00176]] Art-Free Generative Models: Art Creation Without Graphic Art Knowledge(https://arxiv.org/abs/2412.00176)
Keywords: generation, generative
Abstract: We explore the question: "How much prior art knowledge is needed to create art?" To investigate this, we propose a text-to-image generation model trained without access to art-related content. We then introduce a simple yet effective method to learn an art adapter using only a few examples of selected artistic styles. Our experiments show that art generated using our method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Finally, through data attribution techniques, we illustrate how examples from both artistic and non-artistic datasets contributed to the creation of new artistic styles.
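Summary: We explore the question: "How much prior art knowledge is needed to create art?" To investigate this, we propose a text-to-image generation model trained without access to art-related content. We then introduce a simple yet effective method for learning an art adapter using only a few examples of selected artistic styles. Our experiments show that users perceive art generated with our method as comparable to art produced by models trained on large, art-rich datasets. Finally, through data attribution techniques, we illustrate how examples from both artistic and non-artistic datasets contribute to the creation of new artistic styles.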

Title: LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Authors: Xiaoyan Xing, Konrad Groh, Sezer Karagolu, Theo Gevers, Anand Bhattad
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00177
Pdf URL: https://arxiv.org/pdf/2412.00177
Copy Paste: [[2412.00177]] LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting(https://arxiv.org/abs/2412.00177)
Keywords: generative
Abstract: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.
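Summary: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy based on a StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both the latent intrinsic properties of the source image and the latent extrinsic properties of the target image. We further improve lighting transfer with a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike the traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images, preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments show that our method successfully transfers complex lighting phenomena, including specular highlights and indirect illumination, across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.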

Title: Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation

Authors: Michele De Vita, Vasileios Belagiannis
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00205
Pdf URL: https://arxiv.org/pdf/2412.00205
Copy Paste: [[2412.00205]] Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation(https://arxiv.org/abs/2412.00205)
Keywords: generation, generative
Abstract: Despite the remarkable progress in generative modelling, current diffusion models lack a quantitative approach to assess image quality. To address this limitation, we propose to estimate the pixel-wise aleatoric uncertainty during the sampling phase of diffusion models and utilise the uncertainty to improve the sample generation quality. The uncertainty is computed as the variance of the denoising scores with a perturbation scheme that is specifically designed for diffusion models. We then show that the aleatoric uncertainty estimates are related to the second-order derivative of the diffusion noise distribution. We evaluate our uncertainty estimation algorithm and the uncertainty-guided sampling on the ImageNet and CIFAR-10 datasets. In our comparisons with the related work, we demonstrate promising results in filtering out low quality samples. Furthermore, we show that our guided approach leads to better sample generation in terms of FID scores.
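Summary: Despite remarkable progress in generative modelling, current diffusion models lack a quantitative approach to assessing image quality. To address this limitation, we propose estimating the pixel-wise aleatoric uncertainty during the sampling phase of diffusion models and using that uncertainty to improve sample generation quality. The uncertainty is computed as the variance of the denoising scores under a perturbation scheme designed specifically for diffusion models. We then show that the aleatoric uncertainty estimates are related to the second-order derivative of the diffusion noise distribution. We evaluate our uncertainty estimation algorithm and the uncertainty-guided sampling on the ImageNet and CIFAR-10 datasets. In comparisons with related work, we show promising results in filtering out low-quality samples. Furthermore, we show that our guided approach leads to better sample generation in terms of FID scores.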

Title: Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Authors: Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian Price, Scott Cohen, Jianming Zhang, Daniel Aliaga
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00306
Pdf URL: https://arxiv.org/pdf/2412.00306
Copy Paste: [[2412.00306]] Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment(https://arxiv.org/abs/2412.00306)
Keywords: generation, generative
Abstract: Personalized image generation has emerged from the recent advancements in generative models. However, these generated personalized images often suffer from localized artifacts such as incorrect logos, reducing fidelity and fine-grained identity details of the generated results. Furthermore, there is little prior work tackling this problem. To help improve these identity details in the personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages: Alignment Stage and Refinement Stage, which share weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which are then used by the refinement stage to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, generalizing well to existing models on various tasks including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in the image synthesis models.
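Summary: Personalized image generation has emerged from recent advances in generative models. However, the generated personalized images often suffer from localized artifacts such as incorrect logos, which reduce the fidelity and fine-grained identity details of the results. Moreover, little prior work has tackled this problem. To help improve these identity details in personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that uses a diffusion-based framework to address this challenge. Our model consists of two stages, an Alignment Stage and a Refinement Stage, which share the weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which the refinement stage then uses to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image and generalizes well to existing models on a variety of tasks, including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons show that our pipeline greatly pushes the boundary of fine details in image synthesis models.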

Title: Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach

Authors: Feiyang Liu, Dan Guo, Jingyuan Xu, Zihao He, Shengeng Tang, Kun Li, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00309
Pdf URL: https://arxiv.org/pdf/2412.00309
Copy Paste: [[2412.00309]] Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach(https://arxiv.org/abs/2412.00309)
Keywords: generation
Abstract: Following the gaze of other people and analyzing the target they are looking at can help us understand what they are thinking and doing, and predict the actions that may follow. Existing methods for gaze following struggle to perform well in natural scenes with diverse objects, and focus on gaze points rather than objects, making it difficult to deliver clear semantics and an accurate scope of the targets. To address this shortcoming, we propose a novel gaze target prediction solution named GazeSeg, which fully utilizes the spatial visual field of the person as guiding information and leads to a progressive coarse-to-fine gaze target segmentation and recognition process. Specifically, a prompt-based visual foundation model serves as the encoder, working in conjunction with three distinct decoding modules (namely FoV perception, heatmap generation, and segmentation) to form the framework for gaze target prediction. Then, with the head bounding box serving as an initial prompt, GazeSeg obtains the FoV map, heatmap, and segmentation map progressively, leading to a unified framework for multiple tasks (e.g. direction estimation, gaze target segmentation, and recognition). In particular, to facilitate this research, we construct and release a new dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets, built upon the GazeFollow dataset. The quantitative evaluation shows that our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition. Meanwhile, our approach also outperforms previous state-of-the-art methods, achieving an AUC of 0.953 on the gaze-following task. The dataset and code will be released.
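Summary: Following the gaze of other people and analyzing the target they are looking at can help us understand what they are thinking and doing, and predict the actions they may take next. Existing gaze-following methods perform poorly in natural scenes with diverse objects and focus on gaze points rather than objects, which makes it difficult to deliver clear semantics and an accurate scope of the targets. To address this shortcoming, we propose a novel gaze target prediction solution, GazeSeg, which fully utilizes the person's spatial visual field as guiding information to achieve a coarse-to-fine gaze target segmentation and recognition process. Specifically, a prompt-based visual foundation model serves as the encoder and works with three distinct decoding modules (FoV perception, heatmap generation, and segmentation) to form the gaze target prediction framework. Then, with the head bounding box as the initial prompt, GazeSeg progressively obtains the FoV map, heatmap, and segmentation map, yielding a unified framework for multiple tasks (e.g., direction estimation, gaze target segmentation, and recognition). To facilitate this research, we construct and release a new dataset built upon the GazeFollow dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets. Quantitative evaluation shows that our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition. Our approach also outperforms previous state-of-the-art methods, achieving an AUC of 0.953 on the gaze-following task. The dataset and code will be released.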
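For readers unfamiliar with the segmentation metric quoted above, the Dice score between a predicted mask P and a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). The sketch below is a minimal pure-Python illustration on made-up binary masks; it is not the GazeSeg evaluation code, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks given as nested 0/1 lists."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 2x3 masks: two pixels overlap, each mask has three foreground pixels.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

score = dice_score(pred_mask, true_mask)
print(round(score, 3))
```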

Title: Fusing Physics-Driven Strategies and Cross-Modal Adversarial Learning: Toward Multi-Domain Applications

Authors: Hana Satou, Alan Mitkiy
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00341
Pdf URL: https://arxiv.org/pdf/2412.00341
Copy Paste: [[2412.00341]] Fusing Physics-Driven Strategies and Cross-Modal Adversarial Learning: Toward Multi-Domain Applications(https://arxiv.org/abs/2412.00341)
Keywords: generation
Abstract: The convergence of cross-modal adversarial learning and physics-driven methods represents a cutting-edge direction for tackling challenges in complex multi-modal tasks and scientific computing. This review focuses on systematically analyzing how these two approaches can be synergistically integrated to enhance performance and robustness across diverse application domains. By addressing key obstacles such as modality discrepancies, limited data availability, and insufficient model robustness, this paper highlights the role of physics-based optimization frameworks in facilitating efficient and interpretable adversarial perturbation generation. The review also explores significant advancements in cross-modal adversarial learning, including applications in tasks such as image cross-modal retrieval (e.g., infrared and RGB matching), scientific computing (e.g., solving partial differential equations), and optimization under physical consistency constraints in vision systems. By examining theoretical foundations and experimental outcomes, this study demonstrates the potential of combining these approaches to handle complex scenarios and improve the security of multi-modal systems. Finally, we outline future directions, proposing a novel framework that unifies physical principles with adversarial optimization, providing a pathway for researchers to develop robust and adaptable cross-modal learning methods with both theoretical and practical significance.
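Summary: The convergence of cross-modal adversarial learning and physics-driven methods represents a cutting-edge direction for tackling challenges in complex multi-modal tasks and scientific computing. This review systematically analyzes how the two approaches can be integrated synergistically to improve performance and robustness across diverse application domains. By addressing key obstacles such as modality discrepancies, limited data availability, and insufficient model robustness, the paper highlights the role of physics-based optimization frameworks in enabling efficient and interpretable adversarial perturbation generation. The review also explores significant advances in cross-modal adversarial learning, including applications in image cross-modal retrieval (e.g., infrared and RGB matching), scientific computing (e.g., solving partial differential equations), and optimization under physical consistency constraints in vision systems. By examining theoretical foundations and experimental outcomes, the study shows the potential of combining these approaches to handle complex scenarios and improve the security of multi-modal systems. Finally, we outline future directions and propose a novel framework that unifies physical principles with adversarial optimization, giving researchers a pathway to develop robust and adaptable cross-modal learning methods of both theoretical and practical significance.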

Title: Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey

Authors: Wei Zhou, Lei Zhao, Runyu Zhang, Yifan Cui, Hongpu Huang, Kun Qie, Chen Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00348
Pdf URL: https://arxiv.org/pdf/2412.00348
Copy Paste: [[2412.00348]] Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey(https://arxiv.org/abs/2412.00348)
Keywords: generation
Abstract: Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role for scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.
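Summary: Traffic Surveillance Systems (TSS) have become increasingly important in modern intelligent transportation systems, with vision-based technologies playing a central role in scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis that bridges low-level and high-level perception tasks, particularly in light of emerging technologies, is still lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and a comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations, and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks, and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, showing their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.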

Title: Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment

Authors: Dongfang Zhao
Subjects: cs.LG, cs.AI, math.AG
Abstract URL: https://arxiv.org/abs/2412.00373
Pdf URL: https://arxiv.org/pdf/2412.00373
Copy Paste: [[2412.00373]] Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment(https://arxiv.org/abs/2412.00373)
Keywords: generation
Abstract: Multimodal tasks, such as image-text retrieval and generation, require embedding data from diverse modalities into a shared representation space. Aligning embeddings from heterogeneous sources while preserving shared and modality-specific information is a fundamental challenge. This paper provides an initial attempt to integrate algebraic geometry into multimodal representation learning, offering a foundational perspective for further exploration. We model image and text data as polynomials over discrete rings, $ \mathbb{Z}_{256}[x] $ and $ \mathbb{Z}_{|V|}[x] $, respectively, enabling the use of algebraic tools like fiber products to analyze alignment properties. To accommodate real-world variability, we extend the classical fiber product to an approximate fiber product with a tolerance parameter $ \epsilon $, balancing precision and noise tolerance. We study its dependence on $ \epsilon $, revealing asymptotic behavior, robustness to perturbations, and sensitivity to embedding dimensionality. Additionally, we propose a decomposition of the shared embedding space into orthogonal subspaces, $ Z = Z_s \oplus Z_I \oplus Z_T $, where $ Z_s $ captures shared semantics, and $ Z_I $, $ Z_T $ encode modality-specific features. This decomposition is geometrically interpreted via manifolds and fiber bundles, offering insights into embedding structure and optimization. This framework establishes a principled foundation for analyzing multimodal alignment, uncovering connections between robustness, dimensionality allocation, and algebraic structure. It lays the groundwork for further research on embedding spaces in multimodal learning using algebraic geometry.
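Summary: Multimodal tasks, such as image-text retrieval and generation, require embedding data from different modalities into a shared representation space. Aligning embeddings from heterogeneous sources while preserving shared and modality-specific information is a fundamental challenge. This paper makes an initial attempt to integrate algebraic geometry into multimodal representation learning, offering a foundational perspective for further exploration. We model image and text data as polynomials over the discrete rings $ \mathbb{Z}_{256}[x] $ and $ \mathbb{Z}_{|V|}[x] $, respectively, which allows algebraic tools such as fiber products to be used to analyze alignment properties. To accommodate real-world variability, we extend the classical fiber product to an approximate fiber product with a tolerance parameter $ \epsilon $, balancing precision and noise tolerance. We study its dependence on $ \epsilon $, revealing asymptotic behavior, robustness to perturbations, and sensitivity to embedding dimensionality. In addition, we propose a decomposition of the shared embedding space into orthogonal subspaces, $ Z = Z_s \oplus Z_I \oplus Z_T $, where $ Z_s $ captures shared semantics and $ Z_I $, $ Z_T $ encode modality-specific features. This decomposition is interpreted geometrically via manifolds and fiber bundles, offering insight into embedding structure and optimization. The framework establishes a principled foundation for analyzing multimodal alignment, uncovering connections between robustness, dimensionality allocation, and algebraic structure. It lays the groundwork for further research on embedding spaces in multimodal learning using algebraic geometry.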
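A toy illustration of the approximate fiber product idea follows. The abstract does not state the formal definition, so this sketch assumes one plausible reading: given maps f: X -> Z and g: Y -> Z into a shared space, keep the pairs (x, y) with ||f(x) - g(y)|| <= epsilon, which reduces to the classical fiber product when epsilon = 0. The embeddings, the names, and the choice of Euclidean distance are all hypothetical.

```python
import math

def approx_fiber_product(xs, ys, f, g, eps):
    """Pairs (x, y) whose images in the shared space lie within eps of each other."""
    return [(x, y) for x in xs for y in ys if math.dist(f(x), g(y)) <= eps]

# Hypothetical 2-D embeddings for two images and two captions.
image_emb = {"img_cat": (1.0, 0.0), "img_dog": (0.0, 1.0)}
text_emb = {"cat": (0.9, 0.1), "dog": (0.0, 1.0)}

f = image_emb.__getitem__  # image id -> point in shared space
g = text_emb.__getitem__   # caption -> point in shared space

exact = approx_fiber_product(image_emb, text_emb, f, g, eps=0.0)
loose = approx_fiber_product(image_emb, text_emb, f, g, eps=0.2)
print(exact)
print(loose)
```

With epsilon = 0 only the exactly coinciding pair survives; raising epsilon admits the slightly noisy cat pair as well, which is the precision versus noise-tolerance trade-off the abstract describes.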

Title: DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation

Authors: Zhaoxing Gan, Guangnan Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00381
Pdf URL: https://arxiv.org/pdf/2412.00381
Copy Paste: [[2412.00381]] DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation(https://arxiv.org/abs/2412.00381)
Keywords: generation, generative
Abstract: Layout Generation aims to synthesize plausible arrangements from given elements. Currently, the predominant methods in layout generation are Generative Adversarial Networks (GANs) and diffusion models, each presenting its own set of challenges. GANs typically struggle with handling discrete data due to their requirement for differentiable generated samples and have historically circumvented the direct generation of discrete labels by treating them as fixed conditions. Conversely, diffusion-based models, despite achieving state-of-the-art performance across several metrics, require extensive sampling steps which lead to significant time costs. To address these limitations, we propose DogLayout (Denoising Diffusion GAN Layout model), which integrates a diffusion process into GANs to enable the generation of discrete label data and significantly reduce diffusion's sampling time. Experiments demonstrate that DogLayout considerably reduces sampling costs by up to 175 times and cuts overlap from 16.43 to 9.59 compared to existing diffusion models, while also surpassing GAN based and other layout methods. Code is available at this https URL.
摘要：布局生成旨在从给定元素中合成合理的排列。目前，布局生成的主要方法是生成对抗网络 (GAN) 和扩散模型，每种方法都存在各自的挑战。由于需要可微的生成样本，GAN 通常难以处理离散数据，并且历来通过将离散标签视为固定条件来规避其直接生成。相反，基于扩散的模型尽管在多个指标上实现了最先进的性能，但需要大量的采样步骤，这会导致大量的时间成本。为了解决这些限制，我们提出了 DogLayout（Denoising Diffusion GAN Layout 模型），它将扩散过程集成到 GAN 中，以便生成离散标签数据并显著减少扩散的采样时间。实验表明，与现有的扩散模型相比，DogLayout 可将采样成本降低多达 175 倍，并将重叠度从 16.43 降至 9.59，同时也超越了基于 GAN 的方法和其他布局方法。代码可在此 https URL 上找到。

Title: On autoregressive deep learning models for day-ahead wind power forecasting with irregular shutdowns due to redispatching

Authors: Stefan Meisenbacher, Silas Aaron Selzer, Mehdi Dado, Maximilian Beichter, Tim Martin, Markus Zdrallek, Peter Bretschneider, Veit Hagenmeyer, Ralf Mikut
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00423
Pdf URL: https://arxiv.org/pdf/2412.00423
Copy Paste: [[2412.00423]] On autoregressive deep learning models for day-ahead wind power forecasting with irregular shutdowns due to redispatching(https://arxiv.org/abs/2412.00423)
Keywords: generation
Abstract: Renewable energies and their operation are becoming increasingly vital for the stability of electrical power grids since conventional power plants are progressively being displaced, and their contribution to redispatch interventions is thereby diminishing. In order to consider renewable energies like Wind Power (WP) for such interventions as a substitute, day-ahead forecasts are necessary to communicate their availability for redispatch planning. In this context, automated and scalable forecasting models are required for the deployment to thousands of locally-distributed onshore WP turbines. Furthermore, the irregular interventions into the WP generation capabilities due to redispatch shutdowns pose challenges in the design and operation of WP forecasting models. Since state-of-the-art forecasting methods consider past WP generation values alongside day-ahead weather forecasts, redispatch shutdowns may impact the forecast. Therefore, the present paper highlights these challenges and analyzes state-of-the-art forecasting methods on data sets with both regular and irregular shutdowns. Specifically, we compare the forecasting accuracy of three autoregressive Deep Learning (DL) methods to methods based on WP curve modeling. Interestingly, the latter achieve lower forecasting errors, have fewer requirements for data cleaning during modeling and operation while being computationally more efficient, suggesting their advantages in practical applications.
摘要：由于传统发电厂正在逐渐被取代，因此它们对重新调度干预的贡献正在减少，可再生能源及其运行对于电网的稳定性变得越来越重要。为了考虑使用风能 (WP) 等可再生能源作为此类干预的替代品，需要提前一天进行预测，以传达它们在重新调度计划中的可用性。在这种情况下，需要自动化和可扩展的预测模型来部署到数千台本地分布的陆上 WP 涡轮机。此外，由于重新调度停机而对 WP 发电能力的不定期干预对 WP 预测模型的设计和运行提出了挑战。由于最先进的预测方法考虑了过去的 WP 发电值以及提前一天的天气预报，因此重新调度停机可能会影响预测。因此，本文重点介绍了这些挑战，并分析了具有定期和不定期停机的数据集上最先进的预测方法。具体来说，我们将三种自回归深度学习 (DL) 方法的预测精度与基于 WP 曲线建模的方法进行了比较。有趣的是，后者实现了更低的预测误差，在建模和操作过程中对数据清理的要求更少，同时计算效率更高，这表明它们在实际应用中具有优势。

Title: FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00427
Pdf URL: https://arxiv.org/pdf/2412.00427
Copy Paste: [[2412.00427]] FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting(https://arxiv.org/abs/2412.00427)
Keywords: generation
Abstract: In this study, we aim to determine and solve the deficiency of Stable Diffusion Inpainting (SDI) in following the instruction of both prompt and mask. Due to the training bias from masking, the inpainting quality is hindered when the prompt instruction and image condition are not related. Therefore, we conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency of image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding up to a 60% and 58% improvement of SDI and SDXLI in the CLIP score.
摘要：在本研究中，我们旨在确定并解决稳定扩散修复 (SDI) 在遵循提示和掩码指令方面的不足。由于掩码的训练偏差，当提示指令和图像条件不相关时，修复质量会受到阻碍。因此，我们对 SDI 学习到的内部表示进行了详细分析，重点研究了掩码输入如何影响交叉注意层。我们观察到，将文本关键标记调整为输入掩码可以使模型在给定区域内有选择地绘制。利用这些见解，我们提出了 FreeCond，它只调整输入掩码条件和图像条件。通过增加潜在掩码值并修改图像条件的频率，我们将交叉注意特征与模型的训练偏差对齐，以提高生成质量而无需额外的计算，特别是当用户输入复杂且偏离训练设置时。大量实验表明，FreeCond 可以增强任何基于 SDI 的模型，例如，可以使 CLIP 分数中的 SDI 和 SDXLI 分别提高 60% 和 58%。

Title: A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge

Authors: Atharva Deshpande, Kaushik Gopalan, Jeet Shah, Hrishikesh Simu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00451
Pdf URL: https://arxiv.org/pdf/2412.00451
Copy Paste: [[2412.00451]] A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge(https://arxiv.org/abs/2412.00451)
Keywords: generative
Abstract: This study explores the application of deep learning for rainfall prediction, leveraging the Spinning Enhanced Visible and Infrared Imager (SEVIRI) High rate information transmission (HRIT) data as input and the Operational Program on the Exchange of weather RAdar information (OPERA) ground-radar reflectivity data as ground truth. We use the mean of 4 InfraRed frequency channels as the input. The radiance images are forecasted up to 4 hours into the future using a dense optical flow algorithm. A conditional generative adversarial network (GAN) model is employed to transform the predicted radiance images into rainfall images which are aggregated over the 4 hour forecast period to generate cumulative rainfall values. This model scored a value of approximately 7.5 as the Continuous Ranked Probability Score (CRPS) in the Weather4Cast 2024 competition and placed 1st on the core challenge leaderboard.
摘要：本研究探索了深度学习在降雨预测中的应用，利用旋转增强型可见光和红外成像仪 (SEVIRI) 高速信息传输 (HRIT) 数据作为输入，以天气雷达信息交换业务计划 (OPERA) 地面雷达反射率数据作为地面实况。我们使用 4 个红外频道的平均值作为输入。使用密集光流算法预测未来 4 小时的辐射图像。采用条件生成对抗网络 (GAN) 模型将预测的辐射图像转换为降雨图像，并在 4 小时预测期内汇总以生成累积降雨值。该模型在 Weather4Cast 2024 竞赛中的连续排序概率得分 (CRPS) 约为 7.5，并在核心挑战排行榜上名列第一。
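
Note on the metric: the abstract reports a CRPS of about 7.5 without restating the definition. For reference, the standard textbook form of the score for a forecast with cumulative distribution function F and an observed value y is given below. How the Weather4Cast 2024 organizers aggregated the score over pixels, time steps, and regions is not stated in the abstract, so this is only the general definition, not the competition's exact evaluation procedure.

```latex
\[
\mathrm{CRPS}(F, y) \;=\; \int_{-\infty}^{\infty}
  \bigl( F(x) - \mathbf{1}\{x \ge y\} \bigr)^{2} \, dx
\]
% Lower is better. For a deterministic (point) forecast \hat{y}, F is a step
% function at \hat{y} and the score reduces to the absolute error:
\[
\mathrm{CRPS}(\hat{y}, y) \;=\; \lvert \hat{y} - y \rvert
\]
```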

Title: Homeostazis and Sparsity in Transformer

Authors: Leonid Kotyuzanskiy, Artem Klimov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00503
Pdf URL: https://arxiv.org/pdf/2412.00503
Copy Paste: [[2412.00503]] Homeostazis and Sparsity in Transformer(https://arxiv.org/abs/2412.00503)
Keywords: generation
Abstract: The transformer architecture has become an integral part of modern neural networks, playing a crucial role in a variety of tasks such as text generation, machine translation, and image and audio processing. There is also an alternative approach to building intelligent systems, proposed by Jeff Hawkins and inspired by the processes occurring in the neocortex. In this article we combine some of these ideas: we propose using homeostasis mechanisms, such as RFB-kWTA and "Smart" Inhibition, in the attention mechanism of the transformer and at the output of the transformer block, and we conduct an experiment that introduces sparse distributed representations at various points in the transformer. RFB-kWTA utilizes statistics of layer activations across time to adjust the entire layer, enhancing the values of rare activations while reducing those of frequent ones. "Smart" Inhibition also uses activation statistics to sample sparsity masks, with rarer activations more likely to be activated. Our proposed mechanisms significantly outperform both the classical transformer (0.2768 BLEU) and a model that only uses dropout in the attention mechanism and at the output of the transformer block (0.3007 BLEU), achieving a score of 0.3062 BLEU on the Multi30K dataset.
摘要：Transformer 架构已成为现代神经网络领域不可或缺的一部分，在文本生成、机器翻译、图像和音频处理等各种任务中发挥着至关重要的作用。Jeff Hawkins 还提出了一种构建智能系统的替代方法，该方法受到大脑皮层中发生的过程的启发。在我们的文章中，我们想结合这些想法中的一些，并提出在 Transformer 的注意力机制和 Transformer 块的输出中使用稳态机制，例如 RFB-kWTA 和“智能”抑制，以及进行一项实验，涉及在各个点引入 Transformer 的稀疏分布式表示。RFB-kWTA 利用跨时间的层激活统计数据来调整整个层，增强罕见激活的值，同时降低频繁激活的值。“智能”抑制还使用激活统计数据来采样稀疏掩码，激活时间越少，激活的可能性就越大。我们提出的机制明显优于经典 Transformer 的 0.2768 BLEU 和仅在注意力机制中使用 dropout 以及 Transformer 块输出的 0.3007 BLEU 的模型，在 Multi30K 数据集上取得了 0.3062 的分数。
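
Illustration: the abstract describes RFB-kWTA only in general terms (activation statistics over time are used to boost rare activations and damp frequent ones). The sketch below shows a generic k-winners-take-all step with frequency-based boosting to make that idea concrete. It is not the authors' RFB-kWTA: the exponential gain rule, the running-average update, and the names `boosted_kwta`, `duty_cycle`, `boost_strength`, and `momentum` are all assumptions made for this example, and activations are assumed to be non-negative.

```python
import math


def boosted_kwta(activations, duty_cycle, k, boost_strength=1.0, momentum=0.9):
    """One k-winners-take-all step with frequency-based boosting (illustrative).

    activations: non-negative unit activations for one input.
    duty_cycle:  running estimate of how often each unit has won so far.
    Returns (sparse_activations, updated_duty_cycle).
    """
    units = len(activations)
    target_rate = k / units  # firing rate each unit would have if wins were even

    # Assumed gain rule: units that win less often than the target get gain > 1,
    # units that win more often get gain < 1.
    gains = [math.exp(boost_strength * (target_rate - d)) for d in duty_cycle]
    boosted = [a * g for a, g in zip(activations, gains)]

    # Pick the k units with the largest boosted activation.
    ranked = sorted(range(units), key=lambda i: boosted[i], reverse=True)
    winners = set(ranked[:k])

    # Winners keep their original (unboosted) value; all other units are zeroed.
    sparse = [a if i in winners else 0.0 for i, a in enumerate(activations)]

    # Exponential moving average of the win indicator.
    new_duty = [
        momentum * d + (1.0 - momentum) * (1.0 if i in winners else 0.0)
        for i, d in enumerate(duty_cycle)
    ]
    return sparse, new_duty


# Unit 0 has the largest raw activation but has been winning very often,
# so boosting lets the rarely active units 1 and 3 win instead.
sparse, new_duty = boosted_kwta(
    activations=[0.9, 0.8, 0.1, 0.7],
    duty_cycle=[0.9, 0.0, 0.0, 0.0],
    k=2,
)
```

Without the gain term, plain top-2 selection would pick units 0 and 1; with it, the frequently winning unit 0 is suppressed in favor of units 1 and 3.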

Title: Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

Authors: Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.00505
Pdf URL: https://arxiv.org/pdf/2412.00505
Copy Paste: [[2412.00505]] Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion(https://arxiv.org/abs/2412.00505)
Keywords: generative
Abstract: Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
摘要：受生成图像模型成功的启发，最近关于学习图像压缩的研究越来越多地关注自然图像分布的更好的概率模型，从而实现出色的图像质量。然而，这是以计算复杂度为代价的，计算复杂度比当今的商业编解码器高出几个数量级，因此对于大多数实际应用而言是无法承受的。通过本文，我们证明，通过专注于对视觉感知进行建模而不是对数据分布进行建模，我们可以在视觉质量和比特率之间实现非常好的平衡，类似于 HiFiC 等“生成”压缩模型，同时解压缩所需的乘法累加运算 (MAC) 不到 1%。我们通过针对 Wasserstein 失真 (WD) 优化过度拟合的图像编解码器 C3，并通过人工评估研究评估图像重建来实现这一点。研究还表明，无论是作为优化目标还是作为人工评分的预测指标，WD 的表现都优于其他感知质量指标，如 LPIPS、DISTS 和 MS-SSIM，与 Elo 分数的 Pearson 相关性达到 94% 以上。

Title: Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Authors: Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2412.00508
Pdf URL: https://arxiv.org/pdf/2412.00508
Copy Paste: [[2412.00508]] Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence(https://arxiv.org/abs/2412.00508)
Keywords: generative
Abstract: Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes but their effectiveness on industry relevant case studies still needs to be investigated.
摘要：控制结构设计是 P&ID 开发中一个重要但繁琐的步骤。生成式人工智能 (AI) 有望通过支持工程师来缩短 P&ID 开发时间。之前对化学过程设计中生成式人工智能的研究主要通过序列来表示过程。然而，由于图具有置换不变性，它提供了一种很有前途的替代方案。我们提出了 Graph-to-SFILES 模型，这是一种生成式人工智能方法，可以从流程图拓扑中预测控制结构。Graph-to-SFILES 模型将流程图拓扑作为图输入，并以 SFILES 2.0 符号中的序列返回控制扩展流程图。我们比较了四种不同的图编码器架构，其中一种是本文提出的图神经网络 (GNN)。在 10,000 个流程图拓扑上进行训练时，Graph-to-SFILES 模型实现了 73.2% 的 top-5 准确率。此外，所提出的 GNN 在编码器架构中表现最佳。与纯基于序列的方法相比，Graph-to-SFILES 模型将 1,000 个流程图的相对较小的训练数据集的前 5 名准确率从 0.9% 提高到 28.4%。然而，基于序列的方法在 100,000 个流程图的大规模数据集上表现更好。这些结果凸显了基于图形的 AI 模型在小数据范围内加速 P&ID 开发的潜力，但它们在行业相关案例研究中的有效性仍有待研究。
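
Illustration: the results above are reported as top-5 accuracy. As a minimal sketch of how such a number is usually computed for a sequence-generating model, the snippet below counts a sample as correct when the reference string appears among the model's first five ranked candidates. The function name `top_k_accuracy` and the toy strings are hypothetical; they are not real SFILES 2.0 output, and the paper may apply additional normalization before comparing strings.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   targets: Sequence[str], k: int = 5) -> float:
    """Fraction of samples whose target is among the first k ranked candidates."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one ranked candidate list per target")
    if not targets:
        raise ValueError("evaluation set is empty")
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy placeholder strings standing in for generated sequences.
predictions = [
    ["a", "b", "c", "d", "e", "f"],  # target "c" is ranked 3rd: hit
    ["x", "y", "z"],                 # target "w" never appears: miss
    ["p", "q", "r", "s", "t", "u"],  # target "u" is ranked 6th: miss for k=5
]
targets = ["c", "w", "u"]
score = top_k_accuracy(predictions, targets, k=5)
```

In this toy case only the first sample counts as a hit, so one of three samples is correct at k=5.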

Title: Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects

Authors: Amir Barda, Matheus Gadelha, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00518
Pdf URL: https://arxiv.org/pdf/2412.00518
Copy Paste: [[2412.00518]] Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects(https://arxiv.org/abs/2412.00518)
Keywords: generation, generative
Abstract: We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor of training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.
摘要：我们提出了一种生成技术，用于在大约 3 秒内编辑 3D 形状（表示为网格、NeRF 或高斯 Splats），而无需运行 SDS 类型的优化。我们的主要见解是将 3D 编辑视为多视图图像修复问题，因为这种表示是通用的，可以使用可用的大型重建模型库映射回任何 3D 表示。我们探索了不同的微调策略，以在同一个扩散模型中获得多视图生成和修复功能。特别是，修复蒙版的设计是训练修复模型的重要因素，我们提出了几种蒙版策略来模拟用户对 3D 形状执行的编辑类型。我们的方法将 3D 生成编辑的时间从几小时缩短到几秒钟，并且与之前的工作相比产生了更高质量的结果。

Title: Human Action CLIPS: Detecting AI-generated Human Motion

Authors: Matyas Bohacek, Hany Farid
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00526
Pdf URL: https://arxiv.org/pdf/2412.00526
Copy Paste: [[2412.00526]] Human Action CLIPS: Detecting AI-generated Human Motion(https://arxiv.org/abs/2412.00526)
Keywords: generation
Abstract: Full-blown AI video generation continues its journey through the uncanny valley toward content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound low- to mid-level approaches. The method is evaluated on a custom-built dataset of video clips with human actions generated by seven text-to-video AI models, together with matching real footage.
摘要：成熟的人工智能视频生成技术继续穿越恐怖谷，生成与现实在感知上难以区分的内容。许多令人兴奋且富有创意的应用程序与恶意应用程序混杂在一起，这些应用程序会危害个人、组织和民主。我们描述了一种有效且强大的技术，用于区分真实的人类动作和人工智能生成的人类动作。该技术利用多模态语义嵌入，使其能够抵御通常会混淆低级到中级方法的洗钱类型。该方法针对定制的视频剪辑数据集进行评估，该数据集包含由七个文本到视频的人工智能模型生成的人类动作，并匹配真实镜头。

Title: Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning

Authors: Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Yuying Chen, Lihui Jiang, Bingbing Liu, Yingcong Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00547
Pdf URL: https://arxiv.org/pdf/2412.00547
Copy Paste: [[2412.00547]] Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning(https://arxiv.org/abs/2412.00547)
Keywords: generation
Abstract: Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion results lacking logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose Motion Dreamer, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation (such as a segmentation map or depth map) based on the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach allows for more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of ego-cars, our model can produce different effects on other vehicles. Our work opens new avenues in creating models that can reason about physical interactions in a more coherent and realistic manner.
摘要：最近，许多视频生成模型（也称为世界模型）已经展示了生成可信的真实世界视频的能力。然而，许多研究表明，这些模型产生的运动结果往往缺乏逻辑或物理连贯性。在本文中，我们重新审视了视频生成模型，发现单阶段方法难以在保持连贯的运动推理的同时产生高质量的结果。为了解决这个问题，我们提出了 Motion Dreamer，一个两阶段的视频生成框架。在第一阶段，模型根据输入图像和运动条件生成中间运动表示（例如分割图或深度图），仅关注运动本身。在第二阶段，模型使用这个中间运动表示作为生成高细节视频的条件。通过将运动推理与高保真视频合成分离，我们的方法可以生成更准确、更符合物理规律的运动。我们在 Physion 数据集和自动驾驶场景中验证了我们方法的有效性。例如，只需轻轻一推，我们的模型就能模拟出一组多米诺骨牌的连续倒塌。同样，通过改变自我汽车的运动，我们的模型可以对其他车辆产生不同的影响。我们的工作开辟了新的途径，可以创建能够以更连贯、更现实的方式推理物理相互作用的模型。

Title: Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00557
Pdf URL: https://arxiv.org/pdf/2412.00557
Copy Paste: [[2412.00557]] Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion(https://arxiv.org/abs/2412.00557)
Keywords: restoration
Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.
摘要：盲逆问题对于许多计算机视觉应用至关重要，其中目标数据和前向算子都是未知的。现有方法通常依赖于限制性假设，例如额外训练、算子线性或窄图像分布，从而限制了它们的通用性。在这项工作中，我们提出了 LADiBI，这是一个无需训练的框架，它使用大规模文本到图像扩散模型以最少的假设解决盲逆问题。通过利用自然语言提示，LADiBI 联合建模目标图像和算子的先验，允许在各种任务中灵活调整。此外，我们提出了一种新颖的后验采样方法，将有效的算子初始化与迭代细化相结合，使 LADiBI 无需预定义的算子形式即可运行。我们的实验表明，LADiBI 能够解决各种图像恢复任务，包括不同目标图像分布上的线性和非线性问题。

Title: Contextual Bandits in Payment Processing: Non-uniform Exploration and Supervised Learning at Adyen

Authors: Akhila Vangara, Alex Egg
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2412.00569
Pdf URL: https://arxiv.org/pdf/2412.00569
Copy Paste: [[2412.00569]] Contextual Bandits in Payment Processing: Non-uniform Exploration and Supervised Learning at Adyen(https://arxiv.org/abs/2412.00569)
Keywords: generation
Abstract: Uniform random exploration in decision-making systems supports off-policy learning via supervision but incurs high regret, making it impractical for many applications. Conversely, non-uniform exploration offers better immediate performance but lacks support for off-policy learning. Recent research suggests that regression oracles can bridge this gap by combining non-uniform exploration with supervised learning. In this paper, we analyze these approaches within a real-world industrial context at Adyen, a large global payments processor characterized by batch logged delayed feedback, short-term memory, and dynamic action spaces under the Empirical Risk Minimization (ERM) framework. Our analysis reveals that while regression oracles significantly improve performance, they introduce challenges due to rigid algorithmic assumptions. Specifically, we observe that as a policy improves, subsequent generations may perform worse due to shifts in the reward distribution and increased class imbalance in the training data. This degradation occurs despite improvements in other aspects of the training data, leading to decreased performance in successive policy iterations. We further explore the long-term impact of regression oracles, identifying a potential "oscillation effect." This effect arises when regression oracles influence probability estimates and the realizability of subsequent policy models, leading to fluctuations in performance across iterations. Our findings highlight the need for more adaptable algorithms that can leverage the benefits of regression oracles without introducing instability in policy performance over time.
摘要：决策系统中的均匀随机探索支持通过监督进行离策略学习，但会产生很高的遗憾，因此对于许多应用来说并不实用。相反，非均匀探索提供了更好的即时性能，但缺乏对离策略学习的支持。最近的研究表明，回归预言可以通过将非均匀探索与监督学习相结合来弥补这一差距。在本文中，我们在 Adyen 的现实工业环境中分析了这些方法，Adyen 是一家大型全球支付处理器，其特点是批量记录延迟反馈、短期记忆和经验风险最小化 (ERM) 框架下的动态动作空间。我们的分析表明，虽然回归预言显著提高了性能，但由于算法假设僵化，它们带来了挑战。具体来说，我们观察到，随着策略的改进，由于奖励分布的变化和训练数据中的类别不平衡增加，后续几代的表现可能会更差。尽管训练数据的其他方面有所改进，但这种退化仍然会发生，导致连续策略迭代中的性能下降。我们进一步探讨了回归预言的长期影响，确定了潜在的“振荡效应”。当回归预言影响概率估计和后续策略模型的可实现性时，就会出现这种影响，从而导致迭代过程中的性能波动。我们的研究结果强调了对更具适应性的算法的需求，这些算法可以利用回归预言的优势，而不会随着时间的推移导致策略性能不稳定。

Title: Continuous Concepts Removal in Text-to-image Diffusion Models

Authors: Tingxu Han, Weisong Sun, Yanrong Hu, Chunrong Fang, Yonglong Zhang, Shiqing Ma, Tao Zheng, Zhenyu Chen, Zhenting Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00580
Pdf URL: https://arxiv.org/pdf/2412.00580
Copy Paste: [[2412.00580]] Continuous Concepts Removal in Text-to-image Diffusion Models(https://arxiv.org/abs/2412.00580)
Keywords: generation
Abstract: Text-to-image diffusion models have shown an impressive ability to generate high-quality images from input textual descriptions. However, concerns have been raised about the potential for these models to create content that infringes on copyrights or depicts disturbing subject matter. Removing specific concepts from these models is a promising potential solution to this problem. However, existing methods for concept removal do not work well in practical but challenging scenarios where concepts need to be continuously removed. Specifically, these methods lead to poor alignment between the text prompts and the generated image after the continuous removal process. To address this issue, we propose a novel approach called CCRT that includes a designed knowledge distillation paradigm. It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts generated through our genetic algorithm, which employs a designed fuzzing strategy. We conduct extensive experiments involving the removal of various concepts. The results evaluated through both algorithmic metrics and human studies demonstrate that our CCRT can effectively remove the targeted concepts in a continuous manner while maintaining the high generation quality (e.g., text-image alignment) of the model.
摘要：文本到图像的扩散模型已经展现出从输入文本描述生成高质量图像的出色能力。然而，有人担心这些模型可能会创建侵犯版权或描绘令人不安的主题的内容。从这些模型中删除特定概念是解决此问题的一个有希望的潜在解决方案。然而，现有的概念删除方法在需要连续删除概念的实际但具有挑战性的场景中效果不佳。具体而言，这些方法导致在连续删除过程之后文本提示和生成的图像之间的对齐不佳。为了解决这个问题，我们提出了一种称为 CCRT 的新方法，其中包括一个设计的知识蒸馏范式。它通过使用通过我们的遗传算法生成的一组文本提示来限制连续概念删除过程中的文本-图像对齐行为，该算法采用了设计好的模糊测试策略。我们进行了广泛的实验，涉及删除各种概念。通过算法指标和人工研究评估的结果表明，我们的 CCRT 可以有效地连续删除目标概念，同时保持模型的高生成质量（例如，文本-图像对齐）。

Title: Generative LiDAR Editing with Controllable Novel Object Layouts

Authors: Shing-Hei Ho, Bao Thach, Minghan Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00592
Pdf URL: https://arxiv.org/pdf/2412.00592
Copy Paste: [[2412.00592]] Generative LiDAR Editing with Controllable Novel Object Layouts(https://arxiv.org/abs/2412.00592)
Keywords: generation, generative
Abstract: We propose a framework to edit real-world Lidar scans with novel object layouts while preserving a realistic background environment. Compared to the synthetic data generation frameworks where Lidar point clouds are generated from scratch, our framework focuses on new scenario generation in a given background environment, and our method also provides labels for the generated data. This approach ensures the generated data remains relevant to the specific environment, aiding both the development and the evaluation of algorithms in real-world scenarios. Compared with novel view synthesis, our framework allows the creation of counterfactual scenarios with significant changes in the object layout and does not rely on multi-frame optimization. In our framework, the object removal and insertion are supported by generative background inpainting and object point cloud completion, and the entire pipeline is built upon spherical voxelization, which realizes the correct Lidar projective geometry by construction. Experiments show that our framework generates realistic Lidar scans with object layout changes and benefits the development of Lidar-based self-driving systems.
摘要：我们提出了一个框架，用于编辑具有新颖物体布局的真实世界激光雷达扫描，同时保留真实的背景环境。与从头开始生成激光雷达点云的合成数据生成框架相比，我们的框架专注于在给定背景环境中生成新场景，我们的方法还为生成的数据提供标签。这种方法确保生成的数据与特定环境保持相关，有助于在真实场景中开发和评估算法。与新颖的视图合成相比，我们的框架允许创建具有显著物体布局变化的反事实场景，并且不依赖于多帧优化。在我们的框架中，生成背景修复和物体点云完成支持物体移除和插入，整个管道建立在球面体素化之上，通过构造实现正确的激光雷达射影几何。实验表明，我们的框架可以生成具有物体布局变化的真实激光雷达扫描，有利于基于激光雷达的自动驾驶系统的开发。

Title: PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Authors: Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00596
Pdf URL: https://arxiv.org/pdf/2412.00596
Copy Paste: [[2412.00596]] PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation(https://arxiv.org/abs/2412.00596)
Keywords: generation
Abstract: Text-to-video (T2V) generation has been recently enabled by transformer-based diffusion models, but current T2V models lack capabilities in adhering to the real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot be generalizable to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers. The source codes are available at: this https URL.
摘要：文本转视频 (T2V) 生成最近已通过基于变换器的扩散模型实现，但当前的 T2V 模型缺乏遵守现实世界常识和物理规则的能力，因为它们对物理现实的理解有限，并且缺乏时间建模。现有的解决方案要么是数据驱动的，要么需要额外的模型输入，但不能推广到分布外的领域。在本文中，我们提出了 PhyT2V，这是一种新的数据独立的 T2V 技术，通过在 T2V 提示中启用思路链和后退推理，将当前 T2V 模型的视频生成能力扩展到分布外的领域。我们的实验表明，PhyT2V 将现有 T2V 模型对现实世界物理规则的遵守率提高了 2.3 倍，与 T2V 提示增强器相比，实现了 35% 的改进。源代码可在以下网址获得：此 https URL。

Title: A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Authors: Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00623
Pdf URL: https://arxiv.org/pdf/2412.00623
Copy Paste: [[2412.00623]] A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision(https://arxiv.org/abs/2412.00623)
Keywords: generation, generative
Abstract: We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.
摘要：我们引入了一种高斯 Splats 扩散模型 SplatDiffusion，以便从单个图像生成三维结构，解决了将 2D 输入提升到 3D 的不适定性。现有方法依赖于确定性的前馈预测，这限制了它们处理从 2D 数据进行 3D 推理的固有模糊性的能力。扩散模型最近显示出作为 3D 数据（包括高斯 Splats）的强大生成模型的前景；然而，标准扩散框架通常要求目标信号和去噪信号处于相同模态，这在 3D 数据稀缺的情况下具有挑战性。为了克服这个问题，我们提出了一种新颖的训练策略，将去噪模态与监督模态分离。通过使用确定性模型作为噪声老师来创建噪声信号，并从单步过渡到由图像渲染损失监督的多步去噪，我们的方法与确定性老师相比显着提高了性能。此外，我们的方法非常灵活，因为它可以从各种 3D Gaussian Splat (3DGS) 教师那里学习，并且只需进行很少的调整；我们通过超越两个不同的确定性教师模型的性能来证明这一点，突出了我们框架的潜在通用性。我们的方法进一步结合了一种指导机制来聚合来自多个视图的信息，从而在有多个视图可用时提高重建质量。在对象级和场景级数据集上的实验结果证明了我们框架的有效性。

Title: Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis

Authors: Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.00638
Pdf URL: https://arxiv.org/pdf/2412.00638
Copy Paste: [[2412.00638]] Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis(https://arxiv.org/abs/2412.00638)
Keywords: generation
Abstract: Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow motions. To achieve intuitive and detailed control of the generated cinemagraphs, freehand sketches can convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate target stylized landscape images along with realistic versions. Then, a pre-trained object detection model is utilized to segment and obtain masks for the flow regions. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches, together with the prompt, serve as the conditions that control the generated vector fields in the masked fluid regions. To synthesize the cinemagraph frames, the pixels within fluid regions are subsequently warped to the target locations for each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against state-of-the-art generation approaches.
摘要：设计风格化的动图具有挑战性，因为难以定制复杂而富有表现力的流动运动。为了对生成的动图进行直观和详细的控制，手绘草图可以提供比仅使用文本输入更好的解决方案来传达个性化的设计要求。在本文中，我们提出了 Sketch2Cinemagraph，这是一个草图引导框架，可以有条件地从手绘草图生成风格化的动图。Sketch2Cinemagraph 采用文本提示进行初始内容生成，并为空间和运动提示提供手绘草图控制。采用潜在扩散模型生成目标风格化景观图像以及逼真的版本。然后，利用预先训练的物体检测模型对流动区域进行分割并获得蒙版。我们提出了一种新颖的潜在运动扩散模型来估计生成的景观图像流体区域中的运动场。输入的运动草图作为条件，使用提示控制蒙版流体区域中生成的矢量场。为了合成动态摄影帧，随后使用帧生成器将流体区域内的像素扭曲到每个时间步长的目标位置。结果验证了 Sketch2Cinemagraph 可以从直观的草图输入生成具有连续时间流的高保真、美观的风格化动态摄影。我们通过与最先进的生成方法进行定量比较来展示 Sketch2Cinemagraph 的优势。

Title: Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection

Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Lei Tan, Pei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00665
Pdf URL: https://arxiv.org/pdf/2412.00665
Copy Paste: [[2412.00665]] Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection(https://arxiv.org/abs/2412.00665)
Keywords: generation
Abstract: Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust. Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods. To address this issue, we rethink the effectiveness of pre-trained models trained on large-scale, real-world images. Our findings indicate that: 1) Pre-trained models can cluster the features of real images effectively. 2) Models with pre-trained weights can approximate an optimal generalization solution at a specific training step, but it is extremely unstable. Based on these facts, we propose a simple yet effective training method called Learning on Less (LoL). LoL utilizes a random masking mechanism to constrain the model's learning of the unique patterns specific to a certain type of diffusion model, allowing it to focus on less image content. This leverages the inherent strengths of pre-trained weights while enabling a more stable approach to optimal generalization, which results in the extraction of a universal feature that differentiates various diffusion-generated images from real images. Extensive experiments on the GenImage benchmark demonstrate the remarkable generalization capability of our proposed LoL. With just 1% training data, LoL significantly outperforms the current state-of-the-art, achieving a 13.6% improvement in average ACC across images generated by eight different models.
摘要：扩散模型能够生成逼真的图像，但同时也增加了错误信息和损害公众信任的风险。目前，由于现有方法的泛化能力有限，检测由未见过的扩散模型生成的图像仍然具有挑战性。为了解决这个问题，我们重新思考了在真实世界大规模图像上训练的预训练模型的有效性。我们的研究结果表明：1）预训练模型可以有效地聚类真实图像的特征。2）具有预训练权重的模型可以在特定的训练步骤中近似最佳泛化解决方案，但它极不稳定。基于这些事实，我们提出了一种简单而有效的训练方法，称为少学习（LoL）。LoL 利用随机掩蔽机制来限制模型对特定于某种扩散模型的独特模式的学习，使其能够专注于较少的图像内容。这充分利用了预训练权重的固有优势，同时实现了更稳定的最佳泛化方法，从而提取出一种通用特征，将各种扩散生成的图像与真实图像区分开来。在 GenImage 基准上进行的大量实验证明了我们提出的 LoL 具有出色的泛化能力。仅使用 1% 的训练数据，LoL 的表现就明显优于目前最先进的技术，在由八个不同模型生成的图像中实现了 13.6% 的平均 ACC 提升。
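
Note: the random masking mechanism mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the masking granularity or ratio, so the patch size, the `mask_ratio` value, and the function name `random_patch_mask` below are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def random_patch_mask(image, patch=16, mask_ratio=0.75, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    image: array of shape (H, W, C), with H and W divisible by `patch`.
    patch, mask_ratio: hypothetical settings, not taken from the paper.
    Returns the masked image and a boolean patch grid (True = patch kept).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")

    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide, without replacement.
    keep = np.ones(n_patches, dtype=bool)
    keep[rng.choice(n_patches, size=n_masked, replace=False)] = False
    keep_grid = keep.reshape(gh, gw)

    # Expand the patch grid to pixel resolution and apply it to every channel.
    pixel_keep = np.repeat(np.repeat(keep_grid, patch, axis=0), patch, axis=1)
    masked = image * pixel_keep[..., None]
    return masked, keep_grid
```

In a training loop, such a function would typically be applied to each input image before it reaches the pre-trained backbone, so that the detector sees only a fraction of the image content at each step.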

Title: Explaining Object Detectors via Collective Contribution of Pixels

Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00666
Pdf URL: https://arxiv.org/pdf/2412.00666
Copy Paste: [[2412.00666]] Explaining Object Detectors via Collective Contribution of Pixels(https://arxiv.org/abs/2412.00666)
Keywords: generation
Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Since object detectors identify and localize instances by assessing multiple features collectively, generating explanations that capture these collective contributions is critical. However, existing methods focus solely on individual pixel contributions, ignoring the collective contribution of multiple pixels. To address this, we proposed a method for object detectors that considers the collective contribution of multiple pixels. Our approach leverages game-theoretic concepts, specifically Shapley values and interactions, to provide explanations. These explanations cover both bounding box generation and class determination, considering both individual and collective pixel contributions. Extensive quantitative and qualitative experiments demonstrate that the proposed method more accurately identifies important regions in detection results compared to current state-of-the-art methods. The code will be publicly available soon.
摘要：物体检测器的视觉解释对于提高其可靠性至关重要。由于物体检测器通过集体评估多个特征来识别和定位实例，因此生成能够捕捉这些集体贡献的解释至关重要。然而，现有的方法只关注单个像素的贡献，而忽略了多个像素的集体贡献。为了解决这个问题，我们提出了一种考虑多个像素集体贡献的物体检测器方法。我们的方法利用博弈论概念，特别是 Shapley 值和交互来提供解释。这些解释涵盖了边界框生成和类别确定，同时考虑了单个和集体像素的贡献。大量的定量和定性实验表明，与目前最先进的方法相比，所提出的方法能够更准确地识别检测结果中的重要区域。代码将很快公开。
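
Note: the abstract above refers to Shapley values and interactions without giving formulas. The sketch below computes exact Shapley values and the standard pairwise Shapley interaction index for a tiny cooperative game. Treating three "pixels" as players and the toy payoff function `toy_score` are assumptions for illustration; the paper's actual estimator for images is not described in the abstract and is certainly not this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for players 0..n-1.

    value: function mapping a frozenset of players to a real payoff.
    Cost is exponential in n, so this is only usable for tiny games.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of player i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def pairwise_interaction(i, j, n, value):
    """Shapley interaction index between players i and j."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for k in range(len(others) + 1):
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for subset in combinations(others, k):
            s = frozenset(subset)
            # Joint effect of i and j beyond their separate effects.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_score(coalition):
    """Hypothetical detector score: fires only if pixels 0 and 1 are both present."""
    return 1.0 if {0, 1} <= coalition else 0.0
```

In this toy game, neither pixel 0 nor pixel 1 raises the score alone, so an attribution based only on single-pixel effects would miss them; the interaction index between the two is what exposes their collective contribution.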

Title: FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Authors: Yunpeng Bai, Qixing Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00671
Pdf URL: https://arxiv.org/pdf/2412.00671
Copy Paste: [[2412.00671]] FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation(https://arxiv.org/abs/2412.00671)
Keywords: generative
Abstract: Monocular Depth Estimation (MDE) is essential for applications like 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust MDE remains challenging due to noisy real-world data and distribution gaps in synthetic datasets. Existing methods often struggle with low efficiency, reduced accuracy, and lack of detail. To address this, we propose an efficient approach for leveraging diffusion priors and introduce FiffDepth, a framework that transforms diffusion-based image generators into a feedforward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models like DINOv2, FiffDepth achieves enhanced accuracy, stability, and fine-grained detail, offering a significant improvement in MDE performance across diverse real-world scenarios.
摘要：单目深度估计 (MDE) 对于 3D 场景重建、自主导航和 AI 内容创建等应用至关重要。然而，由于现实世界数据嘈杂且合成数据集存在分布差距，稳健的 MDE 仍然具有挑战性。现有方法通常效率低、准确度低且缺乏细节。为了解决这个问题，我们提出了一种利用扩散先验的有效方法，并引入了 FiffDepth，这是一个将基于扩散的图像生成器转换为前馈架构以进行详细深度估计的框架。通过保留关键的生成特征并集成 dinov2 等模型的强大泛化能力，FiffDepth 实现了增强的准确性、稳定性和细粒度细节，从而显著提高了各种现实世界场景中的 MDE 性能。

Title: Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

Authors: Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00684
Pdf URL: https://arxiv.org/pdf/2412.00684
Copy Paste: [[2412.00684]] Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding(https://arxiv.org/abs/2412.00684)
Keywords: generative
Abstract: Visual grounding aims to localize the image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings in this paper. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to identify the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Experimental results show that POBF achieves superior performance across four datasets, delivering an average improvement of 5.83% and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, data ratios, and model architectures.
摘要：视觉接地旨在根据文本查询定位图像区域。鉴于大规模数据管理的难度，我们在本文中研究了如何在数据稀缺的环境下有效地学习视觉接地。为了解决数据稀缺问题，我们提出了一个新框架 POBF（先在框外绘制，然后过滤）。POBF 通过在框外进行修复来合成图像，解决了以前工作中遇到的标签错位问题。此外，POBF 利用创新的过滤方案来识别最有效的训练数据。该方案结合了硬度分数和过度拟合分数，并通过惩罚项进行平衡。实验结果表明，POBF 在四个数据集上取得了优异的性能，平均提高了 5.83%，准确率比领先基线高出 2.29% 至 3.85%。此外，我们验证了 POBF 在各种生成模型、数据比率和模型架构中的稳健性和通用性。

Title: Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Authors: Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00719
Pdf URL: https://arxiv.org/pdf/2412.00719
Copy Paste: [[2412.00719]] Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation(https://arxiv.org/abs/2412.00719)
Keywords: generation
Abstract: Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors.
摘要：说话的头部视频生成旨在生成逼真的说话的头部视频，该视频可保留源图像中的人的身份和驾驶视频中的运动。尽管该领域取得了令人鼓舞的进展，但同时生成具有准确姿势和细粒度面部细节的视频仍然是一个具有挑战性和关键的问题。本质上，面部运动通常非常复杂，难以精确建模，并且由于动态姿势变化，一次性源面部图像无法在生成过程中提供足够的外观指导。为了解决这个问题，我们建议联合学习运动和外观码本并执行多尺度码本补偿，以有效地改进面部运动条件和外观特征以进行说话的面部图像解码。具体而言，设计的多尺度运动和外观码本在统一框架中同时学习，以存储代表性的全局面部运动流和外观模式。然后，我们提出了一种新颖的多尺度运动和外观补偿模块，该模块利用基于变换器的码本检索策略从两个码本中查询互补信息以进行联合运动和外观补偿。整个过程可生成更灵活、外观特征更明显的运动流，不同尺度上的失真更少，从而生成高质量的头部特写视频生成框架。在各种基准上进行的大量实验验证了我们方法的有效性，并且与最先进的竞争对手相比，从定性和定量角度都展示了更出色的生成结果。

Title: Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00733
Pdf URL: https://arxiv.org/pdf/2412.00733
Copy Paste: [[2412.00733]] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks(https://arxiv.org/abs/2412.00733)
Keywords: generation, generative
Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: this https URL.
摘要：现有的肖像动画方法面临着巨大的挑战，特别是在处理非正面视角、在肖像周围渲染动态物体以及生成沉浸式逼真背景方面。在本文中，我们介绍了预训练的基于 Transformer 的视频生成模型的首次应用，该模型具有强大的泛化能力，可以为肖像动画生成高度动态、逼真的视频，有效地解决了这些挑战。采用新的视频主干模型使得以前基于 U-Net 的身份维护、音频调节和视频外推方法变得不适用。为了解决这一限制，我们设计了一个身份参考网络，该网络由因果 3D VAE 与一系列堆叠的 Transformer 层组合而成，可确保视频序列中的面部身份一致。此外，我们研究了各种语音音频调节和运动帧机制，以便生成由语音音频驱动的连续视频。我们的方法通过基准和新提出的野生数据集上的实验进行了验证，与以前的方法相比，它在生成以动态和沉浸式场景中的不同方向为特征的逼真肖像方面有显著的改进。进一步的可视化和源代码可从此 https URL 获得。

Title: CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images

Authors: Jian Liu, Zhen Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00754
Pdf URL: https://arxiv.org/pdf/2412.00754
Copy Paste: [[2412.00754]] CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images(https://arxiv.org/abs/2412.00754)
Keywords: generation, generative
Abstract: The neural radiance field (NeRF) advocates learning the continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this into a generative model, the generative neural radiance field (GRAF) is capable of producing images from random noise z without 3D supervision. In practice, the shape and appearance are modeled by z_s and z_a, respectively, to manipulate them separately during inference. However, it is challenging to represent multiple scenes using a solitary MLP and precisely control the generation of 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model (i.e., CtrlNeRF) that uses a single MLP network to represent multiple scenes with shared weights. Consequently, we manipulated the shape and appearance codes to realize the controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation compared to its counterparts.
摘要：神经辐射场 (NERF) 主张通过多层感知器 (MLP) 学习 3D 几何的连续表示。通过将其集成到生成模型中，生成神经辐射场 (GRAF) 能够在没有 3D 监督的情况下从随机噪声 z 中生成图像。在实践中，形状和外观分别由 z_s 和 z_a 建模，以便在推理过程中分别操纵它们。然而，使用单独的 MLP 表示多个场景并精确控制 3D 几何形状和外观的生成是一项挑战。在本文中，我们介绍了一个可控的生成模型（即 \textbf{CtrlNeRF}），它使用单个 MLP 网络以共享权重表示多个场景。因此，我们操纵形状和外观代码来实现具有 3D 一致性的高保真图像的可控生成。此外，该模型还能够通过相机姿势改变和特征插值来合成训练集中不存在的新视图。我们进行了大量的实验来证明它在 3D 感知图像生成方面相对于同类方法的优势。

Title: Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting

Authors: Linhai Zhuo, Zheng Wang, Yuqian Fu, Tianwen Qian
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00767
Pdf URL: https://arxiv.org/pdf/2412.00767
Copy Paste: [[2412.00767]] Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting(https://arxiv.org/abs/2412.00767)
Keywords: generation
Abstract: The source-free cross-domain few-shot learning (CD-FSL) task aims to transfer pretrained models to target domains utilizing minimal samples, eliminating the need for source domain data. Addressing this issue requires models to have robust generalization abilities and strong feature representation, aligning with the characteristics of large-scale pretrained models. However, large-scale models tend to lose representational ability in cross-domain scenarios due to limited sample diversity. Given the abundant diversity provided by the semantic modality, this paper leverages the textual modality to enhance training sample diversity with the CLIP model, while also improving model transfer efficiency. Specifically, we propose the SeGD-VPT framework, which is divided into two phases. The first phase aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varying input and enhancing sample diversity. Furthermore, we use diversity descriptions of classes to guide semantically meaningful learning of diversity prompts, proposing random combinations and selections of texts to increase textual diversity. Additionally, deep prompt tuning is introduced to enhance the model's transfer capability. After the first phase of training, support samples with different diversity prompts are input into the CLIP backbone to generate enhanced features. After generation, the second phase trains classifiers using the generated features. Extensive experimental results across several benchmarks verify that our method is comparable to SOTA source-utilized models and attains the best performance under the source-free CD-FSL setting.
摘要：无源跨域小样本学习（CD-FSL）任务旨在利用最少的样本将预训练模型迁移到目标域，无需源域数据。解决这个问题需要模型具有强大的泛化能力和强大的特征表示，符合大规模预训练模型的特点。然而，由于样本多样性有限，大规模模型往往会在跨域场景中丧失表示能力。\zlh{鉴于语义模态提供的丰富多样性，本文利用文本模态通过 CLP 模型增强训练样本多样性}，同时提高模型迁移效率。具体而言，我们提出了 SeGD-VPT 框架，该框架分为两个阶段。第一步旨在通过为每个支持样本添加多样性提示来增加特征多样性，从而生成不同的输入并增强样本多样性。此外，我们使用类别的多样性描述来指导多样性提示的语义有意义学习，提出随机组合和选择文本以增加文本多样性。此外，引入深度提示调整以增强模型的迁移能力。经过第一步训练后，将具有不同多样性提示的支持样本输入到 CLIP 主干中以生成增强特征。生成后，第二阶段使用生成的特征训练分类器。在多个基准测试中的大量实验结果验证了我们的方法与 SOTA 源利用模型相当，并在无源 CD-FSL 设置下获得最佳性能。

Title: DIVD: Deblurring with Improved Video Diffusion Model

Authors: Haoyang Long, Yan Wang, Wendong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00773
Pdf URL: https://arxiv.org/pdf/2412.00773
Copy Paste: [[2412.00773]] DIVD: Deblurring with Improved Video Diffusion Model(https://arxiv.org/abs/2412.00773)
Keywords: generation
Abstract: Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shake and object motion. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics, such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have respectively excelled in the fields of image and video generation, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, many improvements based on the video diffusion model are introduced in this work. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time the diffusion model has been applied in video deblurring to overcome the limitations mentioned above.
摘要：视频去模糊由于模糊的复杂性而面临相当大的挑战，模糊通常是由相机抖动和物体运动共同造成的。在视频去模糊领域，许多以前的研究主要集中在基于失真的指标上，例如 PSNR。然而，这种方法往往与人类感知的相关性较弱，并且重建结果缺乏真实感。扩散模型和视频扩散模型分别在图像和视频生成领域表现出色，特别是在图像真实性和真实感知方面取得了显著的成果。然而，由于计算复杂性和适应扩散模型固有的挑战，视频扩散模型在视频去模糊任务中的潜力仍然存在不确定性。为了探索视频扩散模型在视频去模糊任务中的可行性，我们专门为此目的引入了一个扩散模型。在这个领域，利用相邻帧之间高度相关的信息和解决时间错位的挑战是至关重要的研究方向。为了应对这些挑战，本文引入了许多基于视频扩散模型的改进。因此，我们的模型优于现有模型，并在一系列感知指标上取得了最先进的结果。我们的模型在保持有竞争力的失真指标的同时，保留了图像中的大量细节。此外，据我们所知，这是扩散模型首次应用于视频去模糊，以克服上述限制。

Title: Memories of Forgotten Concepts

Authors: Matan Rusanovsky, Shimon Malnick, Amir Jevnisek, Ohad Fried, Shai Avidan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00782
Pdf URL: https://arxiv.org/pdf/2412.00782
Copy Paste: [[2412.00782]] Memories of Forgotten Concepts(https://arxiv.org/abs/2412.00782)
Keywords: generation
Abstract: Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.
摘要：扩散模型主导着文本到图像的生成领域，但它们可能会产生不良输出，包括露骨内容或私人数据。为了缓解这种情况，人们探索了概念消融技术来限制某些概念的生成。在本文中，我们揭示了被擦除的概念信息仍然存在于模型中，并且可以使用正确的潜在信息生成被擦除的概念图像。利用反演方法，我们表明存在能够生成高质量被擦除概念图像的潜在种子。此外，我们表明这些潜在信息的可能性与被擦除概念之外的图像的可能性重叠。我们对此进行了扩展，以证明对于来自被擦除概念集的每个图像，我们都可以生成许多生成被擦除概念的种子。鉴于能够生成消融概念图像的潜在信息空间巨大，我们的结果表明完全擦除概念信息可能是难以解决的，这凸显了当前概念消融技术中可能存在的漏洞。

Title: Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach

Authors: Jingyi Zhao, Yuxuan Ou, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.00807
Pdf URL: https://arxiv.org/pdf/2412.00807
Copy Paste: [[2412.00807]] Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach(https://arxiv.org/abs/2412.00807)
Keywords: generative
Abstract: Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy network guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.
摘要：可电离脂质对于开发用于有效传递信使 RNA (mRNA) 的脂质纳米粒子 (LNP) 至关重要。虽然设计新型可电离脂质的传统方法通常很耗时，但深度生成模型已成为一种强大的解决方案，大大加快了分子发现过程。然而，由于生成的分子结构通常难以合成或不可行，因此出现了一个实际挑战。该项目探索了基于蒙特卡洛树搜索 (MCTS) 的可合成可电离脂质生成模型。利用可合成的脂质构建块数据集和两个专门的预测因子来指导化学空间搜索，我们引入了一种策略网络引导的 MCTS 生成模型，该模型能够生成具有可用合成途径的新型可电离脂质。
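
Note: the abstract above describes a policy-network-guided MCTS but does not state the selection rule. A common choice in such systems is the AlphaZero-style PUCT score, sketched below as an assumption; the dictionary layout, the `c_puct` constant, and the `+ 1` inside the square root (a variant that lets priors break ties before any visits) are illustrative and not taken from the paper.

```python
import math


def puct_select(children, c_puct=1.0):
    """Return the index of the child with the highest PUCT score.

    children: list of dicts with keys
      "prior"     - policy-network probability for this action (hypothetical),
      "visits"    - number of times the child has been explored,
      "value_sum" - sum of backed-up values for the child.
    """
    parent_visits = sum(ch["visits"] for ch in children)
    best_idx, best_score = None, -math.inf
    for idx, ch in enumerate(children):
        # Exploitation term: mean value observed so far (0 if unvisited).
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        # Exploration term: large for high-prior, rarely visited children.
        u = c_puct * ch["prior"] * math.sqrt(parent_visits + 1) / (1 + ch["visits"])
        if q + u > best_score:
            best_idx, best_score = idx, q + u
    return best_idx
```

In a lipid-generation setting, each child would correspond to attaching one building block, with the prior coming from the policy network and the backed-up values coming from the property predictors mentioned in the abstract.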

Title: EventGPT: Event Stream Understanding with Multimodal Large Language Models

Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, Ming Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00832
Pdf URL: https://arxiv.org/pdf/2412.00832
Copy Paste: [[2412.00832]] EventGPT: Event Stream Understanding with Multimodal Large Language Models(https://arxiv.org/abs/2412.00832)
Keywords: generation
Abstract: Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. To mitigate the huge domain gaps, we develop a three-stage optimization paradigm to gradually equip a pre-trained LLM with the capability of understanding event-based scenes. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, referring to LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.
摘要：事件相机将视觉信息记录为异步像素变化流，在光线不足或高动态条件下擅长场景感知。现有的多模态大型语言模型 (MLLM) 专注于自然 RGB 图像，在事件数据更适合的场景中会失败。在本文中，我们介绍了 EventGPT，这是据我们所知第一个用于事件流理解的 MLLM，标志着将大型语言模型 (LLM) 与事件流理解相结合的开创性尝试。为了弥补巨大的领域差距，我们开发了一个三阶段优化范式，逐步使预训练的 LLM 具备理解基于事件的场景的能力。我们的 EventGPT 包括一个事件编码器，然后是一个时空聚合器、一个线性投影仪、一个事件语言适配器和一个 LLM。首先，利用 GPT 生成的 RGB 图像-文本对来预热线性投影仪，称为 LLaVA，因为自然图像和语言模态之间的差距相对较小。其次，我们构建了一个合成的大型数据集 N-ImageNet-Chat，其中包含事件框架和相应的文本，以便使用时空聚合器并训练事件语言适配器，从而使事件特征与语言空间更加紧密地对齐。最后，我们收集了一个指令数据集 Event-Chat，其中包含大量真实世界数据，以微调整个模型，进一步增强其泛化能力。我们构建了一个全面的基准，实验表明 EventGPT 在生成质量、描述准确性和推理能力方面超越了以前最先进的 MLLM。

Title: Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models

Authors: Christian Möller, Niklas Funk, Jan Peters
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00835
Pdf URL: https://arxiv.org/pdf/2412.00835
Copy Paste: [[2412.00835]] Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models(https://arxiv.org/abs/2412.00835)
Keywords: generative
Abstract: Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at this https URL .
摘要：从单一视角进行物体姿态估计仍然是一个具有挑战性的问题。特别是，部分可观测性、遮挡和物体对称性最终会导致姿态模糊。为了解释这种多模态性，这项工作提出了训练基于扩散的生成模型来进行 6D 物体姿态估计。在推理过程中，训练后的生成模型允许采样多个粒子，即姿态假设。为了将这些信息提炼成单个姿态估计，我们提出了两种新颖有效的姿态选择策略，这些策略不需要任何额外的训练或计算密集型操作。此外，虽然许多现有的姿态估计方法主要关注图像域，并且仅结合深度信息进行最终姿态细化，但我们的模型仅对点云数据进行操作。因此，该模型利用了点云处理方面的最新进展，并在 SE(3) 等变潜在空间上运行，该空间构成了粒子选择策略的基础并允许缩短推理时间。我们详尽的实验结果证明了我们的方法在 Linemod 数据集上的竞争性能，并展示了我们设计选择的有效性。代码可在此 https URL 上获得。

Title: AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer

Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00837
Pdf URL: https://arxiv.org/pdf/2412.00837
Copy Paste: [[2412.00837]] AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer(https://arxiv.org/abs/2412.00837)
Keywords: generation
Abstract: Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species dataset leave this problem underexplored. To this end, this paper presents AniMer to estimate animal pose and shape using family aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is this https URL.
摘要：动物行为和生物力学的定量分析需要跨物种准确估计动物姿势和形状，这对动物福利和生物学研究非常重要。然而，以前方法的网络容量小，多物种数据集有限，导致这一问题尚未得到充分探索。为此，本文提出了 AniMer，使用家族感知 Transformer 来估计动物姿势和形状，提高了不同四足动物家族的重建精度。AniMer 的一个关键见解是它集成了基于高容量 Transformer 的主干和动物家族监督对比学习方案，将对各种四足动物形状的判别理解统一在一个框架内。为了有效训练，我们汇总了大多数可用的开源四足动物数据集，带有 3D 或 2D 标签。为了提高 3D 标记数据的多样性，我们引入了 CtrlAni3D，这是一个通过新的基于扩散的条件图像生成管道创建的新型大规模合成数据集。CtrlAni3D 由大约 10k 张带有像素对齐的 SMAL 标签的图像组成。总共，我们获得了 41.3k 张带注释的图像用于训练和验证。因此，家庭感知 Transformer 网络和扩展数据集的结合使 AniMer 不仅在 Animal3D 和 CtrlAni3D 等 3D 数据集上的表现优于现有方法，而且在分布外的 Animal Kingdom 数据集上的表现也优于现有方法。消融研究进一步证明了我们的网络设计和 CtrlAni3D 在增强 AniMer 在野外应用性能方面的有效性。AniMer 的项目页面是这个 https URL。

Title: Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion

Authors: Bohai Gu, Hao Luo, Song Guo, Peiran Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00857
Pdf URL: https://arxiv.org/pdf/2412.00857
Copy Paste: [[2412.00857]] Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion(https://arxiv.org/abs/2412.00857)
Keywords: restoration
Abstract: Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and high time cost. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments in both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods from the perspective of both performance and efficiency.
摘要：最近，基于扩散的方法在视频修复任务中取得了很大的进步。然而，这些方法仍然面临许多挑战，例如保持时间一致性和耗时问题。本文提出了一种使用光流引导高效扩散的高级视频修复框架，称为 FloED。具体而言，FloED 采用双分支架构，其中流分支首先恢复损坏的流，多尺度流适配器为主修复分支提供运动指导。此外，还提出了一种无需训练的潜在插值方法来加速使用流扭曲的多步去噪过程。进一步引入流注意缓存机制，FLoED 有效降低了引入光流带来的计算成本。在背景恢复和物体去除任务中的综合实验表明，FloED 在性能和效率方面均优于最先进的方法。

Title: Deep evolving semi-supervised anomaly detection

Authors: Jack Belham, Aryan Bhosale, Samrat Mukherjee, Biplab Banerjee, Fabio Cuzzolin
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.00860
Pdf URL: https://arxiv.org/pdf/2412.00860
Copy Paste: [[2412.00860]] Deep evolving semi-supervised anomaly detection(https://arxiv.org/abs/2412.00860)
Keywords: generative
Abstract: This paper formalises the task of continual semi-supervised anomaly detection (CSAD), highlighting the importance of a problem formulation that assumes conditions as close to the real world as possible. After an overview of the relevant definitions of continual semi-supervised learning, its components, its anomaly detection extension, and the training protocols, the paper introduces a baseline variational autoencoder (VAE) model that works with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection. The results show that such a use of extreme value theory (EVT) applied to anomaly detection can provide promising results even in comparison to an upper baseline of joint training. The results explore the effects of how much labelled and unlabelled data is present, of which class, and where it is located in the data stream. Outlier rejection shows promising initial results, often surpassing a baseline method of Elastic Weight Consolidation (EWC). A baseline for CSAD is put forward along with the specific dataset setups used, for reproducibility and testability by other practitioners. Future research directions include other CSAD settings and further research into efficient continual hyperparameter tuning.
摘要：本文旨在形式化持续半监督异常检测 (CSAD) 任务，旨在强调这种假设尽可能接近真实世界条件的问题表述的重要性。在概述了持续半监督学习的相关定义、其组成部分、异常检测扩展和训练协议之后，本文介绍了一种用于处理半监督数据的变分自动编码器 (VAE) 基线模型，以及一种具有异常值拒绝功能的深度生成重放持续学习方法。结果表明，将这种极值理论 (EVT) 应用于异常检测可以提供有希望的结果，即使与联合训练的上限相比也是如此。结果探讨了标记和未标记数据的数量、属于哪个类别以及它在数据流中的位置的影响。异常值拒绝显示出有希望的初步结果，它通常超越了弹性权重合并 (EWC) 的基线方法。提出了 CSAD 的基线以及用于其他从业者的可重复性和可测试性的特定数据集设置。未来的研究方向包括其他 CSAD 设置和对高效持续超参数调整的进一步研究。

Title: Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.00876
Pdf URL: https://arxiv.org/pdf/2412.00876
Copy Paste: [[2412.00876]] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification(https://arxiv.org/abs/2412.00876)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficiency of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, decoding with KV cache, and decoding without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by ~75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by ~50% when decoding without KV cache and saves ~50% of GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible degradation in understanding and generation ability, or even performance gains, compared to the full-context inference baselines. Code is available at this https URL.
摘要：多模态大型语言模型 (MLLM) 在视觉理解、推理和交互方面取得了显著的成功。然而，在解码过程中，推理计算和内存随着输出 token 的生成而逐渐增加，直接影响 MLLM 的效率。现有的方法试图减少视觉上下文冗余以实现高效的 MLLM。不幸的是，在预填充阶段减少视觉上下文的效率优势在解码阶段逐渐减弱。为了解决这个问题，我们提出了一个动态视觉语言上下文稀疏化框架 Dynamic-LLaVA，它动态地减少预填充阶段视觉上下文的冗余，并减少解码过程中生成的语言上下文的内存和计算开销。Dynamic-LLaVA 为不同的推理模式（即预填充、有和没有 KV 缓存的解码）设计了量身定制的稀疏化推理方案，以实现 MLLM 的高效推理。在实践中，Dynamic-LLaVA 可以在预填充阶段将计算消耗减少 $\sim$75\%。同时，由于视觉语言上下文稀疏化，在 MLLM 的整个生成过程中，Dynamic-LLaVA 在没有 KV 缓存的情况下解码时减少了 $\sim$50\% 的计算消耗，而在使用 KV 缓存解码时节省了 $\sim$50\% 的 GPU 内存开销。大量实验还表明，与全上下文推理基线相比，Dynamic-LLaVA 实现了对 MLLM 的高效推理，理解和生成能力几乎不下降，甚至性能有所提升。代码可从此 https URL 获得。
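
Note: the idea of dropping redundant context tokens can be illustrated with a minimal sketch. It assumes each token already carries a scalar importance score and simply keeps the top-scoring fraction; the function name `sparsify_tokens`, the scores, and the `keep_ratio` value are hypothetical and are not taken from the paper, whose actual selection mechanism is not described in the abstract.

```python
def sparsify_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving sequence order.

    Illustrative only: the scoring rule and keep_ratio are assumptions.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the n_keep largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Restore the original token order among the survivors.
    return [tokens[i] for i in sorted(top)]


vision_tokens = ["v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
importance = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2]  # made-up scores
kept = sparsify_tokens(vision_tokens, importance, keep_ratio=0.25)
print(kept)
```

The `keep_ratio` of 0.25 is chosen only for illustration; it should not be read as corresponding to the compute reductions reported in the abstract.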

Title: Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration

Authors: Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00878
Pdf URL: https://arxiv.org/pdf/2412.00878
Copy Paste: [[2412.00878]] Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration(https://arxiv.org/abs/2412.00878)
Keywords: restoration, generative
Abstract: Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.
摘要：泛化一直是现实世界图像恢复的核心挑战。虽然最近基于扩散的恢复方法利用文本到图像模型的生成先验，在恢复更真实的细节方面取得了进展，但它们在应用于分布外的现实世界数据时仍然会遇到“生成能力失活”的问题。为了解决这个问题，我们建议使用文本作为辅助不变表示来重新激活这些模型的生成能力。我们首先确定文本输入的两个关键属性：丰富性和相关性，并检查它们对模型性能的各自影响。基于这些见解，我们引入了 Res-Captioner，这是一个模块，可生成针对图像内容和退化程度的增强文本描述，从而有效缓解响应失败。此外，我们还提出了 RealIR，这是一个旨在捕捉各种现实世界场景的新基准。大量实验表明，Res-Captioner 显著增强了基于扩散的恢复模型的泛化能力，同时保持了完全即插即用。

Title: A Deep Generative Model for the Design of Synthesizable Ionizable Lipids

Authors: Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.00928
Pdf URL: https://arxiv.org/pdf/2412.00928
Copy Paste: [[2412.00928]] A Deep Generative Model for the Design of Synthesizable Ionizable Lipids(https://arxiv.org/abs/2412.00928)
Keywords: generation, generative
Abstract: Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the effective delivery of mRNA for vaccines and therapies by protecting it from rapid degradation. Among the components of LNPs, ionizable lipids play a key role in RNA protection and facilitate its delivery into the cytoplasm. However, designing ionizable lipids is complex. Deep generative models can accelerate this process and explore a larger candidate space compared to traditional methods. Due to the structural differences between lipids and small molecules, existing generative models used for small molecule generation are unsuitable for lipid generation. To address this, we developed a deep generative model specifically tailored for the discovery of ionizable lipids. Our model generates novel ionizable lipid structures and provides synthesis paths using synthetically accessible building blocks, addressing synthesizability. This advancement holds promise for streamlining the development of lipid-based delivery systems, potentially accelerating the deployment of new therapeutic agents, including mRNA vaccines and gene therapies.
摘要：脂质纳米颗粒 (LNP) 在现代生物医学中至关重要，它通过保护 mRNA 免于快速降解，从而实现疫苗和疗法的有效递送。在 LNP 的成分中，可电离脂质在 RNA 保护中起着关键作用，并促进其进入细胞质。然而，设计可电离脂质很复杂。与传统方法相比，深度生成模型可以加速这一过程并探索更大的候选空间。由于脂质和小分子之间的结构差异，用于小分子生成的现有生成模型不适合脂质生成。为了解决这个问题，我们开发了一种专门为发现可电离脂质而定制的深度生成模型。我们的模型生成新的可电离脂质结构，并使用可合成的构建块提供合成路径，解决了可合成性问题。这一进步有望简化基于脂质的递送系统的开发，有可能加速新治疗剂（包括 mRNA 疫苗和基因疗法）的部署。

Title: STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Authors: Nicholas Lenzen, Amogh Raut, Andrew Melnik
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2412.00949
Pdf URL: https://arxiv.org/pdf/2412.00949
Copy Paste: [[2412.00949]] STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft(https://arxiv.org/abs/2412.00949)
Keywords: generative
Abstract: Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
摘要：最近，STEVE-1 方法已被引入作为训练生成代理遵循潜在 CLIP 嵌入形式的指令的方法。在这项工作中，我们提出了一种通过学习从新输入模态到代理的潜在目标空间的映射来扩展控制模态的方法。我们将我们的方法应用于具有挑战性的 Minecraft 领域，并扩展目标条件以包括音频模态。由此产生的音频调节代理能够达到与原始文本调节和视觉调节代理相当的水平。具体来说，我们为 Minecraft 创建了一个音频-视频 CLIP 基础模型和一个音频先验网络，它们一起将音频样本映射到 STEVE-1 策略的潜在目标空间。此外，我们强调了在不同模态上进行调节时发生的权衡。我们的训练代码、评估代码和 Minecraft 的音频-视频 CLIP 基础模型都是开源的，以帮助促进对多模态通用顺序决策代理的进一步研究。

Title: WAFFLE: Multimodal Floorplan Understanding in the Wild

Authors: Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.00955
Pdf URL: https://arxiv.org/pdf/2412.00955
Copy Paste: [[2412.00955]] WAFFLE: Multimodal Floorplan Understanding in the Wild(https://arxiv.org/abs/2412.00955)
Keywords: generative
Abstract: Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building's structure -- its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.
摘要：建筑是人类文化的核心特征，越来越多地使用计算方法来分析。然而，最近关于计算建筑理解的研究主要集中在建筑的自然图像上，而忽略了定义建筑结构的基本元素——建筑平面图。相反，现有的关于平面图理解的研究范围极其有限，通常侧重于单一语义类别和地区的平面图（例如来自单个国家的公寓平面图）。在这项工作中，我们介绍了 WAFFLE，这是一个新颖的多模态平面图理解数据集，包含近 20K 张平面图图像和从互联网数据中整理出来的元数据，涵盖各种建筑类型、位置和数据格式。通过使用大型语言模型和多模态基础模型，我们从这些图像及其附带的噪声元数据中整理和提取语义信息。我们表明，WAFFLE 能够在新的建筑理解任务（包括判别任务和生成任务）上取得进展，而这些任务使用以前的数据集是无法实现的。我们将公开发布 WAFFLE 以及我们的代码和经过训练的模型，为研究界提供学习建筑语义的新基础。

Title: Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive

Authors: Zhe Wang, Haozhu Wang, Yanjun Qi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.00979
Pdf URL: https://arxiv.org/pdf/2412.00979
Copy Paste: [[2412.00979]] Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive(https://arxiv.org/abs/2412.00979)
Keywords: generation
Abstract: Decision transformers recast reinforcement learning as a conditional sequence generation problem, offering a simple but effective alternative to traditional value or policy-based methods. A recent key development in this area is the integration of prompting in decision transformers to facilitate few-shot policy generalization. However, current methods mainly use static prompt segments to guide rollouts, limiting their ability to provide context-specific guidance. Addressing this, we introduce a hierarchical prompting approach enabled by retrieval augmentation. Our method learns two layers of soft tokens as guiding prompts: (1) global tokens encapsulating task-level information about trajectories, and (2) adaptive tokens that deliver focused, timestep-specific instructions. The adaptive tokens are dynamically retrieved from a curated set of demonstration segments, ensuring context-aware guidance. Experiments across seven benchmark tasks in the MuJoCo and MetaWorld environments demonstrate the proposed approach consistently outperforms all baseline methods, suggesting that hierarchical prompting for decision transformers is an effective strategy to enable few-shot policy generalization.
摘要：决策转换器将强化学习重塑为条件序列生成问题，为传统价值或基于策略的方法提供了一种简单但有效的替代方案。该领域最近的一个关键发展是将提示集成到决策转换器中，以促进小样本策略泛化。然而，当前的方法主要使用静态提示段来指导部署，限制了它们提供特定于上下文的指导的能力。为了解决这个问题，我们引入了一种通过检索增强实现的分层提示方法。我们的方法学习两层软标记作为指导提示：（1）封装有关轨迹的任务级信息的全局标记，以及（2）提供有针对性的、特定于时间步的指令的自适应标记。自适应标记是从一组精选的演示片段中动态检索的，确保上下文感知的指导。在 MuJoCo 和 MetaWorld 环境中的七个基准测试任务中的实验表明，所提出的方法始终优于所有基线方法，这表明决策转换器的分层提示是实现小样本策略泛化的有效策略。
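
Note: the two-level prompt described above (fixed global tokens plus adaptive tokens retrieved per timestep) can be sketched as a nearest-neighbour lookup. This is a minimal sketch under stated assumptions: cosine similarity as the retrieval metric, plain strings standing in for learned soft tokens, and the names `retrieve_adaptive`, `demo_keys`, and `demo_values` are all hypothetical rather than the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_adaptive(query, demo_keys, demo_values, k):
    """Return the values of the k demonstration entries most similar to the query."""
    ranked = sorted(
        range(len(demo_keys)),
        key=lambda i: cosine(query, demo_keys[i]),
        reverse=True,
    )
    return [demo_values[i] for i in ranked[:k]]


# Toy demonstration set: keys are state features, values stand in for prompt tokens.
demo_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
demo_values = ["segment_a", "segment_b", "segment_c"]

global_prompt = ["task_token"]  # stays fixed for the whole task
adaptive = retrieve_adaptive([0.9, 0.1], demo_keys, demo_values, k=2)
prompt = global_prompt + adaptive  # rebuilt at each timestep from the current state
print(prompt)
```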

Title: Detecting Memorization in Large Language Models

Authors: Eduardo Slonski
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01014
Pdf URL: https://arxiv.org/pdf/2412.01014
Copy Paste: [[2412.01014]] Detecting Memorization in Large Language Models(https://arxiv.org/abs/2412.01014)
Keywords: generation
Abstract: Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions, often lacking precision due to confounding factors like common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate between memorized and not memorized tokens, we train classification probes that achieve near-perfect accuracy. The approach can also be applied to other mechanisms, such as repetition, as demonstrated in this study, highlighting its versatility. Intervening on these activations allows us to suppress memorization without degrading overall performance, enhancing evaluation integrity by ensuring metrics reflect genuine generalization. Additionally, our method supports large-scale labeling of tokens and sequences, crucial for next-generation AI models, improving training efficiency and results. Our findings contribute to model interpretability and offer practical tools for analyzing and controlling internal mechanisms in LLMs.
摘要：大型语言模型 (LLM) 在自然语言处理方面取得了令人瞩目的成果，但容易记住部分训练数据，这可能会损害评估指标、引发隐私问题并限制泛化。检测记忆的传统方法依赖于输出概率或损失函数，由于常见语言模式等混杂因素，通常缺乏精度。在本文中，我们介绍了一种通过检查 LLM 内的神经元激活来精确检测记忆的分析方法。通过识别区分记忆和未记忆标记的特定激活模式，我们训练了实现近乎完美准确度的分类探针。该方法还可以应用于其他机制，例如重复，如本研究所示，突出了其多功能性。干预这些激活使我们能够在不降低整体性能的情况下抑制记忆，通过确保指标反映真正的泛化来增强评估完整性。此外，我们的方法支持对标记和序列进行大规模标记，这对于下一代 AI 模型至关重要，可以提高训练效率和结果。我们的发现有助于模型的可解释性，并为分析和控制 LLM 中的内部机制提供了实用工具。
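
Note: a classification probe of the kind mentioned above can be sketched as a small logistic regression trained on activation vectors. The following is a toy sketch only: the two-dimensional "activations", the labels, and the hyperparameters are made up for illustration, and the abstract does not state which probe architecture or training procedure the author uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(X, y, lr=0.5, steps=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the mean log-loss with respect to the logit.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (memorized) if the probe's logit is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Hypothetical activations: label 1 = memorized token, 0 = non-memorized token.
activations = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.3], [-1.9, 0.0], [-2.1, 0.2], [-1.7, -0.1]]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_probe(activations, labels)
predictions = [predict(w, b, x) for x in activations]
print(predictions)
```

On this linearly separable toy data the probe classifies every example correctly; real activations would need a held-out split to measure accuracy meaningfully.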

Title: FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Authors: Taekyung Ki, Dongchan Min, Gyoungsu Chae
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01064
Pdf URL: https://arxiv.org/pdf/2412.01064
Copy Paste: [[2412.01064]] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait(https://arxiv.org/abs/2412.01064)
Keywords: generation, generative
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
摘要：随着基于扩散的生成模型的快速发展，肖像动画取得了令人瞩目的成果。然而，由于其迭代采样特性，它仍然面临着时间一致性视频生成和快速采样方面的挑战。本文提出了一种基于流匹配生成模型的音频驱动的说话肖像视频生成方法FLOAT。我们将生成模型从基于像素的潜在空间转移到学习到的运动潜在空间，从而能够高效地设计时间一致的运动。为了实现这一点，我们引入了一个基于变压器的矢量场预测器，它具有一个简单而有效的逐帧调节机制。此外，我们的方法支持语音驱动的情感增强，从而能够自然地融入富有表现力的动作。大量实验表明，我们的方法在视觉质量、运动保真度和效率方面均优于最先进的音频驱动的说话肖像方法。
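
Note: sampling from a flow matching model amounts to integrating a vector field from a starting point at t = 0 to a sample at t = 1. The sketch below uses plain Euler steps and a closed-form toy field that points along the straight line toward a known target; in FLOAT the field would be the learned transformer predictor conditioned on audio, which is not reproduced here. The names `velocity` and `euler_sample` are hypothetical.

```python
def velocity(x, t, target):
    """Toy stand-in for a learned vector field.

    For a straight-line path from the start point to `target`, the velocity
    at time t < 1 is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # the largest t used is (n_steps - 1) / n_steps, so 1 - t > 0
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


start = [0.0, 0.0, 0.0]      # e.g. a noise sample in the latent space
goal = [1.0, -2.0, 0.5]      # made-up target latent
sample = euler_sample(start, goal, n_steps=10)
print(sample)
```

Because the toy field is exact for a straight-line path, the integration lands on the target up to floating-point error; a learned field would only approximate this.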

Title: Hiding Faces in Plain Sight: Defending DeepFakes by Disrupting Face Detection

Authors: Delong Zhu, Yuezun Li, Baoyuan Wu, Jiaran Zhou, Zhibo Wang, Siwei Lyu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01101
Pdf URL: https://arxiv.org/pdf/2412.01101
Copy Paste: [[2412.01101]] Hiding Faces in Plain Sight: Defending DeepFakes by Disrupting Face Detection(https://arxiv.org/abs/2412.01101)
Keywords: generation
Abstract: This paper investigates the feasibility of a proactive DeepFake defense framework, FacePoison, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Based on FacePoison, we introduce VideoFacePoison, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy can largely reduce the computational overhead while retaining the favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.
摘要：本文探讨了主动式 DeepFake 防御框架 FacePosion 的可行性，该框架通过破坏人脸检测来防止个人成为 DeepFake 视频的受害者。其动机源于大多数 DeepFake 方法都依赖于人脸检测器来自动从视频中提取受害者人脸进行训练或合成（测试）。一旦人脸检测器发生故障，提取的人脸就会扭曲或不正确，从而破坏 DeepFake 模型的训练或合成。为了实现这一点，我们为此目的专门设计了各种对抗性攻击，并彻底分析了它们的可行性。基于 FacePoison，我们引入了 VideoFacePoison，这是一种在视频帧之间传播 FacePoison 而不是将它们单独应用于每个帧的策略。该策略可以大大减少计算开销，同时保持良好的攻击性能。我们的方法在五个人脸检测器上得到了验证，针对十一个不同的 DeepFake 模型的大量实验证明了破坏人脸检测器以阻止 DeepFake 生成的有效性。

Title: DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Authors: Hao Wu, Zhihang Zhong, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01115
Pdf URL: https://arxiv.org/pdf/2412.01115
Copy Paste: [[2412.01115]] DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding(https://arxiv.org/abs/2412.01115)
Keywords: generation
Abstract: Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.
摘要：图像字幕模型在应用于新数据集时通常会遭受性能下降的困扰，因为它们通常是在特定领域的数据上进行训练的。为了增强在领域外场景中的泛化能力，检索增强方法引起了越来越多的关注。然而，当前的方法面临两个关键挑战：（1）用于检索的图像特征通常基于地面实况 (GT) 字幕进行优化，这些字幕代表了图像的特定视角，并受到注释者偏见的影响；（2）它们没有充分利用检索文本的全部潜力，通常依赖于原始字幕或解析对象，而这些对象无法捕捉数据的完整语义丰富性。在本文中，我们提出了深入检索 (DIR) 方法，旨在增强图像到文本的检索过程和检索文本的利用率，以更全面地理解视觉内容。我们的方法引入了两个关键创新：（1）扩散引导检索增强，其中预训练的扩散模型通过重建噪声图像来引导图像特征学习，从而使模型能够捕获标准注释字幕之外更全面、更细粒度的视觉信息；（2）高质量检索数据库，它提供全面的语义信息以增强字幕生成，尤其是在域外场景中。大量实验表明，DIR 不仅保持了具有竞争力的域内性能，而且还显着提高了域外泛化能力，所有这些都不会增加推理成本。

Title: Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM

Authors: Alejandro Fontan, Javier Civera, Tobias Fischer, Michael Milford
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.01116
Pdf URL: https://arxiv.org/pdf/2412.01116
Copy Paste: [[2412.01116]] Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM(https://arxiv.org/abs/2412.01116)
Keywords: generative
Abstract: Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth -- a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.
摘要：评估对于开发和调整运动恢复结构 (SfM) 和视觉 SLAM (VSLAM) 系统都至关重要，但普遍依赖于高质量的几何地面实况——这种资源不仅成本高昂且耗时，而且在许多情况下完全无法获得。这种对地面实况的依赖限制了 SfM 和 SLAM 在不同环境中的应用，并限制了其在真实场景中的可扩展性。在这项工作中，我们提出了一种新颖的无地面实况 (GTF) 评估方法，该方法消除了对几何地面实况的需求，而是通过从输入图像的原始版本和噪声版本中进行采样来使用敏感度估计。我们的方法与传统的基于地面实况的基准显示出很强的相关性，并支持 GTF 超参数调整。消除对地面实况的需求为利用大量数据集源以及自我监督和在线调整开辟了新的机会，有可能实现类似于生成式 AI 中发生的数据驱动突破。

Title: LoyalDiffusion: A Diffusion Model Guarding Against Data Replication

Authors: Chenghao Li, Yuke Zhang, Dake Chen, Jingqi Xu, Peter A. Beerel
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.01118
Pdf URL: https://arxiv.org/pdf/2412.01118
Copy Paste: [[2412.01118]] LoyalDiffusion: A Diffusion Model Guarding Against Data Replication(https://arxiv.org/abs/2412.01118)
Keywords: generation
Abstract: Diffusion models have demonstrated significant potential in image generation. However, their ability to replicate training data presents a privacy risk, particularly when the training data includes confidential information. Existing mitigation strategies primarily focus on augmenting the training dataset, leaving the impact of diffusion model architecture underexplored. In this paper, we address this gap by examining and mitigating the impact of the model structure, specifically the skip connections in the diffusion model's U-Net. We first present our observation of a trade-off in the skip connections: while they enhance image generation quality, they also reinforce the memorization of training data, increasing the risk of replication. To address this, we propose a replication-aware U-Net (RAU-Net) architecture that incorporates information transfer blocks into skip connections that are less essential for image quality. Recognizing the potential impact of RAU-Net on generation quality, we further investigate and identify specific timesteps during which the impact on memorization is most pronounced. By applying RAU-Net selectively at these critical timesteps, we couple our novel diffusion model with a targeted training and inference strategy, forming a framework we refer to as LoyalDiffusion. Extensive experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method, achieving a 48.63% reduction in replication while maintaining comparable image quality.
摘要：扩散模型在图像生成方面表现出了巨大的潜力。然而，它们复制训练数据的能力带来了隐私风险，尤其是当训练数据包含机密信息时。现有的缓解策略主要侧重于扩充训练数据集，而扩散模型架构的影响尚未得到充分探索。在本文中，我们通过检查和减轻模型结构的影响来解决这一问题，特别是扩散模型的 U-Net 模型中的跳过连接。我们首先介绍我们对跳过连接的权衡的观察。虽然它们提高了图像生成质量，但它们也加强了对训练数据的记忆，增加了复制的风险。为了解决这个问题，我们提出了一种复制感知 U-Net (RAU-Net) 架构，将信息传输块合并到对图像质量不太重要的跳过连接中。认识到 RAU-Net 对生成质量的潜在影响，我们进一步研究并确定了对记忆影响最明显的特定时间步骤。通过在这些关键时间步骤选择性地应用 RAU-Net，我们将新颖的扩散模型与有针对性的训练和推理策略相结合，形成一个我们称之为 LoyalDiffusion 的框架。大量实验表明，LoyalDiffusion 优于最先进的复制缓解方法，在保持同等图像质量的同时，实现了 48.63% 的复制减少。

Title: Object Tracking in a $360^o$ View: A Novel Perspective on Bridging the Gap to Biomedical Advancements

Authors: Mojtaba S. Fazli, Shannon Quinn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01119
Pdf URL: https://arxiv.org/pdf/2412.01119
Copy Paste: [[2412.01119]] Object Tracking in a $360^o$ View: A Novel Perspective on Bridging the Gap to Biomedical Advancements(https://arxiv.org/abs/2412.01119)
Keywords: generation
Abstract: Object tracking is a fundamental tool in modern innovation, with applications in defense systems, autonomous vehicles, and biomedical research. It enables precise identification, monitoring, and spatiotemporal analysis of objects across sequential frames, providing insights into dynamic behaviors. In cell biology, object tracking is vital for uncovering cellular mechanisms, such as migration, interactions, and responses to drugs or pathogens. These insights drive breakthroughs in understanding disease progression and therapeutic interventions. Over time, object tracking methods have evolved from traditional feature-based approaches to advanced machine learning and deep learning frameworks. While classical methods are reliable in controlled settings, they struggle in complex environments with occlusions, variable lighting, and high object density. Deep learning models address these challenges by delivering greater accuracy, adaptability, and robustness. This review categorizes object tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, with a focus on biomedical applications. These methods are essential for tracking cells and subcellular structures, advancing our understanding of health and disease. Key performance metrics, including accuracy, efficiency, and adaptability, are discussed. The paper explores limitations of current methods and highlights emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.
摘要：物体跟踪是现代创新的基本工具，可应用于防御系统、自动驾驶汽车和生物医学研究。它能够跨连续帧对物体进行精确识别、监控和时空分析，从而深入了解动态行为。在细胞生物学中，物体跟踪对于揭示细胞机制（例如迁移、相互作用以及对药物或病原体的反应）至关重要。这些见解推动了对疾病进展和治疗干预的理解取得突破。随着时间的推移，物体跟踪方法已经从传统的基于特征的方法发展为先进的机器学习和深度学习框架。虽然传统方法在受控环境中是可靠的，但它们在具有遮挡、可变照明和高物体密度的复杂环境中会遇到困难。深度学习模型通过提供更高的准确性、适应性和稳健性来解决这些挑战。本综述将物体跟踪技术分为传统、统计、基于特征和机器学习范式，重点关注生物医学应用。这些方法对于跟踪细胞和亚细胞结构至关重要，有助于增进我们对健康和疾病的理解。本文讨论了关键性能指标，包括准确性、效率和适应性。该论文探讨了当前方法的局限性，并强调了新兴趋势，以指导生物医学研究和更广泛的科学领域的下一代跟踪系统的开发。

Title: SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics

Authors: Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2412.01124
Pdf URL: https://arxiv.org/pdf/2412.01124
Copy Paste: [[2412.01124]] SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics(https://arxiv.org/abs/2412.01124)
Keywords: super-resolution
Abstract: Spatial Transcriptomics (ST) is a method that captures spatial gene expression profiles within histological sections. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs), which can improve both the spatial resolution and the gene expression. Concretely, within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative, structure-aware embeddings for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments on a wide range of common ST platforms, SUICA outperforms both conventional INR variants and SOTA methods for ST super-resolution regarding numerical fidelity, statistical correlation, and bio-conservation. The prediction by SUICA also showcases amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code is available at this https URL.
摘要：空间转录组学 (ST) 是一种捕获组织切片内空间基因表达谱的方法。离散的空间分布和超高维测序结果使得 ST 数据难以有效建模。在本文中，我们设法通过所提出的工具 SUICA 以连续和紧凑的方式对 ST 进行建模，该工具借助隐式神经表征 (INR) 的强大近似能力，可以提高空间分辨率和基因表达。具体而言，在所提出的 SUICA 中，我们结合了图形增强自动编码器，以有效地对非结构化点的上下文信息进行建模，并提供具有结构感知的信息嵌入以进行空间映射。我们还以分类回归的方式处理极度倾斜的分布，并强制使用基于分类的损失函数来优化 SUICA。通过对各种常见 ST 平台进行大量实验，SUICA 在数值保真度、统计相关性和生物保护方面优于传统的 INR 变体和用于 ST 超分辨率的 SOTA 方法。 SUICA 的预测还展示了扩增的基因特征，丰富了原始数据的生物保护性并有利于后续分析。代码可在此 https URL 上找到。
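
Note: regression-by-classification, as mentioned above, replaces a continuous target with a class over value bins and recovers a continuous prediction from the predicted class probabilities. The sketch below assumes uniform bins and an expected-value readout; both choices, and the names `make_bins`, `to_class`, and `expected_value`, are hypothetical, since the abstract does not specify how SUICA discretizes expression values.

```python
def make_bins(lo, hi, n_bins):
    """Return the centers of n_bins uniform bins over [lo, hi] and the bin width."""
    width = (hi - lo) / n_bins
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, width


def to_class(value, lo, width, n_bins):
    """Map a continuous value to its bin index, clamping out-of-range values."""
    idx = int((value - lo) / width)
    return min(max(idx, 0), n_bins - 1)


def expected_value(probs, centers):
    """Turn class probabilities back into a continuous estimate."""
    return sum(p * c for p, c in zip(probs, centers))


centers, width = make_bins(0.0, 10.0, 5)       # bin centers: 1, 3, 5, 7, 9
label = to_class(6.3, 0.0, width, 5)            # training target for the classifier
estimate = expected_value([0.0, 0.0, 0.5, 0.5, 0.0], centers)
print(label, estimate)
```

Training on the class index lets a classification loss be used on heavily skewed targets, while the expected-value readout keeps the final output continuous.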

Title: Referring Video Object Segmentation via Language-aligned Track Selection

Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01136
Pdf URL: https://arxiv.org/pdf/2412.01136
Copy Paste: [[2412.01136]] Referring Video Object Segmentation via Language-aligned Track Selection(https://arxiv.org/abs/2412.01136)
Keywords: generation
Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at this https URL.
摘要：引用视频对象分割 (RVOS) 旨在根据自然语言表达对整个视频中的对象进行分割。虽然现有方法在视觉语言对齐方面取得了长足进步，但它们往往忽视了强大的视频对象跟踪的重要性，其中不一致的掩码轨道会破坏视觉语言对齐，导致性能不佳。在这项工作中，我们提出了对象语言对齐选择 (SOLA)，这是一个新颖的框架，它将 RVOS 重新表述为两个子问题，即轨道生成和轨道选择。在轨道生成中，我们利用视觉基础模型 Segment Anything Model 2 (SAM2)，该模型可跨帧生成一致的掩码轨道，从而为前景和背景对象生成可靠的候选对象。对于轨道选择，我们提出了一个轻量但有效的选择模块，它可以对齐视觉和文本特征，同时对视频序列中的对象外观和运动进行建模。这种设计可以实现精确的运动建模和视觉语言对齐。我们的方法在具有挑战性的 MeViS 数据集上实现了最先进的性能，并在 Ref-Youtube-VOS 和 Ref-DAVIS 数据集的零样本设置中展示了出色的结果。此外，SOLA 在损坏的设置（例如添加了高斯噪声或运动模糊的设置）中表现出强大的泛化能力和鲁棒性。我们的项目页面位于此 https URL

Title: TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Authors: Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01137
Pdf URL: https://arxiv.org/pdf/2412.01137
Copy Paste: [[2412.01137]] TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition(https://arxiv.org/abs/2412.01137)
Keywords: generation
Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create the text region, which is less complex than the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data are more accurate than models trained on 4 million instances of existing synthetic data. Moreover, their accuracy gap to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at this https URL.
摘要：场景文本识别 (STR) 面临的挑战是，要么合成训练数据不够逼真，要么难以收集足够的高质量真实世界数据，这限制了训练后的 STR 模型的有效性。同时，尽管基于扩散的文本图像生成方法可以生成整体上有吸引力的文本图像，但它们难以大规模生成准确而逼真的实例级文本。为了解决这个问题，我们引入了 TextSSR：一种通过基于扩散的通用文本区域合成模型合成场景文本识别数据的新框架。它专注于在指定的图像区域内生成文本，并利用丰富的字形和位置信息来创建与整个图像相比不太复杂的文本区域，从而确保准确性。此外，我们利用区域内的相邻文本作为提示来捕获真实世界的字体样式和布局模式，从而引导生成的文本与实际场景相似。最后，由于其无提示性和字符级合成能力，TextSSR 具有出色的可扩展性，我们构建了一个基于字谜的 TextSSR-F 数据集，其中包含 40 万个具有复杂性和真实性的文本实例。实验表明，使用附加的 TextSSR-F 数据训练的模型比使用 400 万现有合成数据训练的模型具有更高的准确率。此外，其准确率与完全使用真实数据集训练的模型的差距不到 3.7%，这证实了 TextSSR 的有效性及其在场景文本图像合成方面的巨大潜力。我们的代码可在此 https URL 上找到。

Title: ControlFace: Harnessing Facial Parametric Control for Face Rigging

Authors: Wooseok Jang, Youngjun Hong, Gunho Cha, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01160
Pdf URL: https://arxiv.org/pdf/2412.01160
Copy Paste: [[2412.01160]] ControlFace: Harnessing Facial Parametric Control for Face Rigging(https://arxiv.org/abs/2412.01160)
Keywords: generation
Abstract: Manipulation of facial images to meet specific controls such as pose, expression, and lighting, also known as face rigging, is a complex task in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. We employ dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, the control mixer module encodes the correlated features between the target-aligned control and reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet's rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace's superior performance in identity preservation and control precision, highlighting its practicality. Please see the project website: this https URL.
摘要：处理面部图像以满足姿势、表情和光照等特定控制（也称为面部绑定）是计算机视觉中的一项复杂任务。现有方法受限于对图像数据集的依赖，这需要针对个人进行微调，并限制了它们保留细粒度身份和语义细节的能力，从而降低了实际可用性。为了克服这些限制，我们引入了 ControlFace，这是一种基于 3DMM 渲染的新型面部绑定方法，可实现灵活、高保真控制。我们采用双分支 U-Net：一个称为 FaceNet，用于捕获身份和精细细节，而另一个则专注于生成。为了提高控制精度，控制混合器模块对目标对齐控制和参考对齐控制之间的相关特征进行编码，并且一种新颖的指导方法（参考控制指导）指导生成过程以更好地遵守控制。通过在面部视频数据集上进行训练，我们充分利用了 FaceNet 的丰富表示，同时确保了控制遵守性。大量实验证明了ControlFace在身份保存和控制精度方面的卓越表现，凸显了它的实用性。请参阅项目网站：这个https网址。

Title: Graph Community Augmentation with GMM-based Modeling in Latent Space

Authors: Shintaro Fukushima, Kenji Yamanishi
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01163
Pdf URL: https://arxiv.org/pdf/2412.01163
Copy Paste: [[2412.01163]] Graph Community Augmentation with GMM-based Modeling in Latent Space(https://arxiv.org/abs/2412.01163)
Keywords: generation, generative
Abstract: This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated from a given graph dataset. Graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important graph structure with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also help data mining models generalize when it is difficult to collect enough real graph data. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while keeping the structure of the original graphs to some extent, so that the generated graphs are realistic. To this end, we propose an algorithm called graph community augmentation (GCA). The key ideas of GCA are (i) to fit a Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in a new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets.
摘要：本研究探讨了使用生成模型生成图的问题。我们特别关注图社区增强问题，即从给定图数据集估计的概率分布中生成具有新社区的未见过或不熟悉的图的问题。图社区增强意味着生成的图具有新社区。有可能发现具有新社区的未见过但重要的图结构，例如在购买者网络等社交网络中。在难以收集足够多的真实图数据的情况下，图社区增强也可能有助于数据挖掘模型的泛化。事实上，有很多方法可以在现有图中生成新社区。我们希望在给定图之外发现具有新社区的新图，同时在一定程度上保留原始图的结构，以使生成的图具有现实性。为此，我们提出了一种称为图社区增强 (GCA) 的算法。 GCA 的核心思想是 (i) 将高斯混合模型 (GMM) 拟合到原始图中节点嵌入的潜在空间中的数据点，以及 (ii) 根据最小描述长度 (MDL) 原则将新簇中的数据点添加到潜在空间中以生成新社区。我们通过实证证明了 GCA 在合成和真实数据集上生成具有新社区结构的图的有效性。

Title: Rectified Flow For Structure Based Drug Design

Authors: Daiheng Zhang, Chengyue Gong, Qiang Liu
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.01174
Pdf URL: https://arxiv.org/pdf/2412.01174
Copy Paste: [[2412.01174]] Rectified Flow For Structure Based Drug Design(https://arxiv.org/abs/2412.01174)
Keywords: generation, generative
Abstract: Deep generative models have achieved tremendous success in structure-based drug design in recent years, especially for generating 3D ligand molecules that bind to a specific protein pocket. Notably, diffusion models have transformed ligand generation by providing exceptional quality and creativity. However, traditional diffusion models are restricted by their conventional learning objectives, which limit their broader applicability. In this work, we propose a new framework, FlowSBDD, which is based on the rectified flow model and allows us to flexibly incorporate additional losses to optimize specific targets and to introduce additional conditions, either as an extra input condition or by replacing the initial Gaussian distribution. Extensive experiments on CrossDocked2020 show that our approach can achieve state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties without specifically designing the binding site, with up to -8.50 Avg. Vina Dock score and 75.0% Diversity.
摘要：近年来，深度生成模型在基于结构的药物设计中取得了巨大成功，尤其是在生成与特定蛋白质口袋结合的 3D 配体分子方面。值得注意的是，扩散模型通过提供卓越的质量和创造力改变了配体的生成。然而，传统的扩散模型受到其传统学习目标的限制，这限制了它们的广泛适用性。在这项工作中，我们提出了一个基于整流流模型的新框架 FlowSBDD，它允许我们灵活地纳入额外的损失来优化特定目标并引入额外的条件作为额外的输入条件或替换初始高斯分布。在 CrossDocked2020 上进行的大量实验表明，我们的方法可以在生成高亲和力分子方面实现最先进的性能，同时保持适当的分子特性，而无需专门设计结合位点，平均 Vina Dock 得分高达 -8.50，多样性高达 75.0%。
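
For readers unfamiliar with rectified flow, the generic training target and sampler can be sketched as below. This is the standard rectified flow recipe on plain vectors, not FlowSBDD's code: the function names are hypothetical, and the "oracle" velocity in the usage lines stands in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity model: the constant direction x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage with an oracle velocity field (assumption: a perfectly learned model).
x0, x1 = [0.0, 0.0], [1.0, -2.0]
oracle = lambda x, t: velocity_target(x0, x1)
sample = euler_sample(oracle, x0, n_steps=4)
```

Because the training paths are straight lines, a well-fitted velocity field can in principle be integrated with few Euler steps, which is the usual motivation for this formulation.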

Title: MeasureNet: Measurement Based Celiac Disease Identification

Authors: Aayush Kumar Tyagi, Vaibhav Mishra, Ashok Tiwari, Lalita Mehra, Prasenjit Das, Govind Makharia, Prathosh AP, Mausam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01182
Pdf URL: https://arxiv.org/pdf/2412.01182
Copy Paste: [[2412.01182]] MeasureNet: Measurement Based Celiac Disease Identification(https://arxiv.org/abs/2412.01182)
Keywords: generative
Abstract: Celiac disease is an autoimmune disorder triggered by the consumption of gluten. It causes damage to the villi, the finger-like projections in the small intestine that are responsible for nutrient absorption. Additionally, the crypts, which form the base of the villi, are also affected, impairing the regenerative process. The deterioration in villi length, computed as the villi-to-crypt length ratio, indicates the severity of celiac disease. However, manual measurement of villi-crypt length can be both time-consuming and susceptible to inter-observer variability, leading to inconsistencies in diagnosis. While some methods can perform measurement as a post-hoc process, they are prone to errors in the initial stages. This gap underscores the need for pathologically driven solutions that enhance measurement accuracy and reduce human error in celiac disease assessments. Our proposed method, MeasureNet, is a pathologically driven polyline detection framework incorporating polyline localization and object-driven losses specifically designed for measurement tasks. Furthermore, we leverage a segmentation model to provide auxiliary guidance about crypt location when crypts are partially visible. To ensure that the model is not overdependent on the segmentation mask, we enhance model robustness through a mask feature mixup technique. Additionally, we introduce a novel dataset for grading celiac disease, consisting of 750 annotated duodenum biopsy images. MeasureNet achieves an 82.66% classification accuracy for binary classification and 81% accuracy for multi-class grading of celiac disease. Code: this https URL
摘要：乳糜泻是一种由麸质摄入引发的自身免疫性疾病。它会损害小肠绒毛，即负责营养吸收的手指状突起。此外，构成绒毛底部的隐窝也会受到影响，从而损害再生过程。绒毛长度的恶化（以绒毛与隐窝长度比计算）表明乳糜泻的严重程度。然而，手动测量绒毛-隐窝长度既耗时又容易受到观察者间差异的影响，导致诊断不一致。虽然有些方法可以将测量作为事后过程进行，但它们在初始阶段容易出错。这一差距凸显了对病理驱动解决方案的需求，这些解决方案可以提高测量准确性并减少乳糜泻评估中的人为错误。我们提出的方法 MeasureNet 是一个病理驱动的折线检测框架，结合了折线定位和对象驱动的损失，专为测量任务而设计。此外，当隐窝部分可见时，我们利用分割模型提供有关隐窝位置的辅助指导。为了确保模型不会过度依赖分割掩码，我们通过掩码特征混合技术增强了模型的鲁棒性。此外，我们引入了一个用于分级乳糜泻的新数据集，该数据集由 750 张带注释的十二指肠活检图像组成。MeasureNet 在二分类中实现了 82.66% 的分类准确率，在乳糜泻的多类分级中实现了 81% 的准确率。代码：此 https URL
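
The villi-to-crypt length ratio described above is straightforward to compute once polylines are available. The sketch below assumes each structure is given as a list of (x, y) points in a common unit; the function names and the example coordinates are illustrative and are not taken from MeasureNet.

```python
import math


def polyline_length(points):
    """Total length of a polyline given as a sequence of (x, y) vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def villus_crypt_ratio(villus, crypt):
    """Ratio of villus length to crypt length for one villus-crypt pair."""
    crypt_len = polyline_length(crypt)
    if crypt_len == 0:
        raise ValueError("crypt polyline has zero length")
    return polyline_length(villus) / crypt_len


# Hypothetical detections in pixel coordinates.
villus = [(0, 0), (3, 4), (3, 10)]   # segments of length 5 and 6
crypt = [(0, 0), (0, 5.5)]           # one segment of length 5.5
ratio = villus_crypt_ratio(villus, crypt)
```

How a ratio maps to a disease grade is a clinical decision that this sketch does not attempt to encode.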

Title: TinyFusion: Diffusion Transformers Learned Shallow

Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01199
Pdf URL: https://arxiv.org/pdf/2412.01199
Copy Paste: [[2412.01199]] TinyFusion: Diffusion Transformers Learned Shallow(https://arxiv.org/abs/2412.01199)
Keywords: generation
Abstract: Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2$\times$ speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at this https URL.
摘要：扩散变压器在图像生成方面表现出了卓越的能力，但通常伴随着过多的参数化，导致实际应用中的推理开销相当大。在这项工作中，我们提出了 TinyFusion，这是一种深度修剪方法，旨在通过端到端学习从扩散变压器中去除冗余层。我们方法的核心原则是创建一个具有高可恢复性的修剪模型，使其在微调后恢复强大的性能。为了实现这一点，我们引入了一种可微分采样技术，使修剪可学习，并结合一个共同优化的参数来模拟未来的微调。虽然之前的工作重点是最大限度地减少修剪后的损失或错误，但我们的方法明确地建模和优化了修剪模型的后微调性能。实验结果表明，这种可学习的范式为扩散变压器的层修剪提供了巨大的好处，超越了现有的基于重要性和基于错误的方法。此外，TinyFusion 在 DiT、MAR 和 SiT 等不同架构中表现出很强的泛化能力。使用 DiT-XL 进行的实验表明，TinyFusion 可以以不到预训练成本 7% 的成本制作浅层扩散变压器，实现 2$\times$ 的加速，FID 得分为 2.86，优于具有同等效率的竞争对手。代码可在此 https URL 上找到。

Title: Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Authors: Wenxin Su, Song Tang, Xiaofeng Liu, Xiaojing Yi, Mao Ye, Chunxiao Zu, Jiahao Li, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01203
Pdf URL: https://arxiv.org/pdf/2412.01203
Copy Paste: [[2412.01203]] Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data(https://arxiv.org/abs/2412.01203)
Keywords: generation, generative
Abstract: Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way: learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch sizes.
摘要：域转移（源域和目标域之间的差异）对临床应用（例如糖尿病视网膜病变 (DR) 分级）提出了重大挑战。尽管考虑到某些临床要求（例如源数据隐私），但传统的传输方法主要是以模型为中心，并且通常难以防止针对模型的攻击。在本文中，我们解决了由临床环境需求驱动的具有挑战性的在线模型不可知域自适应 (OMG-DA) 设置。此设置的特点是没有模型和目标数据流。为了应对新的挑战，我们提出了一种新方法，即生成非对抗性示例 (GUES)，它能够从以数据为中心的角度进行适应。具体而言，我们首先从理论上以生成方式重新表述传统的扰动优化——学习具有潜在输入变量的扰动生成函数。在模型实例化期间，我们利用变分自动编码器来表达此函数。使用重新参数化技巧的编码器预测潜在输入，而解码器负责生成。此外，显著性图被选为伪扰动标签。因为它不仅可以捕获潜在病变，而且理论上还可以为函数输入提供上限，从而能够识别潜在变量。在 DR 基准上使用冻结预训练模型和可训练模型进行的大量比较实验证明了 GUES 的优越性，即使在小批量的情况下也表现出稳健性。
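
The reparameterization trick mentioned in the abstract is the standard VAE device for sampling a latent variable in a way that keeps the encoder outputs differentiable. A minimal sketch on plain Python lists follows; it is not GUES code, and the log-variance parameterization is a common convention assumed here rather than something the abstract specifies.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var).

    The randomness lives entirely in eps, so z is a deterministic function of
    the encoder outputs (mu, log_var) and gradients can pass through them.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)  # fixed seed so the example is repeatable
z = reparameterize(mu=[0.0, 1.0], log_var=[0.0, -2.0], rng=rng)
```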

Title: PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

Authors: Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01223
Pdf URL: https://arxiv.org/pdf/2412.01223
Copy Paste: [[2412.01223]] PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control(https://arxiv.org/abs/2412.01223)
Keywords: generation
Abstract: Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems in terms of semantic consistency between images and text, and the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that closely aligns with the user input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the MASK generation algorithm in the training and testing datasets to simulate the user's habit of applying MASK, and introduce a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.
摘要：近年来，扩散模型在图像修复领域表现出色。基于扩散模型的修复方法通常可以针对被遮盖的区域生成逼真的高质量图像内容。然而，由于扩散模型的局限性，现有方法通常会遇到图像与文本语义一致性以及用户编辑习惯方面的问题。针对这些问题，我们提出了 PainterNet 插件，可以灵活地嵌入到各种扩散模型中。为了在被遮盖的区域中生成与用户输入提示高度一致的图像内容，我们提出了局部提示输入、注意力控制点 (ACP) 和实际标记注意力损失 (ATAL) 来增强模型对局部区域的关注。此外，我们重新设计了训练和测试数据集中的 MASK 生成算法以模拟用户应用 MASK 的习惯，并引入了定制的新训练数据集 PainterData 和基准数据集 PainterBench。我们广泛的实验分析表明，PainterNet 在图像质量和全局/局部文本一致性等关键指标上超越了现有的最先进模型。

Title: Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Authors: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01240
Pdf URL: https://arxiv.org/pdf/2412.01240
Copy Paste: [[2412.01240]] Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes(https://arxiv.org/abs/2412.01240)
Keywords: generation
Abstract: As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concept segmentation, potentially informing the development of the next version, SAM 3.
摘要：作为基础模型，SAM 显著影响了计算机视觉的多个领域，其升级版本 SAM 2 增强了视频分割功能，有望再次产生重大影响。虽然 SAM（SAM 和 SAM 2）在分割人、车和道路等与上下文无关的概念方面表现出色，但它们忽略了更具挑战性的上下文相关 (CD) 概念，例如视觉显著性、伪装、产品缺陷和医学病变。CD 概念严重依赖于全局和局部上下文信息，因此容易受到不同上下文变化的影响，这需要模型具有强大的判别能力。缺乏对 SAM 的全面评估限制了对其性能边界的理解，这可能会阻碍未来模型的设计。在本文中，我们对自然、医疗和工业场景中各种视觉模态的 2D 和 3D 图像和视频中的 11 个 CD 概念的 SAM 进行了彻底的定量评估。我们为 SAM 和 SAM 2 开发了一个统一的评估框架，该框架支持手动、自动和中级自我提示，并借助我们特定的提示生成和交互策略。我们进一步探索 SAM 2 在上下文学习中的潜力，并引入提示稳健性测试来模拟现实世界中不完美的提示。最后，我们分析了 SAM 在理解 CD 概念方面的优势和局限性，并讨论了它们在分割任务中的未来发展。这项工作旨在提供有价值的见解，以指导未来在上下文无关和上下文相关概念分割方面的研究，并可能为下一个版本 SAM 3 的开发提供参考。

Title: Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Authors: Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01243
Pdf URL: https://arxiv.org/pdf/2412.01243
Copy Paste: [[2412.01243]] Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation(https://arxiv.org/abs/2412.01243)
Keywords: generation
Abstract: Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
摘要：扩散和流动模型在文本到图像生成等各种应用中取得了显著的成功。然而，这些模型在推理过程中通常依赖于相同的预定去噪计划，这可能会限制推理效率以及处理不同提示时的灵活性。在本文中，我们认为最佳噪声计划应该适应每个推理实例，并引入时间预测扩散模型 (TPDM) 来实现这一点。TPDM 采用即插即用的时间预测模块 (TPM)，可根据每个去噪步骤的当前潜在特征预测下一个噪声级别。我们使用强化学习训练 TPM，旨在最大化奖励，该奖励会根据去噪步骤的数量来降低最终图像质量。借助这种自适应调度程序，TPDM 不仅可以生成与人类偏好紧密相关的高质量图像，还可以动态调整去噪步骤的数量和时间，从而提高性能和效率。我们在多个扩散模型基准上训练 TPDM。凭借稳定扩散 3 中等架构，TPDM 的美学得分达到 5.44，人类偏好得分 (HPS) 达到 29.59，同时使用约 50% 更少的去噪步骤来实现更好的性能。我们将与本文一起发布我们的最佳模型。
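
The abstract describes a reward that discounts final image quality by the number of denoising steps. One plausible reading is an exponential discount, sketched below. The functional form, the `gamma` value, and the function name are assumptions for illustration; the abstract does not state the exact formula TPDM uses.

```python
def discounted_reward(quality, n_steps, gamma=0.97):
    """Reward for one generated image under an assumed exponential discount.

    quality: score of the final image from some quality or preference model.
    n_steps: number of denoising steps the scheduler chose to take.
    gamma:   per-step discount in (0, 1]; smaller values favour shorter schedules.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return (gamma ** n_steps) * quality


# Same final quality, different schedule lengths: the shorter one earns more.
short = discounted_reward(quality=5.4, n_steps=15)
long = discounted_reward(quality=5.4, n_steps=28)
```

Under this form, a longer schedule is only worthwhile when the extra steps raise the quality score by more than the discount they incur.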

Title: Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01244
Pdf URL: https://arxiv.org/pdf/2412.01244
Copy Paste: [[2412.01244]] Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization(https://arxiv.org/abs/2412.01244)
Keywords: generation
Abstract: As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacement in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.
摘要：随着大规模扩散模型的不断发展，它们擅长生成高质量的图像，但通常会生成不受欢迎的内容，例如色情或暴力内容。现有的概念去除方法通常会指导图像生成过程，但可能会无意中修改不相关的区域，从而导致与原始模型不一致。我们提出了一种在扩散模型中进行有针对性的概念替换的新方法，可以在不影响非目标区域的情况下删除特定概念。我们的方法引入了一个专用的概念定位器，用于在去噪过程中精确识别目标概念，该定位器经过少样本学习训练，需要的标记数据最少。在识别的区域内，我们引入了一个无需训练的双提示交叉注意 (DPCA) 模块来替代目标概念，确保对周围内容的干扰最小。我们评估了我们的方法的概念定位精度和替换效率。实验结果表明，我们的方法在定位目标概念方面实现了卓越的精度，并且在对非目标区域的影响最小的情况下执行了连贯的概念替换，优于现有方法。

Title: Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

Authors: Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01245
Pdf URL: https://arxiv.org/pdf/2412.01245
Copy Paste: [[2412.01245]] Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective(https://arxiv.org/abs/2412.01245)
Keywords: generative
Abstract: Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.
摘要：生成模型，尤其是扩散模型，在多模态数据密度估计方面取得了显著成功，引起了强化学习 (RL) 社区的极大兴趣，尤其是在连续动作空间中的策略建模方面。然而，现有的研究在训练方案和 RL 优化目标方面表现出显著差异，并且一些方法仅适用于扩散模型。在本研究中，我们比较和分析了各种生成策略训练和部署技术，确定并验证了生成策略算法的有效设计。具体来说，我们重新审视现有的训练目标并将它们分为两类，每类都与一种更简单的方法相关。第一种方法，生成模型策略优化 (GMPO)，采用原生优势加权回归公式作为训练目标，这比以前的方法简单得多。第二种方法，生成模型策略梯度 (GMPG)，提供了原生策略梯度方法的数值稳定实现。我们引入了一个名为 GenerativeRL 的标准化实验框架。我们的实验表明，所提出的方法在各种离线 RL 数据集上实现了最先进的性能，为训练和部署生成策略提供了统一且实用的指南。
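
Advantage-weighted regression, which the abstract names as the basis of GMPO, reweights a behaviour-cloning style loss by exponentiated advantages. The sketch below shows that generic weighting only; `beta`, the weight clipping, and the function names are illustrative assumptions and not the GenerativeRL API.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=100.0):
    """Per-sample weights exp(beta * A), clipped at max_weight for stability.

    The exponent is clamped before calling exp so that very large advantages
    cannot overflow.
    """
    cap = math.log(max_weight)
    return [math.exp(min(beta * a, cap)) for a in advantages]


def weighted_loss(per_sample_losses, weights):
    """Mean of per-sample generative-model losses, scaled by the AWR weights."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)


# Hypothetical batch: advantages A = Q(s, a) - V(s) and per-sample model losses.
advantages = [0.0, 1.0, -1.0]
losses = [2.0, 2.0, 2.0]
loss = weighted_loss(losses, awr_weights(advantages, beta=1.0))
```

Actions with higher advantage contribute more to the loss, so the generative policy is pulled toward them while still being trained with its ordinary likelihood or score-matching objective.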

Title: EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

Authors: Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01254
Pdf URL: https://arxiv.org/pdf/2412.01254
Copy Paste: [[2412.01254]] EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation(https://arxiv.org/abs/2412.01254)
Keywords: generation
Abstract: This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.

Title: MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Authors: Shan Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01284
Pdf URL: https://arxiv.org/pdf/2412.01284
Copy Paste: [[2412.01284]] MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model(https://arxiv.org/abs/2412.01284)
Keywords: generation
Abstract: Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.

Title: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Authors: Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01316
Pdf URL: https://arxiv.org/pdf/2412.01316
Copy Paste: [[2412.01316]] Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation(https://arxiv.org/abs/2412.01316)
Keywords: generation
Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: this https URL.
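
As a rough illustration of the segmented cross-attention idea described in this abstract, the sketch below splits a sequence of per-frame hidden states into contiguous temporal segments and lets each segment attend only to its own sub-caption. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal-length segmentation, and the omission of the query/key/value projections of a real DiT block are all simplifications chosen for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Scaled dot-product attention of one query vector over caption tokens.

    Assumption: keys and values are the raw token vectors (no learned
    projections), which keeps the sketch parameter-free.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def segmented_cross_attention(hidden, sub_captions):
    """Each temporal segment of `hidden` attends to its matching sub-caption.

    hidden:       list of T frame vectors (the temporal dimension)
    sub_captions: list of K token matrices, one per segment (hypothetical layout)
    """
    n_segments = len(sub_captions)
    seg_len = math.ceil(len(hidden) / n_segments)
    out = []
    for t, frame in enumerate(hidden):
        seg = min(t // seg_len, n_segments - 1)  # index of the segment holding frame t
        out.append(cross_attend(frame, sub_captions[seg]))
    return out
```

With four frames and two sub-captions, frames 0-1 see only the first sub-caption and frames 2-3 see only the second, which is the property the abstract attributes to SCA.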

Title: Negative Token Merging: Image-based Adversarial Feature Guidance

Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01339
Pdf URL: https://arxiv.org/pdf/2412.01339
Copy Paste: [[2412.01339]] Negative Token Merging: Image-based Adversarial Feature Guidance(https://arxiv.org/abs/2412.01339)
Keywords: generation
Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement in just a few lines of code, incurs only marginally higher (<4%) inference time, and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at this https URL
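
The phrase "selectively pushing apart matching semantic features" can be pictured with a small sketch: for each output token, find the most similar reference token and, if the match is strong enough, move the output token further away from it. This is only an illustration of the general idea. The cosine-similarity matching, the linear extrapolation rule, and the `alpha` and `threshold` parameters are assumptions made here; the abstract does not specify the exact update used by NegToMe.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def push_apart(output_tokens, reference_tokens, alpha=0.1, threshold=0.5):
    """Move each output token away from its best-matching reference token.

    alpha:     hypothetical step size of the extrapolation
    threshold: hypothetical minimum similarity for a match to count
    """
    pushed = []
    for tok in output_tokens:
        best = max(reference_tokens, key=lambda ref: cosine(tok, ref))
        if cosine(tok, best) >= threshold:
            # Linear extrapolation away from the matched reference feature.
            tok = [x + alpha * (x - r) for x, r in zip(tok, best)]
        pushed.append(list(tok))
    return pushed
```

Tokens without a sufficiently similar reference match are left untouched, which reflects the "selective" part of the description.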

Title: MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Authors: Xiaomin Li, Xu Jia, Qinghe Wang, Haiwen Diao, Mengmeng Ge, Pengxiang Li, You He, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01343
Pdf URL: https://arxiv.org/pdf/2412.01343
Copy Paste: [[2412.01343]] MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models(https://arxiv.org/abs/2412.01343)
Keywords: generation
Abstract: Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.

Title: An overview of diffusion models for generative artificial intelligence

Authors: Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01371
Pdf URL: https://arxiv.org/pdf/2412.01371
Copy Paste: [[2412.01371]] An overview of diffusion models for generative artificial intelligence(https://arxiv.org/abs/2412.01371)
Keywords: generation, generative
Abstract: This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.
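
For readers new to DDPMs, the forward (noising) process has a well-known closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the running product of (1 - beta_s) and eps is standard Gaussian noise. The sketch below samples from it directly. The linear schedule from 1e-4 to 0.02 is a commonly used default and is an assumption here, not something stated in the abstract; the function names are likewise illustrative.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = prod_{s<=t} (1 - beta_s), the signal fraction kept at step t."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, noise=None):
    """Draw x_t given x_0 in one shot using the closed-form forward process."""
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
```

Because abar_t shrinks monotonically toward zero, late timesteps are almost pure noise, which is what lets the reverse (denoising) process start from a Gaussian sample.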

Title: Hierarchical VAE with a Diffusion-based VampPrior

Authors: Anna Kuzina, Jakub M. Tomczak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01373
Pdf URL: https://arxiv.org/pdf/2412.01373
Copy Paste: [[2412.01373]] Hierarchical VAE with a Diffusion-based VampPrior(https://arxiv.org/abs/2412.01373)
Keywords: generative
Abstract: Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.

Title: Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

Authors: Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2412.01380
Pdf URL: https://arxiv.org/pdf/2412.01380
Copy Paste: [[2412.01380]] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking(https://arxiv.org/abs/2412.01380)
Keywords: generation
Abstract: While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and 40% increase in throughput with <0.1 loss in perplexity.
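
To make the bandwidth argument concrete, the sketch below shows magnitude-based input pruning for a single matrix-vector product: only the `keep` largest-magnitude inputs are used, so only the matching weight columns would need to be fetched from memory. This is a generic top-k illustration under stated assumptions, not the DIP algorithm itself; the abstract does not give DIP's selection rule, and the function names and the `keep` parameter are hypothetical.

```python
def top_k_indices(x, keep):
    """Indices of the `keep` largest-magnitude entries of x, in ascending order."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    return sorted(order[:keep])


def pruned_matvec(weights, x, keep):
    """Approximate weights @ x using only the top-`keep` inputs by magnitude.

    Columns of `weights` for pruned inputs are never read, which is where
    the memory-traffic saving would come from on bandwidth-bound hardware.
    """
    active = top_k_indices(x, keep)
    return [sum(row[i] * x[i] for i in active) for row in weights]
```

No predictor network is involved: the mask is computed from the current input alone, which matches the "predictor-free" description at a very coarse level.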

Title: Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data

Authors: Ivan DeAndres-Tame, Ruben Tolosana, Pietro Melzi, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Luis F. Gomez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zhizhou Zhong, Yuge Huang, Yuxi Mi, Shouhong Ding, Shuigeng Zhou, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Zhihong Xiao, Evgeny Smirnov, Anton Pimenov, Aleksei Grigorev, Denis Timoshenko, Kaleb Mesfin Asfaw, Cheng Yaw Low, Hao Liu, Chuyi Wang, Qing Zuo, Zhixiang He, Hatef Otroshi Shahreza, Anjith George, Alexander Unnervik, Parsa Rahimi, Sébastien Marcel, Pedro C. Neto, Marco Huber, Jan Niklas Kolf, Naser Damer, Fadi Boutros, Jaime S. Cardoso, Ana F. Sequeira, Andrea Atzori, Gianni Fenu, Mirko Marras, Vitomir Štruc, Jiang Yu, Zhangjie Li, Jichun Li, Weisong Zhao, Zhen Lei, Xiangyu Zhu, Xiao-Yu Zhang, Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01383
Pdf URL: https://arxiv.org/pdf/2412.01383
Copy Paste: [[2412.01383]] Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data(https://arxiv.org/abs/2412.01383)
Keywords: generative
Abstract: Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.

Title: HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01407
Pdf URL: https://arxiv.org/pdf/2412.01407
Copy Paste: [[2412.01407]] HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving(https://arxiv.org/abs/2412.01407)
Keywords: generation, generative
Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple kinds of input modality, usually cameras and LiDARs, where they contain complementary information for generation, while existing generation methods ignore this crucial feature, resulting in the generated results only covering separate 2D or 3D information. In order to fill the gap in 2D-3D multi-modal joint generation for autonomous driving, in this paper, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projecting from image space to BEV space, then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate our method leads to significant performance gains over SOTA methods in terms of generation metrics.

Title: FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

Authors: Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01427
Pdf URL: https://arxiv.org/pdf/2412.01427
Copy Paste: [[2412.01427]] FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration(https://arxiv.org/abs/2412.01427)
Keywords: restoration
Abstract: Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: real-world samples at a larger scale, and degradation types with higher diversity. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

Title: CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01429
Pdf URL: https://arxiv.org/pdf/2412.01429
Copy Paste: [[2412.01429]] CPA: Camera-pose-awareness Diffusion Transformer for Video Generation(https://arxiv.org/abs/2412.01429)
Keywords: generation
Abstract: Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.

Title: DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model

Authors: Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01440
Pdf URL: https://arxiv.org/pdf/2412.01440
Copy Paste: [[2412.01440]] DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model(https://arxiv.org/abs/2412.01440)
Keywords: generation, generative
Abstract: Physical adversarial patches printed on clothing can easily allow individuals to evade person detectors. However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing. Although existing methods using generative adversarial networks or diffusion models can produce more natural-looking patches, they often struggle to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these challenges, we propose a novel diffusion-based customizable patch generation framework termed DiffPatch, specifically tailored for creating naturalistic and customizable adversarial patches. Our approach enables users to utilize a reference image as the source, rather than starting from random noise, and incorporates masks to craft naturalistic patches of various shapes, not limited to squares. To prevent the original semantics from being lost during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Notably, while maintaining a natural appearance, our method achieves a comparable attack performance to state-of-the-art non-naturalistic patches when using similarly sized attacks. Using DiffPatch, we have created a physical adversarial T-shirt dataset, AdvPatch-1K, specifically targeting YOLOv5s. This dataset includes over a thousand images across diverse scenarios, validating the effectiveness of our attack in real-world environments. Moreover, it provides a valuable resource for future research.

Title: Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond

Authors: MD Raqib Khan, Anshul Negi, Ashutosh Kulkarni, Shruti S. Phutke, Santosh Kumar Vipparthi, Subrahmanyam Murala
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2412.01456
Pdf URL: https://arxiv.org/pdf/2412.01456
Copy Paste: [[2412.01456]] Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond(https://arxiv.org/abs/2412.01456)
Keywords: restoration
Abstract: Quality degradation is observed in underwater images due to the effects of light refraction and absorption by water, leading to issues like color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Through extensive ablation studies and comparative analysis, it is clear that the proposed approach outperforms existing state-of-the-art (SOTA) methods.

Title: SerialGen: Personalized Image Generation by First Standardization Then Personalization

Authors: Cong Xie, Han Zou, Ruiqi Yu, Yan Zhang, Zhenpeng Zhan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01485
Pdf URL: https://arxiv.org/pdf/2412.01485
Copy Paste: [[2412.01485]] SerialGen: Personalized Image Generation by First Standardization Then Personalization(https://arxiv.org/abs/2412.01485)
Keywords: generation
Abstract: In this work, we are interested in achieving both high text controllability and overall appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework's ability to produce personalized images that faithfully recover the reference image's overall appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term "Serial" in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.

Title: RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications

Authors: Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Maciej A. Mazurowski
Subjects: cs.CV, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2412.01496
Pdf URL: https://arxiv.org/pdf/2412.01496
Copy Paste: [[2412.01496]] RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications(https://arxiv.org/abs/2412.01496)
Keywords: generation, generative
Abstract: Determining whether two sets of images belong to the same or different domain is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that often results in decreased model performance. This determination is also important to evaluate the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.
摘要：确定两组图像是属于同一领域还是不同领域是现代医学图像分析和深度学习中的一项关键任务，其中领域转移是一个常见问题，通常会导致模型性能下降。这一确定对于评估生成模型的输出质量也很重要，例如用于缓解领域转移的图像到图像转换模型。当前的指标要么依赖于某些下游任务（例如分割）的（可能有偏差的）选择，要么采用来自自然成像的独立于任务的感知指标（例如 FID），这些指标不足以捕捉医学图像中的解剖一致性和真实感。我们引入了一种针对医学图像量身定制的新感知指标：放射特征距离 (RaD)，它利用标准化、具有临床意义且可解释的图像特征。我们在各种实验中表明，RaD 优于其他域外 (OOD) 检测指标。此外，RaD 与下游任务性能以及解剖一致性和真实性有更强的关联，因此在图像到图像转换方面优于以前的感知指标（FID、KID 等），并且在评估无条件图像生成方面表现出类似的效用。RaD 还具有其他优势，例如可解释性以及在低样本量下的稳定性和计算效率。我们的结果得到了涵盖四个多领域医学图像数据集、九个下游任务、六个图像转换模型和其他因素的广泛实验的支持，凸显了 RaD 在医学图像分析方面的广泛潜力。
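The abstract above describes comparing two image sets through a distance between their feature distributions. As a minimal sketch of that general idea, the block below computes a Fréchet-style distance between two sets of precomputed feature vectors, under the simplifying assumption of independent (diagonal-covariance) Gaussian features. The function names are hypothetical, the feature values are made up for illustration, and whether RaD uses this exact formula is an assumption that the abstract does not confirm.

```python
import math


def column_stats(features):
    """Return per-feature means and (population) standard deviations."""
    n = len(features)
    dim = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in features) / n)
        for j in range(dim)
    ]
    return means, stds


def diagonal_frechet_distance(features_a, features_b):
    """Fréchet distance between two Gaussians with diagonal covariances.

    This is a simplification (assumption): full FID-style metrics use the
    complete covariance matrices and a matrix square root.
    """
    mu_a, sd_a = column_stats(features_a)
    mu_b, sd_b = column_stats(features_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return math.sqrt(mean_term + spread_term)


# Toy feature sets (illustrative values only, not real radiomic features).
domain_a = [[0.0, 0.0], [2.0, 2.0]]
domain_b = [[1.0, 1.0], [3.0, 3.0]]
print(diagonal_frechet_distance(domain_a, domain_a))  # identical sets give 0.0
print(diagonal_frechet_distance(domain_a, domain_b))
```

A larger value suggests the two sets are further apart in feature space. In practice the feature vectors would come from a feature extractor applied to each image, which is outside the scope of this sketch.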

Title: Structured 3D Latents for Scalable and Versatile 3D Generation

Authors: Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01506
Pdf URL: https://arxiv.org/pdf/2412.01506
Copy Paste: [[2412.01506]] Structured 3D Latents for Scalable and Versatile 3D Generation(https://arxiv.org/abs/2412.01506)
Keywords: generation
Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
摘要：我们引入了一种新颖的 3D 生成方法，用于创建多功能、高质量的 3D 资产。其基石是统一的结构化 LATent (SLAT) 表示，它允许解码为不同的输出格式，例如辐射场、3D 高斯和网格。这是通过将稀疏填充的 3D 网格与从强大的视觉基础模型中提取的密集多视图视觉特征相结合来实现的，全面捕获结构（几何）和纹理（外观）信息，同时在解码过程中保持灵活性。我们使用为 SLAT 量身定制的整流变压器作为我们的 3D 生成模型，并在包含 500K 个不同对象的大型 3D 资产数据集上训练具有多达 20 亿个参数的模型。我们的模型在文本或图像条件下生成高质量的结果，大大超越了现有方法，包括最近在类似规模下的方法。我们展示了以前的模型不提供的灵活的输出格式选择和本地 3D 编辑功能。代码、模型和数据将发布。

Title: InfinityDrive: Breaking Time Limits in Driving World Models

Authors: Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weixuan Tang, Wei Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01522
Pdf URL: https://arxiv.org/pdf/2412.01522
Copy Paste: [[2412.01522]] InfinityDrive: Breaking Time Limits in Driving World Models(https://arxiv.org/abs/2412.01522)
Keywords: generation
Abstract: Autonomous driving systems struggle with complex scenarios due to limited access to diverse, extensive, and out-of-distribution driving data, which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576×1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (approximately 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: this https URL
摘要：由于对安全导航至关重要的多样化、广泛和分布不均的驾驶数据的访问有限，自动驾驶系统难以应对复杂场景。世界模型为这一挑战提供了一个有希望的解决方案；然而，当前的驾驶世界模型受到时间窗口短和场景多样性有限的限制。为了弥补这一差距，我们推出了 InfinityDrive，这是第一个具有出色泛化能力的驾驶世界模型，在高保真度、一致性和多样性方面提供最先进的性能，并生成分钟级视频。InfinityDrive 引入了高效的时空协同建模模块和扩展的时间训练策略，能够生成具有一致空间和时间连贯性的高分辨率（576$\times$1024）视频。通过结合内存注入和保留机制以及自适应内存曲线损失来最大限度地减少累积误差，实现持续超过 1500 帧（约 2 分钟）的一致视频生成。在多个数据集中进行的综合实验验证了 InfinityDrive 生成复杂多样场景的能力，凸显了其作为为满足自动驾驶不断发展的需求而构建的下一代驾驶世界模型的潜力。我们的项目主页：此 https URL

Title: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.01564
Pdf URL: https://arxiv.org/pdf/2412.01564
Copy Paste: [[2412.01564]] Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates(https://arxiv.org/abs/2412.01564)
Keywords: generation
Abstract: The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES is well established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, when our learned discrete representations are integrated into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok yields consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
摘要：在化学信息学领域，语言模型 (LM) 使用 SMILES 和 SELFIES 等线符号来生成分子结构已经得到广泛应用。然而，扩展这些模型来生成 3D 分子结构面临着巨大的挑战。主要存在两个障碍：(1) 难以设计出确保 SE(3) 不变原子坐标的 3D 线符号；(2) 为 LM 使用连续坐标进行标记并非易事，因为 LM 本身需要离散输入。为了应对这些挑战，我们提出了 Mol-StrucTok，一种标记 3D 分子结构的新方法。我们的方法包括两个关键创新：(1) 我们通过提取球面坐标系中的局部原子坐标来设计 3D 分子的线符号。这种符号建立在现有的 2D 线符号之上，并且与它们的具体形式无关，从而确保与各种分子表示方案兼容。 (2) 我们采用矢量量化变分自动编码器 (VQ-VAE) 对这些坐标进行标记，将它们视为生成描述符。为了进一步增强表示，我们将邻域键长和键角作为理解描述符。利用这个标记框架，我们为 3D 分子生成任务训练了一个 GPT-2 样式模型。与以前的方法相比，结果显示性能强劲，生成速度明显更快，化学稳定性更具竞争力。此外，通过将我们学习到的离散表示集成到 Graphormer 模型中以对 QM9 数据集进行属性预测，Mol-StrucTok 揭示了各种分子属性的持续改进，凸显了我们方法的多功能性和稳健性。
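To make the "continuous coordinates to discrete tokens" step in the abstract above concrete, here is a minimal sketch that converts a Cartesian offset to spherical coordinates and maps each component to a bin index. Uniform binning is used here as a plainly simpler stand-in for the learned VQ-VAE codebook the abstract mentions. The function names, the `max_radius` value, and the bin count are all hypothetical, and the sketch works on raw offsets rather than the SE(3)-invariant local frames the paper describes.

```python
import math


def to_spherical(x, y, z):
    """Convert a Cartesian offset to (radius, polar angle, azimuth)."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0  # polar angle in [0, pi]
    phi = math.atan2(y, x)                      # azimuth in [-pi, pi]
    return r, theta, phi


def quantize(value, low, high, bins):
    """Map a value in [low, high] to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # keep value == high inside the last bin


def tokenize_offset(x, y, z, max_radius=4.0, bins=16):
    """Return three discrete tokens for one atom offset (uniform bins, an assumption)."""
    r, theta, phi = to_spherical(x, y, z)
    return (
        quantize(r, 0.0, max_radius, bins),
        quantize(theta, 0.0, math.pi, bins),
        quantize(phi, -math.pi, math.pi, bins),
    )


print(tokenize_offset(0.0, 0.0, 1.0))  # (4, 0, 8)
```

A learned codebook would replace `quantize` with a nearest-neighbor lookup over trained embedding vectors. The uniform grid only illustrates why a language model, which needs discrete inputs, can consume the result.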

Title: Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art

Authors: Sebastian Peitz, Sedjro Salomon Hotegni
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.01566
Pdf URL: https://arxiv.org/pdf/2412.01566
Copy Paste: [[2412.01566]] Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art(https://arxiv.org/abs/2412.01566)
Keywords: generative
Abstract: Simultaneously considering multiple objectives in machine learning has been a popular approach for several decades, with various benefits for multi-task learning, the consideration of secondary goals such as sparsity, or multicriteria hyperparameter tuning. However, because multi-objective optimization is significantly more costly than single-objective optimization, the recent focus on deep learning architectures poses considerable additional challenges due to the very large number of parameters, strong nonlinearities, and stochasticity. This survey covers recent advancements in the area of multi-objective deep learning. We introduce a taxonomy of existing methods, based on the type of training algorithm as well as the decision maker's needs, before listing recent advancements and successful applications. All three main learning paradigms (supervised learning, reinforcement learning, and unsupervised learning) are covered, and we also address the recently very popular area of generative modeling.
摘要：几十年来，同时考虑机器学习中的多个目标一直是一种流行的方法，它对多任务学习、考虑稀疏性等次要目标或多准则超参数调整具有各种好处。然而，由于多目标优化比单目标优化成本高得多，最近对深度学习架构的关注由于参数数量非常多、非线性强和随机性而带来了相当大的额外挑战。本综述涵盖了多目标深度学习领域的最新进展。我们根据训练算法的类型以及决策者的需求对现有方法进行了分类，然后列出了最近的进展和成功的应用。涵盖了所有三种主要学习范式：监督学习、强化学习和无监督学习，我们还讨论了最近非常流行的生成建模领域。

Title: 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Authors: Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Wuzong Kai, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01583
Pdf URL: https://arxiv.org/pdf/2412.01583
Copy Paste: [[2412.01583]] 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting(https://arxiv.org/abs/2412.01583)
Keywords: generative
Abstract: The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi-step, 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input this http URL proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacing, and deletion directly on Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.
摘要：传统上，3D 场景的创建既费力又费钱，需要设计师精心配置 3D 资源和环境。生成式 AI 的最新进展，包括文本转 3D 和图像转 3D 方法，大大降低了此过程的复杂性和成本。然而，当前用于编辑复杂 3D 场景的技术仍然依赖于通常交互式的多步骤、2D 到 3D 投影方法和基于扩散的技术，这些技术通常缺乏控制精度并妨碍实时性能。在这项工作中，我们提出了 3DSceneEditor，这是一个完全基于 3D 的范例，用于使用高斯溅射实时、精确编辑复杂的 3D 场景。与传统方法不同，3DSceneEditor 通过简化的 3D 管道运行，能够直接操作高斯，从而根据输入进行高效、高质量的编辑。此 http URL 提出的框架 (i) 集成了预先训练的实例分割模型用于语义标记； (ii) 采用 CLIP 的零样本基础方法将目标对象与用户提示对齐；(iii) 直接在高斯分布上应用场景修改，例如对象添加、重新定位、重新着色、替换和删除。大量实验结果表明，3DSceneEditor 相对于当前的 SOTA 3D 场景编辑方法实现了卓越的编辑精度和速度，为高效和交互式 3D 场景定制建立了新的基准。

Title: OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

Authors: Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01615
Pdf URL: https://arxiv.org/pdf/2412.01615
Copy Paste: [[2412.01615]] OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking(https://arxiv.org/abs/2412.01615)
Keywords: generative
Abstract: With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25 dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
摘要：随着生成式人工智能的快速发展及其在图像编辑中的广泛应用，数字内容的真实性和完整性出现了新的风险。现有的通用水印方法在篡改定位精度和视觉质量之间苦苦挣扎。受制于以前框架有限的灵活性，它们的局部水印必须在所有图像上保持不变。在 AIGC 编辑下，它们的版权提取准确性也不令人满意。为了应对这些挑战，我们提出了 OmniGuard，这是一种新颖的增强型通用水印方法，它将主动嵌入与被动盲提取相结合，以实现强大的版权保护和篡改定位。OmniGuard 采用混合取证框架，可灵活选择定位水印，并引入了降级感知篡改提取网络，以便在具有挑战性的条件下进行精确定位。此外，还设计了一个轻量级的 AIGC 编辑模拟层，以增强全局和局部编辑的稳健性。大量实验表明，OmniGuard 实现了卓越的保真度、稳健性和灵活性。与最近最先进的方法 EditGuard 相比，我们的方法在容器图像的 PSNR 上优于它 4.25dB，在噪声条件下的 F1 分数优于它 20.7%，平均位准确度优于它 14.8%。
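The fidelity figure in the abstract above is reported in PSNR. As a small worked sketch of that metric, the block below applies the standard PSNR definition to flattened pixel lists. The function name and the toy pixel values are hypothetical; only the formula itself is standard.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" (illustrative values only).
cover = [10.0, 20.0, 30.0, 40.0]
container = [11.0, 19.0, 30.0, 42.0]
print(psnr(cover, container))
```

Because PSNR is logarithmic in the mean squared error, a gain of 4.25 dB corresponds to an MSE lower by a factor of about 10^0.425, roughly 2.66.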

Title: Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Authors: Varun Belagali, Srikar Yellapragada, Alexandros Graikos, Saarthak Kapse, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01672
Pdf URL: https://arxiv.org/pdf/2412.01672
Copy Paste: [[2412.01672]] Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning(https://arxiv.org/abs/2412.01672)
Keywords: generative
Abstract: Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize diverse views of that image. We show that these 'self-augmentations', i.e., generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks on both natural images, which are generally object-centric, and digital histopathology images, which are typically context-based.
摘要：自监督学习 (SSL) 方法已成为强大的视觉表征学习器，它通过训练图像编码器来最大化同一图像不同视图特征之间的相似性。为了执行此视图不变性任务，当前的 SSL 算法依赖于手工制作的增强，例如随机裁剪和颜色抖动，以创建图像的多个视图。最近，生成扩散模型已被证明可以通过提供更广泛的数据增强来改进 SSL。然而，这些扩散模型需要在大规模图像文本数据集上进行预训练，这可能不适用于组织病理学等许多专业领域。在这项工作中，我们引入了 Gen-SIS，这是一种基于扩散的增强技术，专门针对未标记的图像数据进行训练，消除了对文本标题等外部监督源的任何依赖。我们首先在数据集上仅使用手工制作的增强来训练初始 SSL 编码器。然后，我们训练一个以该 SSL 编码器的嵌入为条件的扩散模型。经过训练，给定源图像的嵌入，该扩散模型可以合成其不同的视图。我们表明，这些“自我增强”，即基于原始 SSL 编码器嵌入的生成增强，有助于训练更强大的 SSL 编码器。此外，基于在编码器潜在空间中插值图像的能力，我们引入了新颖的借口任务，即解开插值合成图像的两个源图像。我们通过展示自然图像（通常以对象为中心）和数字组织病理学图像（通常基于上下文）中各种下游任务的性能改进来验证 Gen-SIS 的有效性。

Title: Driving Scene Synthesis on Free-form Trajectories with Generative Prior

Authors: Zeyu Yang, Zijie Pan, Yuankun Yang, Xiatian Zhu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01717
Pdf URL: https://arxiv.org/pdf/2412.01717
Copy Paste: [[2412.01717]] Driving Scene Synthesis on Free-form Trajectories with Generative Prior(https://arxiv.org/abs/2412.01717)
Keywords: generative
Abstract: Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, which leverages a video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we craft an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.
摘要：沿自由形式轨迹的驾驶场景合成对于驾驶模拟至关重要，以实现端到端驾驶策略的闭环评估。虽然现有方法擅长在记录轨迹上进行新视图合成，但由于驾驶视频视图有限且驾驶环境广阔，它们在新轨迹方面面临挑战。为了应对这一挑战，我们提出了一种新颖的自由形式驾驶视图合成方法，称为 DriveX，通过利用视频生成先验来优化各种轨迹的 3D 模型。具体来说，我们设计了一个逆问题，使视频扩散模型可以用作参数化 3D 模型（例如高斯分布）的多轨迹优化的先验。为了无缝使用生成先验，我们在优化过程中迭代地进行此过程。我们得到的模型可以在记录轨迹之外产生高保真的虚拟驾驶环境，从而实现自由形式轨迹驾驶模拟。除了真实的驾驶场景，DriveX 还可以用于通过 AI 生成的视频模拟虚拟驾驶世界。

Title: LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Authors: Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01720
Pdf URL: https://arxiv.org/pdf/2412.01720
Copy Paste: [[2412.01720]] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant(https://arxiv.org/abs/2412.01720)
Keywords: generative
Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMMs' retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
摘要：随着多模态信息检索的快速发展，检索任务也变得越来越复杂。现有的方法主要依赖于针对特定任务的视觉语言模型的微调，通常是使用图像-文本对比学习训练的模型。在本文中，我们探讨了将生成式大型多模态模型 (LMM) 重新用于检索的可能性。这种方法能够将所有检索任务统一在同一公式下，更重要的是，允许在无需额外训练的情况下推断出未知的检索任务。我们的贡献可以概括为以下几个方面：(i) 我们引入了 LamRA，这是一个多功能框架，旨在为 LMM 提供复杂的检索和重新排名功能。(ii) 对于检索，我们采用两阶段训练策略，包括纯语言预训练和多模态指令调整，以逐步提高 LMM 的检索性能。 (iii) 对于重新排序，我们采用逐点和逐列表重新排序的联合训练，提供两种不同的方法来进一步提高检索性能。 (iv) 大量实验结果强调了我们的方法在处理十多个检索任务中的有效性，并在监督和零样本设置中表现出稳健的性能，包括涉及以前未见过的检索任务的场景。

Title: XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01762
Pdf URL: https://arxiv.org/pdf/2412.01762
Copy Paste: [[2412.01762]] XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation(https://arxiv.org/abs/2412.01762)
Keywords: generation, generative
Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers so that the community can directly train subsequent generative models on them or fine-tune them for specialized tasks.
摘要：图像标记器在塑造后续生成模型的性能方面起着至关重要的作用。自 VQ-GAN 推出以来，离散图像标记器取得了显著的进步。架构、量化技术和训练方案的改进显著提高了图像重建和下游生成质量。在本文中，我们提出了 XQ-GAN，这是一个专为图像重建和生成任务而设计的图像标记框架。我们的框架在高度灵活和可定制的训练环境中集成了最先进的量化技术，包括矢量量化 (VQ)、残差量化 (RQ)、多尺度残差量化 (MSVQ)、乘积量化 (PQ)、无查找量化 (LFQ) 和二元球面量化 (BSQ)。在标准 ImageNet 256x256 基准上，我们发布的模型实现了 0.64 的 rFID，大大超过了 MAGVIT-v2 (0.9 rFID) 和 VAR (0.9 rFID)。此外，我们证明使用 XQ-GAN 作为标记器可以与 rFID 一起改善 gFID 指标。例如，在相同的 VAR 架构下，XQ-GAN+VAR 实现了 2.6 的 gFID，远远超过 VAR 的 3.3 gFID。为了支持进一步的研究，我们为社区提供了不同图像标记器的预训练权重，以便直接在其上训练后续的生成模型或针对专门任务进行微调。
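Two of the quantization techniques named in the abstract above, vector quantization (VQ) and residual quantization (RQ), can be sketched in a few lines. The block below is a generic textbook illustration with hand-written toy codebooks, not XQ-GAN's implementation. All function names and codebook values are hypothetical.

```python
def nearest_code(vector, codebook):
    """Vector quantization: return (index, code) of the closest codebook entry."""
    def squared_distance(i):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[i]))

    best = min(range(len(codebook)), key=squared_distance)
    return best, codebook[best]


def residual_quantize(vector, codebooks):
    """Residual quantization: quantize, subtract, then quantize what is left."""
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        index, code = nearest_code(residual, codebook)
        indices.append(index)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction


# Toy codebooks (illustrative values; real ones are learned during training).
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, -0.25]]
print(residual_quantize([1.2, 0.8], [coarse, fine]))  # ([1, 1], [1.25, 0.75])
```

With a single codebook this reduces to plain VQ. Each further stage refines the reconstruction by encoding the error left by the previous stages, which is why RQ can represent a vector more precisely with several small codebooks.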

Title: Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions

Authors: Chaoran Cheng, Boran Han, Danielle C. Maddix, Abdul Fatir Ansari, Andrew Stuart, Michael W. Mahoney, Yuyang Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.01786
Pdf URL: https://arxiv.org/pdf/2412.01786
Copy Paste: [[2412.01786]] Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions(https://arxiv.org/abs/2412.01786)
Keywords: generation, generative
Abstract: Generative models that satisfy hard constraints are crucial in many scientific and engineering applications where physical laws or system requirements must be strictly respected. However, many existing constrained generative models, especially those developed for computer vision, rely heavily on gradient information, which is often sparse or computationally expensive in fields like partial differential equations (PDEs). In this work, we introduce a novel framework for adapting pre-trained, unconstrained flow-matching models to satisfy constraints exactly in a zero-shot manner without requiring expensive gradient computations or fine-tuning. Our framework, ECI sampling, alternates between extrapolation (E), correction (C), and interpolation (I) stages during each iterative sampling step of flow matching sampling to ensure accurate integration of constraint information while preserving the validity of the generation. We demonstrate the effectiveness of our approach across various PDE systems, showing that ECI-guided generation strictly adheres to physical constraints and accurately captures complex distribution shifts induced by these constraints. Empirical results demonstrate that our framework consistently outperforms baseline approaches in various zero-shot constrained generation tasks and also achieves competitive results in the regression tasks without additional fine-tuning.
摘要：满足硬约束的生成模型在许多科学和工程应用中至关重要，因为这些应用必须严格遵守物理定律或系统要求。然而，许多现有的受约束的生成模型，尤其是为计算机视觉开发的模型，严重依赖梯度信息，而这些信息在偏微分方程 (PDE) 等领域通常是稀疏的或计算成本高昂的。在这项工作中，我们引入了一个新颖的框架，用于调整预先训练的、不受约束的流匹配模型，使其以零样本方式精确满足约束，而无需昂贵的梯度计算或微调。我们的框架 ECI 采样在流匹配采样的每个迭代采样步骤中在外推 (E)、校正 (C) 和插值 (I) 阶段之间交替，以确保准确集成约束信息，同时保持生成的有效性。我们证明了我们的方法在各种 PDE 系统中的有效性，表明 ECI 引导的生成严格遵守物理约束并准确捕捉由这些约束引起的复杂分布变化。实证结果表明，我们的框架在各种零样本约束生成任务中始终优于基线方法，并且无需额外的微调即可在回归任务中取得有竞争力的结果。
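The alternation of extrapolation, correction, and interpolation described in the abstract above can be sketched as a toy loop. This is one reading of the abstract and rests on several assumptions: a linear interpolation path between noise and data, a hand-written velocity field in place of a trained flow-matching model, and a constraint that simply pins one value. All names are hypothetical, and the paper's actual algorithm may differ in its details.

```python
def eci_sample(noise, velocity, correct, steps=10):
    """Toy gradient-free constrained sampling loop (a sketch, not the paper's code)."""
    x = list(noise)
    for k in range(steps):
        t = k / steps
        t_next = (k + 1) / steps
        v = velocity(x, t)
        # E: extrapolate the current state to an estimate of the final sample.
        x1_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
        # C: enforce the hard constraint directly on that estimate.
        x1_hat = correct(x1_hat)
        # I: interpolate back onto the noise-to-data path at the next time.
        x = [(1.0 - t_next) * n + t_next * x1 for n, x1 in zip(noise, x1_hat)]
    return x


TARGET = [1.0, 2.0, 3.0]  # stand-in for what a trained model would generate


def toy_velocity(x, t):
    """Velocity of a linear path that ends at TARGET (assumption, not a trained model)."""
    return [(target - xi) / (1.0 - t) for target, xi in zip(TARGET, x)]


def pin_first_to_zero(x):
    """Example hard constraint: the first entry (e.g. a boundary value) must be 0."""
    return [0.0] + list(x[1:])


sample = eci_sample([0.5, -0.5, 0.25], toy_velocity, pin_first_to_zero)
print(sample)
```

No gradient of the constraint is ever taken. The correction step is a direct projection, which is the sense in which such a scheme is gradient-free, and the final sample satisfies the pinned value exactly because the last interpolation step lands on the corrected estimate.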

Title: Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Authors: Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.01787
Pdf URL: https://arxiv.org/pdf/2412.01787
Copy Paste: [[2412.01787]] Pretrained Reversible Generation as Unsupervised Visual Representation Learning(https://arxiv.org/abs/2412.01787)
Keywords: generation, generative
Abstract: Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model-based methods, including 78% top-1 accuracy on ImageNet. Extensive ablation studies further validate the effectiveness of our approach.
摘要：最近基于分数匹配和流匹配的生成模型显著推进了生成任务，但它们在判别任务中的潜力仍未得到充分开发。由于设计复杂，以前的方法（例如生成分类器）尚未充分利用这些模型在判别任务中的能力。我们提出了预训练可逆生成 (PRG)，它通过逆转预训练连续流模型的生成过程来提取无监督表示。PRG 有效地重用了无监督生成模型，利用其高容量作为下游任务的稳健且可泛化的特征提取器。我们的方法在多个基准测试中始终优于之前的方法，在基于生成模型的方法中实现了最先进的性能，包括 ImageNet 上 78% 的 top-1 准确率。广泛的消融研究进一步验证了我们方法的有效性。

Title: IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

Authors: Khaled Abud, Sergey Lavrushkin, Alexey Kirillov, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01794
Pdf URL: https://arxiv.org/pdf/2412.01794
Copy Paste: [[2412.01794]] IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models(https://arxiv.org/abs/2412.01794)
Keywords: generation, generative, quality assessment
Abstract: Diffusion-based models have recently transformed conditional image generation, achieving unprecedented fidelity in generating photorealistic and semantically accurate images. However, consistently generating high-quality images remains challenging, partly due to the lack of mechanisms for conditioning outputs on perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. First, we experiment with gradient-based guidance to optimize image quality directly and show this approach has limited generalizability. To address this, we introduce IQA-Adapter, a novel architecture that conditions generation on target quality levels by learning the relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter shifts the distribution of generated images towards a higher-quality subdomain. This approach achieves up to a 10% improvement across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. Additionally, IQA-Adapter can be used inversely as a degradation model, generating progressively more distorted images when conditioned on lower quality scores. Our quality-aware methods also provide insights into the adversarial robustness of IQA models, underscoring the potential of quality conditioning in generative modeling and the importance of robust IQA methods.
摘要：基于扩散的模型最近改变了条件图像生成，在生成照片般逼真且语义准确的图像方面实现了前所未有的保真度。然而，持续生成高质量图像仍然具有挑战性，部分原因是缺乏根据感知质量调节输出的机制。在这项工作中，我们提出了将图像质量评估 (IQA) 模型集成到基于扩散的生成器中的方法，从而实现质量感知图像生成。首先，我们尝试使用基于梯度的指导来直接优化图像质量，并表明这种方法的通用性有限。为了解决这个问题，我们引入了 IQA-Adapter，这是一种新颖的架构，它通过学习图像和质量分数之间的关系，根据目标质量水平来调节生成。当以高目标质量为条件时，IQA-Adapter 会将生成的图像的分布转移到更高质量的子域。这种方法在多个客观指标上实现了高达 10% 的改进，这已得到主观研究的证实，同时保留了生成的多样性和内容。此外，IQA-Adapter 可以反向用作退化模型，当以较低的质量分数为条件时，会逐渐生成更加扭曲的图像。我们的质量感知方法还提供了对 IQA 模型对抗鲁棒性的见解，强调了质量条件在生成建模中的潜力以及稳健的 IQA 方法的重要性。

Title: SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Authors: Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01801
Pdf URL: https://arxiv.org/pdf/2412.01801
Copy Paste: [[2412.01801]] SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation(https://arxiv.org/abs/2412.01801)
Keywords: generation
Abstract: We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes, which enables controllable editing of generated scenes by adding, removing, or resizing the semantic 3D proxy boxes that guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our factored diffusion approach enables high-fidelity 3D scene synthesis with effective controllable editing.
摘要：我们提出了 SceneFactor，这是一种基于扩散的大规模 3D 场景生成方法，可实现可控生成和轻松编辑。SceneFactor 通过我们的因子扩散公式实现文本引导的 3D 场景合成，利用潜在语义和几何流形生成任意大小的 3D 场景。虽然文本输入可以轻松实现可控生成，但文本引导对于直观、局部编辑和操作生成的 3D 场景仍然不够精确。我们的因子语义扩散生成由语义 3D 框组成的代理语义空间，通过添加、删除、更改引导高保真、一致的 3D 几何编辑的语义 3D 代理框的大小，可以对生成的场景进行可控编辑。大量实验表明，我们的方法通过因子扩散方法实现了高保真 3D 场景合成和有效的可控编辑。

Title: Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Authors: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01819
Pdf URL: https://arxiv.org/pdf/2412.01819
Copy Paste: [[2412.01819]] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis(https://arxiv.org/abs/2412.01819)
Keywords: generation
Abstract: This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation this http URL, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7× faster.
摘要：这项工作提出了 Switti，一种用于文本到图像生成的尺度变换器。从现有的下一个尺度预测 AR 模型开始，我们首先探索它们的 T2I 生成，并提出架构修改以提高它们的收敛性和整体性能。然后，我们观察到我们预训练的尺度 AR 模型的自注意力图对先前尺度的依赖性较弱。基于这一见解，我们提出了一种非 AR 对应物，以促进 ${\sim}11\%$ 更快的采样和更低的内存使用量，同时实现略微更好的生成质量。此外，我们发现高分辨率尺度下的无分类器指导通常是不必要的，甚至会降低性能。通过禁用这些尺度上的指导，我们实现了额外的 ${\sim}20\%$ 采样加速并改进了细粒度细节的生成。大量的人类偏好研究和自动评估表明，Switti 的表现优于现有的 T2I AR 模型，并可与最先进的 T2I 扩散模型相媲美，同时速度最高可达 $7{\times}$。

Title: Towards Universal Soccer Video Understanding

Authors: Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01820
Pdf URL: https://arxiv.org/pdf/2412.01820
Copy Paste: [[2412.01820]] Towards Universal Soccer Video Understanding(https://arxiv.org/abs/2412.01820)
Keywords: generation
Abstract: As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models and demonstrating the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.
摘要：足球是全球著名的运动，吸引了世界各地球迷的广泛关注。本文旨在开发一个全面的多模态足球视频理解框架。具体来说，我们在本文中做出了以下贡献：（i）我们引入了迄今为止最大的多模态足球数据集 SoccerReplay-1988，其中包含 1,988 场完整比赛的视频和详细注释，并配有自动注释流程；（ii）我们提出了足球领域的第一个视觉语言基础模型 MatchVision，该模型利用足球视频中的时空信息，在各种下游任务中表现出色；（iii）我们对动作分类、评论生成和多视角犯规识别进行了广泛的实验和消融研究，并在所有这些方面都展示了最先进的性能，大大优于现有模型，这证明了我们提出的数据和模型的优越性。我们相信这项工作将为体育理解研究提供一个标准范例。代码和模型将公开供复制。

Title: World-consistent Video Diffusion with Explicit 3D Modeling

Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.01821
Pdf URL: https://arxiv.org/pdf/2412.01821
Copy Paste: [[2412.01821]] World-consistent Video Diffusion with Explicit 3D Modeling(https://arxiv.org/abs/2412.01821)
Keywords: generation
Abstract: Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
摘要：扩散模型的最新进展为图像和视频生成树立了新的标杆，使单帧和多帧环境中的逼真视觉合成成为可能。然而，这些模型仍然难以有效、明确地生成 3D 一致的内容。为了解决这个问题，我们提出了世界一致性视频扩散 (WVD)，这是一个新颖的框架，它结合了使用 XYZ 图像的显式 3D 监督，为每个图像像素编码全局 3D 坐标。更具体地说，我们训练扩散变换器来学习 RGB 和 XYZ 帧的联合分布。这种方法通过灵活的修复策略支持多任务适应性。例如，WVD 可以从地面实况 RGB 估计 XYZ 帧，或者使用沿指定相机轨迹的 XYZ 投影生成新的 RGB 帧。通过这样做，WVD 统一了单图像到 3D 生成、多视图立体和相机控制的视频生成等任务。我们的方法在多个基准测试中表现出了具有竞争力的性能，并通过单个预训练模型为 3D 一致的视频和图像生成提供了可扩展的解决方案。

Title: X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2412.01824
Pdf URL: https://arxiv.org/pdf/2412.01824
Copy Paste: [[2412.01824]] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models(https://arxiv.org/abs/2412.01824)
Keywords: generation
Abstract: In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
摘要：上下文生成是大型语言模型 (LLM) 开放任务泛化能力的关键组成部分。通过利用一些示例作为上下文，LLM 可以执行域内和域外任务。基于 LLM 构建的自回归视觉语言模型 (VLM) 的最新进展在文本到图像生成方面展示了令人印象深刻的性能。然而，上下文学习在一般图像生成任务中的潜力仍未得到充分开发。为了解决这个问题，我们推出了 X-Prompt，这是一种纯自回归大型视觉语言模型，旨在在统一的上下文学习框架内，在广泛的可见和不可见图像生成任务中提供具有竞争力的性能。X-Prompt 采用专门的设计，可以有效地压缩上下文示例中的有价值特征，支持更长的上下文标记序列，并提高其泛化到不可见任务的能力。文本和图像预测的统一训练任务使 X-Prompt 能够处理一般图像生成，并增强上下文示例中的任务感知能力。大量实验验证了该模型在各种可见图像生成任务中的性能及其推广到以前未见的任务的能力。

Title: RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Authors: Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.01827
Pdf URL: https://arxiv.org/pdf/2412.01827
Copy Paste: [[2412.01827]] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders(https://arxiv.org/abs/2412.01827)
Keywords: generation
Abstract: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences, a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at this https URL.
摘要：我们推出了 RandAR，这是一种仅使用解码器的视觉自回归 (AR) 模型，能够以任意 token 顺序生成图像。与之前依赖预定义生成顺序的仅使用解码器的 AR 模型不同，RandAR 消除了这种归纳偏差，从而解锁了仅使用解码器的生成的新功能。我们的基本设计通过在每个要预测的图像 token 之前插入一个“位置指令 token”来实现随机顺序，表示下一个图像 token 的空间位置。RandAR 在随机排列的 token 序列上进行训练（这是一项比固定顺序生成更具挑战性的任务），但它实现了与传统光栅顺序相当的性能。更重要的是，从随机顺序训练的仅使用解码器的 Transformer 获得了新功能。针对 AR 模型的效率瓶颈，RandAR 在推理时采用 KV-Cache 并行解码，在不牺牲生成质量的情况下享受 2.5 倍的加速。此外，RandAR 以零样本方式支持图像修复 (inpainting)、图像外扩 (outpainting) 和分辨率外推。我们希望 RandAR 能够为仅使用解码器的视觉生成模型带来新的方向，并拓宽它们在不同场景中的应用。我们的项目页面位于这个 https URL。
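
Note: The RandAR abstract above describes inserting a "position instruction token" before each image token so that tokens can be predicted in a random order. The sketch below only illustrates that interleaving idea on a toy list of token ids. It is not the authors' implementation: the function name build_random_order_sequence, the ("pos", ...) / ("img", ...) tuple encoding, and the flat raster index used as the position are all assumptions made for illustration.

```python
import random


def build_random_order_sequence(image_tokens, seed=None):
    """Interleave hypothetical position-instruction tokens with image tokens.

    image_tokens: list of token ids in raster order (index = spatial location).
    Returns the sampled order and the interleaved sequence, where each image
    token is preceded by a marker naming the location it belongs to.
    """
    rng = random.Random(seed)
    order = list(range(len(image_tokens)))
    rng.shuffle(order)  # random permutation of spatial locations

    sequence = []
    for pos in order:
        sequence.append(("pos", pos))                # which location comes next
        sequence.append(("img", image_tokens[pos]))  # the token at that location
    return order, sequence


tokens = [101, 102, 103, 104]  # toy 2x2 image, raster order
order, sequence = build_random_order_sequence(tokens, seed=0)
for item in sequence:
    print(item)
```

Because every image token carries its location in the preceding marker, the original raster layout can be recovered from the sequence regardless of the permutation that was sampled.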