2024-12-18

Title: SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout

Authors: Chiyu Max Jiang, Yijing Bai, Andre Cornman, Christopher Davis, Xiukun Huang, Hong Jeon, Sakshum Kulshrestha, John Lambert, Shuangyu Li, Xuanyu Zhou, Carlos Fuertes, Chang Yuan, Mingxing Tan, Yin Zhou, Dragomir Anguelov
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.12129
Pdf URL: https://arxiv.org/pdf/2412.12129
Copy Paste: [[2412.12129]] SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout(https://arxiv.org/abs/2412.12129)
Keywords: generation
Abstract: Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the closed-loop simulation of agent behaviors. While diffusion models have proven effective at learning realistic and multimodal agent distributions, several challenges remain, including controllability, maintaining realism in closed-loop simulations, and ensuring inference efficiency. To address these issues, we introduce amortized diffusion for simulation. This novel diffusion denoising paradigm amortizes the computational cost of denoising over future simulation steps, significantly reducing the cost per rollout step (16x fewer inference steps) while also mitigating closed-loop errors. We further enhance controllability through the introduction of generalized hard constraints, a simple yet effective inference-time constraint mechanism, as well as language-based constrained scene generation via few-shot prompting of a large language model (LLM). Our investigations into model scaling reveal that increased computational resources significantly improve overall simulation realism. We demonstrate the effectiveness of our approach on the Waymo Open Sim Agents Challenge, achieving top open-loop performance and the best closed-loop performance among diffusion models.

Title: Climate Aware Deep Neural Networks (CADNN) for Wind Power Simulation

Authors: Ali Forootani, Danial Esmaeili Aliabadi, Daniela Thraen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12160
Pdf URL: https://arxiv.org/pdf/2412.12160
Copy Paste: [[2412.12160]] Climate Aware Deep Neural Networks (CADNN) for Wind Power Simulation(https://arxiv.org/abs/2412.12160)
Keywords: generation
Abstract: Wind power forecasting plays a critical role in modern energy systems, facilitating the integration of renewable energy sources into the power grid. Accurate prediction of wind energy output is essential for managing the inherent intermittency of wind power, optimizing energy dispatch, and ensuring grid stability. This paper proposes the use of Deep Neural Network (DNN)-based predictive models that leverage climate datasets, including wind speed, atmospheric pressure, temperature, and other meteorological variables, to improve the accuracy of wind power simulations. In particular, we focus on the Coupled Model Intercomparison Project (CMIP) datasets, which provide climate projections, as inputs for training the DNN models. These models aim to capture the complex nonlinear relationships between the CMIP-based climate data and actual wind power generation at wind farms located in Germany. Our study compares various DNN architectures, specifically Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, and Transformer-enhanced LSTM models, to identify the best configuration among these architectures for climate-aware wind power simulation. The implementation of this framework involves the development of a Python package (CADNN) designed to support multiple tasks, including statistical analysis of the climate data, data visualization, preprocessing, DNN training, and performance evaluation. We demonstrate that the DNN models, when integrated with climate data, significantly enhance forecasting accuracy. This climate-aware approach offers a deeper understanding of the time-dependent climate patterns that influence wind power generation, providing more accurate predictions and making it adaptable to other geographical regions.

Title: Towards LLM-based optimization compilers. Can LLMs learn how to apply a single peephole optimization? Reasoning is all LLMs need!

Authors: Xiangxin Fang, Lev Mukhanov
Subjects: cs.LG, cs.AI, cs.PL
Abstract URL: https://arxiv.org/abs/2412.12163
Pdf URL: https://arxiv.org/pdf/2412.12163
Copy Paste: [[2412.12163]] Towards LLM-based optimization compilers. Can LLMs learn how to apply a single peephole optimization? Reasoning is all LLMs need!(https://arxiv.org/abs/2412.12163)
Keywords: generation
Abstract: Large Language Models (LLMs) have demonstrated great potential in various language processing tasks, and recent studies have explored their application in compiler optimizations. However, all these studies focus on the conventional open-source LLMs, such as Llama2, which lack enhanced reasoning mechanisms. In this study, we investigate the errors produced by the fine-tuned 7B-parameter Llama2 model as it attempts to learn and apply a simple peephole optimization for the AArch64 assembly code. We provide an analysis of the errors produced by the LLM and compare it with state-of-the-art OpenAI models which implement advanced reasoning logic, including GPT-4o and GPT-o1 (preview). We demonstrate that OpenAI GPT-o1, despite not being fine-tuned, outperforms the fine-tuned Llama2 and GPT-4o. Our findings indicate that this advantage is largely due to the chain-of-thought reasoning implemented in GPT-o1. We hope our work will inspire further research on using LLMs with enhanced reasoning mechanisms and chain-of-thought for code generation and optimization.
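For readers unfamiliar with the term, a peephole optimization rewrites a short window of adjacent instructions into a cheaper equivalent. The abstract does not say which rule the paper studies, so the sketch below uses a hypothetical rule (dropping a load that immediately follows a store of the same register to the same address) on simplified AArch64-like text. The function name and the instruction format are illustrative assumptions, not the paper's setup.

```python
def remove_redundant_loads(instructions):
    """Drop `ldr R, [addr]` when it directly follows `str R, [addr]`.

    Hypothetical rule for illustration; it assumes ordinary (non-volatile)
    memory, so the register already holds the stored value.
    """
    optimized = []
    for ins in instructions:
        if optimized:
            prev = optimized[-1]
            # Two-instruction window: previous store, current load.
            if prev.startswith("str ") and ins.startswith("ldr "):
                if prev[4:] == ins[4:]:  # same register and same address
                    continue
        optimized.append(ins)
    return optimized


program = [
    "str w0, [sp, #12]",
    "ldr w0, [sp, #12]",
    "add w0, w0, #1",
]
print(remove_redundant_loads(program))
```

A load into a different register, or from a different address, falls outside the rule and is left untouched.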

Title: Multimodal Approaches to Fair Image Classification: An Ethical Perspective

Authors: Javon Hickmon
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12165
Pdf URL: https://arxiv.org/pdf/2412.12165
Copy Paste: [[2412.12165]] Multimodal Approaches to Fair Image Classification: An Ethical Perspective(https://arxiv.org/abs/2412.12165)
Keywords: generation
Abstract: In the rapidly advancing field of artificial intelligence, machine perception is becoming paramount to achieving increased performance. Image classification systems are becoming increasingly integral to various applications, ranging from medical diagnostics to image generation; however, these systems often exhibit harmful biases that can lead to unfair and discriminatory outcomes. Machine Learning systems that depend on a single data modality, i.e. only images or only text, can exaggerate hidden biases present in the training data, if the data is not carefully balanced and filtered. Even so, these models can still harm underrepresented populations when used in improper contexts, such as when government agencies reinforce racial bias using predictive policing. This thesis explores the intersection of technology and ethics in the development of fair image classification models. Specifically, I focus on improving fairness and methods of using multiple modalities to combat harmful demographic bias. Integrating multimodal approaches, which combine visual data with additional modalities such as text and metadata, allows this work to enhance the fairness and accuracy of image classification systems. The study critically examines existing biases in image datasets and classification algorithms, proposes innovative methods for mitigating these biases, and evaluates the ethical implications of deploying such systems in real-world scenarios. Through comprehensive experimentation and analysis, the thesis demonstrates how multimodal techniques can contribute to more equitable and ethical AI solutions, ultimately advocating for responsible AI practices that prioritize fairness.

Title: Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning

Authors: Melanie Sclar, Jane Yu, Maryam Fazel-Zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, Asli Celikyilmaz
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.12175
Pdf URL: https://arxiv.org/pdf/2412.12175
Copy Paste: [[2412.12175]] Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning(https://arxiv.org/abs/2412.12175)
Keywords: generation
Abstract: Do large language models (LLMs) have theory of mind? A plethora of papers and benchmarks have been introduced to evaluate if current models have been able to develop this key ability of social intelligence. However, all rely on limited datasets with simple patterns that can potentially lead to problematic blind spots in evaluation and an overestimation of model capabilities. We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data for robust training and evaluation. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios to stress test the limits of LLMs. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data, highlighting the need for more robust theory of mind evaluation. As our generations are a conceptual superset of prior work, fine-tuning on our data yields a 27-point accuracy improvement on the classic ToMi benchmark (Le et al., 2019). ExploreToM also enables uncovering underlying skills and factors missing for models to show theory of mind, such as unreliable state tracking or data imbalances, which may contribute to models' poor performance on benchmarks.

Title: Adopting Explainable-AI to investigate the impact of urban morphology design on energy and environmental performance in dry-arid climates

Authors: Pegah Eshraghi, Riccardo Talami, Arman Nikkhah Dehnavi, Maedeh Mirdamadi, Zahra-Sadat Zomorodian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12183
Pdf URL: https://arxiv.org/pdf/2412.12183
Copy Paste: [[2412.12183]] Adopting Explainable-AI to investigate the impact of urban morphology design on energy and environmental performance in dry-arid climates(https://arxiv.org/abs/2412.12183)
Keywords: generation
Abstract: In rapidly urbanizing regions, designing climate-responsive urban forms is crucial for sustainable development, especially in dry-arid climates where urban morphology has a significant impact on energy consumption and environmental performance. This study advances urban morphology evaluation by combining Urban Building Energy Modeling (UBEM) with machine learning (ML) methods and Explainable AI techniques, specifically Shapley Additive Explanations (SHAP). Using Tehran's dense urban landscape as a case study, this research assesses and ranks the impact of 30 morphology parameters at the urban block level on key energy metrics (cooling, heating, and lighting demand) and environmental performance (sunlight exposure, photovoltaic generation, and Sky View Factor). Among seven ML algorithms evaluated, the XGBoost model was the most effective predictor, achieving high accuracy (R²: 0.92) and a training time of 3.64 seconds. Findings reveal that building shape, window-to-wall ratio, and commercial ratio are the most critical parameters affecting energy efficiency, while the heights and distances of neighboring buildings strongly influence cooling demand and solar access. By evaluating urban blocks with varied densities and configurations, this study offers generalizable insights applicable to other dry-arid regions. Moreover, the integration of UBEM and Explainable AI offers a scalable, data-driven framework for developing climate-responsive urban designs adaptable to high-density environments worldwide.
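To make the Shapley attribution idea concrete, the sketch below computes exact Shapley values by brute force over feature orderings. It is not the SHAP library's algorithm and not the paper's XGBoost model; the toy model, its coefficients, and the three feature names are hypothetical stand-ins chosen only for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Averages each feature's marginal contribution over every ordering
    in which features are switched from the baseline value to x.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev_out = model(current)
        for i in order:
            current[i] = x[i]         # reveal feature i
            out = model(current)
            phi[i] += out - prev_out  # marginal contribution of feature i
            prev_out = out
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical 'cooling demand' model with one interaction term."""
    window_to_wall, height, commercial = features
    return 10.0 * window_to_wall + 0.5 * height + 2.0 * window_to_wall * commercial


x = [0.4, 20.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is split evenly between the two features involved. Brute force scales factorially with the number of features, which is why practical tools rely on faster model-specific or sampling-based estimators.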

Title: Multi-Surrogate-Teacher Assistance for Representation Alignment in Fingerprint-based Indoor Localization

Authors: Son Minh Nguyen, Linh Duy Tran, Duc Viet Le, Paul J.M. Havinga
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12189
Pdf URL: https://arxiv.org/pdf/2412.12189
Copy Paste: [[2412.12189]] Multi-Surrogate-Teacher Assistance for Representation Alignment in Fingerprint-based Indoor Localization(https://arxiv.org/abs/2412.12189)
Keywords: generative
Abstract: Despite remarkable progress in knowledge transfer across visual and textual domains, extending these achievements to indoor localization, particularly for learning transferable representations among Received Signal Strength (RSS) fingerprint datasets, remains a challenge. This is due to inherent discrepancies among these RSS datasets, chiefly variations in building structure and in the number and placement of WiFi anchors. Accordingly, specialized networks, which lack the ability to discern transferable representations, readily incorporate environment-sensitive clues into the learning process, limiting their potential when applied to specific RSS datasets. In this work, we propose a plug-and-play (PnP) framework of knowledge transfer that facilitates the exploitation of transferable representations for specialized networks directly on target RSS datasets through two main phases. First, we design an Expert Training phase, which features multiple surrogate generative teachers, all serving as a global adapter that homogenizes the input disparities among independent source RSS datasets while preserving their unique characteristics. In a subsequent Expert Distilling phase, we introduce a triplet of underlying constraints that require minimizing the differences in essential knowledge between the specialized network and the surrogate teachers by refining its representation learning on the target dataset. This process implicitly fosters a representational alignment in a way that is less sensitive to specific environmental dynamics. Extensive experiments conducted on three benchmark WiFi RSS fingerprint datasets underscore the effectiveness of the framework, which draws out the full potential of specialized networks in localization.

Title: Provably Secure Robust Image Steganography via Cross-Modal Error Correction

Authors: Yuang Qi, Kejiang Chen, Na Zhao, Zijin Yang, Weiming Zhang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2412.12206
Pdf URL: https://arxiv.org/pdf/2412.12206
Copy Paste: [[2412.12206]] Provably Secure Robust Image Steganography via Cross-Modal Error Correction(https://arxiv.org/abs/2412.12206)
Keywords: generation
Abstract: The rapid development of image generation models has facilitated the widespread dissemination of generated images on social networks, creating favorable conditions for provably secure image steganography. However, existing methods face issues such as low quality of generated images and lack of semantic control in the generation process. To leverage provably secure steganography with more effective and high-performance image generation models, and to ensure that secret messages can still be accurately extracted from stego images even after they are uploaded to social networks and subjected to lossy processing such as JPEG compression, we propose a high-quality, provably secure, and robust image steganography method based on state-of-the-art autoregressive (AR) image generation models using Vector-Quantized (VQ) tokenizers. Additionally, we employ a cross-modal error-correction framework that generates stego text from stego images to aid in restoring lossy images, ultimately enabling the extraction of secret messages embedded within the images. Extensive experiments have demonstrated that the proposed method provides advantages in stego quality, embedding capacity, and robustness, while ensuring provable undetectability.

Title: Can video generation replace cinematographers? Research on the cinematic language of generated video

Authors: Xiaozhe Li, Kai WU, Siyi Yang, YiZhan Qu, Guohua Zhang, Zhiyu Chen, Jiayao Li, Jiangchuan Mu, Xiaobin Hu, Wen Fang, Mingliang Xiong, Hao Deng, Qingwen Liu, Gang Li, Bin He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12223
Pdf URL: https://arxiv.org/pdf/2412.12223
Copy Paste: [[2412.12223]] Can video generation replace cinematographers? Research on the cinematic language of generated video(https://arxiv.org/abs/2412.12223)
Keywords: generation
Abstract: Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance the visual coherence of videos generated from textual descriptions. However, most research has primarily focused on object motion, with limited attention given to cinematic language in videos, which is crucial for cinematographers to convey emotion and narrative pacing. To address this limitation, we propose a threefold approach to enhance the ability of T2V models to generate controllable cinematic language. Specifically, we introduce a cinematic language dataset that encompasses shot framing, angle, and camera movement, enabling models to learn diverse cinematic styles. Building on this, to facilitate robust cinematic alignment evaluation, we present CameraCLIP, a model fine-tuned on the proposed dataset that excels in understanding complex cinematic language in generated videos and can further provide valuable guidance in the multi-shot composition process. Finally, we propose CLIPLoRA, a cost-guided dynamic LoRA composition method that facilitates smooth transitions and realistic blending of cinematic language by dynamically fusing multiple pre-trained cinematic LoRAs within a single video. Our experiments demonstrate that CameraCLIP outperforms existing models in assessing the alignment between cinematic language and video, achieving an R@1 score of 0.81. Additionally, CLIPLoRA improves the ability for multi-shot composition, potentially bridging the gap between automatically generated videos and those shot by professional cinematographers.
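The R@1 figure quoted above is a retrieval metric: the fraction of queries whose top-ranked candidate is the correct match. The sketch below shows one common way to compute recall@k from a similarity matrix. The pairing convention (query i matches candidate i) and the toy scores are assumptions for illustration, not CameraCLIP's evaluation code.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: queries 0 and 2 rank their own candidate first, query 1 does not.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```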

Title: You Only Submit One Image to Find the Most Suitable Generative Model

Authors: Zhi Zhou, Lan-Zhe Guo, Peng-Xiao Song, Yu-Feng Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12232
Pdf URL: https://arxiv.org/pdf/2412.12232
Copy Paste: [[2412.12232]] You Only Submit One Image to Find the Most Suitable Generative Model(https://arxiv.org/abs/2412.12232)
Keywords: generation, generative
Abstract: Deep generative models have achieved promising results in image generation, and various generative model hubs, e.g., Hugging Face and Civitai, have been developed that enable model developers to upload models and users to download models. However, these hubs lack advanced model management and identification mechanisms, so users can only search for models through text matching, download sorting, and similar means, which makes it difficult to efficiently find the model that best meets their requirements. In this paper, we propose a novel setting called Generative Model Identification (GMI), which aims to let users efficiently identify the most appropriate generative model(s) for their requirements from a large number of candidate models. To the best of our knowledge, this setting has not been studied before. We introduce a comprehensive solution consisting of three pivotal modules: a weighted Reduced Kernel Mean Embedding (RKME) framework for capturing the generated image distribution and the relationship between images and prompts, a pre-trained vision-language model aimed at addressing dimensionality challenges, and an image interrogator designed to tackle cross-modality issues. Extensive empirical results demonstrate that the proposal is both efficient and effective. For example, users only need to submit a single example image to describe their requirements, and the model platform can achieve an average top-4 identification accuracy of more than 80%.
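As a rough illustration of the kernel mean embedding idea behind RKME, the sketch below summarizes each model by a small weighted set of feature points and scores a query feature against that summary with an RBF kernel. The abstract does not describe the paper's weighting scheme or feature extractor, so the hub contents, the 2-D features, and all function names here are hypothetical.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def embedding_score(query, reduced_set):
    """Evaluate a reduced kernel mean embedding at `query`.

    reduced_set is a list of (weight, point) pairs summarizing the
    distribution of one model's generated-image features.
    """
    return sum(beta * rbf(query, z) for beta, z in reduced_set)


def rank_models(query, hub):
    """Order model names from best to worst match for the query feature."""
    return sorted(hub, key=lambda name: embedding_score(query, hub[name]), reverse=True)


# Hypothetical hub: two models, each summarized by two weighted points.
hub = {
    "model_a": [(0.5, [0.0, 0.0]), (0.5, [0.2, 0.1])],
    "model_b": [(0.5, [3.0, 3.0]), (0.5, [2.8, 3.1])],
}
query = [0.1, 0.0]  # feature of the single example image (assumed precomputed)
print(rank_models(query, hub))
```

In this sketch the query feature lies close to `model_a`'s summary points, so that model is ranked first.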

Title: Deep Learning for Hydroelectric Optimization: Generating Long-Term River Discharge Scenarios with Ensemble Forecasts from Global Circulation Models

Authors: Julio Alberto Silva Dias
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12234
Pdf URL: https://arxiv.org/pdf/2412.12234
Copy Paste: [[2412.12234]] Deep Learning for Hydroelectric Optimization: Generating Long-Term River Discharge Scenarios with Ensemble Forecasts from Global Circulation Models(https://arxiv.org/abs/2412.12234)
Keywords: generation
Abstract: Hydroelectric power generation is a critical component of the global energy matrix, particularly in countries like Brazil, where it represents the majority of the energy supply. However, its strong dependence on river discharges, which are inherently uncertain due to climate variability, poses significant challenges. River discharges are linked to precipitation patterns, making the development of accurate probabilistic forecasting models crucial for improving operational planning in systems heavily reliant on this resource. Traditionally, statistical models have been used to represent river discharges in energy optimization. Yet, these models are increasingly unable to produce realistic scenarios due to structural shifts in climate behavior. Changes in precipitation patterns have altered discharge dynamics, which traditional approaches struggle to capture. Machine learning methods, while effective as universal predictors for time series, often focus solely on historical data, ignoring key external factors such as meteorological and climatic conditions. Furthermore, these methods typically lack a probabilistic framework, which is vital for representing the inherent variability of hydrological processes. The limited availability of historical discharge data further complicates the application of large-scale deep learning models to this domain. To address these challenges, we propose a framework based on a modified recurrent neural network architecture. This model generates parameterized probability distributions conditioned on projections from global circulation models, effectively accounting for the stochastic nature of river discharges. Additionally, the architecture incorporates enhancements to improve its generalization capabilities. We validate this framework within the Brazilian Interconnected System, using projections from the SEAS5-ECMWF system as conditional variables.

Title: OmniPrism: Learning Disentangled Visual Concept for Image Generation

Authors: Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12242
Pdf URL: https://arxiv.org/pdf/2412.12242
Copy Paste: [[2412.12242]] OmniPrism: Learning Disentangled Visual Concept for Image Generation(https://arxiv.org/abs/2412.12242)
Keywords: generation
Abstract: Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept such as content, style, and composition. We learn disentangled concept representations through our contrastive orthogonal disentangled (COD) training pipeline, which are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion models. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts.

Title: Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Authors: Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12278
Pdf URL: https://arxiv.org/pdf/2412.12278
Copy Paste: [[2412.12278]] Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content(https://arxiv.org/abs/2412.12278)
Keywords: generative
Abstract: Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
摘要：现有的 DeepFake 检测技术主要侧重于面部操作，例如换脸或口型同步。然而，文本到视频 (T2V) 和图像到视频 (I2V) 生成模型的进步现在允许完全由 AI 生成的合成内容和无缝背景更改，这对以面部为中心的检测方法提出了挑战，并要求采用更通用的方法。为了解决这个问题，我们引入了用于 \underline{U}universal \underline{N}etwork for \underline{I}identifying \underline{T}ampered and synth\underline{E}tic videos (\texttt{UNITE}) 模型，与传统检测器不同，该模型可以捕获全帧操作。 \texttt{UNITE} 将检测功能扩展到没有面部、非人类主体和复杂背景修改的场景。它利用基于转换器的架构，通过 SigLIP-So400M 基础模型处理从视频中提取的与领域无关的特征。鉴于包含面部/背景更改和 T2V/I2V 内容的数据集有限，我们在训练中将与任务无关的数据与标准 DeepFake 数据集整合在一起。我们通过结合注意力多样性 (AD) 损失来进一步缓解模型过度关注面部的倾向，这促进了视频帧之间的多样化空间注意力。将 AD 损失与交叉熵相结合可提高不同环境下的检测性能。比较评估表明，\texttt{UNITE} 在具有面部/背景操作和完全合成的 T2V/I2V 视频的数据集（在跨数据设置中）上的表现优于最先进的检测器，展示了其适应性和可推广的检测能力。

Title: RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems

Authors: Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis (Yiannis) Kompatsiaris
Subjects: cs.LG, cs.AI, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2412.12322
Pdf URL: https://arxiv.org/pdf/2412.12322
Copy Paste: [[2412.12322]] RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems(https://arxiv.org/abs/2412.12322)
Keywords: generation
Abstract: We present RAG Playground, an open-source framework for systematic evaluation of Retrieval-Augmented Generation (RAG) systems. The framework implements and compares three retrieval approaches: naive vector search, reranking, and hybrid vector-keyword search, combined with ReAct agents using different prompting strategies. We introduce a comprehensive evaluation framework with novel metrics and provide empirical results comparing different language models (Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our experiments demonstrate significant performance improvements through hybrid search methods and structured self-evaluation prompting, achieving up to 72.7% pass rate on our multi-metric evaluation framework. The results also highlight the importance of prompt engineering in RAG systems, with our custom-prompted agents showing consistent improvements in retrieval accuracy and response quality.
摘要：我们介绍了 RAG Playground，这是一个用于系统评估检索增强生成 (RAG) 系统的开源框架。该框架实现并比较了三种检索方法：简单向量搜索、重新排序和混合向量关键字搜索，并结合使用不同提示策略的 ReAct 代理。我们引入了一个具有新指标的综合评估框架，并提供了在不同检索配置中比较不同语言模型 (Llama 3.1 和 Qwen 2.5) 的实证结果。我们的实验表明，通过混合搜索方法和结构化自我评估提示，性能得到了显著提升，在我们的多指标评估框架上实现了高达 72.7% 的通过率。结果还强调了提示工程在 RAG 系统中的重要性，我们的自定义提示代理在检索准确性和响应质量方面表现出持续的改进。
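Note: the hybrid vector-keyword retrieval mentioned in the abstract can be illustrated with a toy sketch. The abstract does not say how the two signals are combined, so the weighted sum below, the `alpha` parameter, and the helper names are all assumptions made for illustration; they are not the framework's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Fraction of query terms that occur in the document (toy keyword signal)."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_search(query, query_vec, docs, alpha=0.5, top_k=2):
    """docs: list of (text, embedding). Returns top_k (score, text) pairs, best first.

    alpha weights the vector signal against the keyword signal (assumed fusion rule).
    """
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text), text)
        for text, vec in docs
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]

# Hand-made two-dimensional "embeddings", purely for demonstration.
docs = [
    ("hybrid search combines vectors and keywords", [0.9, 0.1]),
    ("reranking reorders retrieved passages", [0.2, 0.8]),
    ("naive vector search uses embeddings only", [0.7, 0.3]),
]
results = hybrid_search("hybrid keywords search", [1.0, 0.0], docs)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Setting `alpha=1.0` reduces this to plain vector search, which makes it easy to compare the two strategies on the same documents.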

Title: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

Authors: Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2412.12359
Pdf URL: https://arxiv.org/pdf/2412.12359
Copy Paste: [[2412.12359]] Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering(https://arxiv.org/abs/2412.12359)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required, inspiring a direction for further optimizing visual instruction tuning. We introduce Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS effectively re-balances the intrinsic modalities throughout the model, where the key idea is to steer visual representations through linear transformations in the visual subspace across each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the composed LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA needs while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Last, we present the LLaVA Steering Factory, an in-house developed platform that enables researchers to quickly customize various MLLMs with component-based architecture for seamlessly integrating state-of-the-art models, and evaluate their intrinsic modality imbalance.
摘要：多模态大型语言模型 (MLLM) 通过将视觉表征集成到大型语言模型 (LLM) 中，显著提高了视觉任务的效率。从 LLM 继承的文本模态为 MLLM 提供了指令跟踪和上下文学习等能力。相比之下，视觉模态通过利用丰富的语义内容、空间信息和基础能力来提高下游任务的性能。这些内在模态在各种视觉任务中协同工作。我们的研究最初揭示了这些模态之间存在持续的不平衡，在视觉指令调整期间，文本通常主导输出生成。这种不平衡发生在使用完全微调和参数高效微调 (PEFT) 方法时。然后我们发现重新平衡这些模态可以显著减少所需的可训练参数数量，从而为进一步优化视觉指令调整提供了方向。我们引入了模态线性表征指导 (MoReS) 来实现这一目标。 MoReS 有效地重新平衡了整个模型的内在模态，其关键思想是通过每个模型层上的视觉子空间中的线性变换来引导视觉表示。为了验证我们的解决方案，我们编写了 LLaVA Steering，这是一套与所提出的 MoReS 方法集成的模型。评估结果表明，组合的 LLaVA Steering 模型所需的可训练参数平均比 LoRA 少 500 倍，同时仍在三个视觉基准和八个视觉问答任务中实现可比性能。最后，我们介绍了 LLaVA Steering Factory，这是一个内部开发的平台，使研究人员能够快速定制具有基于组件的架构的各种 MLLM，以无缝集成最先进的模型，并评估其内在模态不平衡。

Title: Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Authors: Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12391
Pdf URL: https://arxiv.org/pdf/2412.12391
Copy Paste: [[2412.12391]] Efficient Scaling of Diffusion Transformers for Text-to-Image Generation(https://arxiv.org/abs/2412.12391)
Keywords: generation
Abstract: We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention-based DiT model, provides a simpler design and scales more effectively than cross-attention-based DiT variants. Its design also allows straightforward expansion to extra conditions and other modalities. We find that a 2.3B U-ViT model can outperform SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and enhanced long captions improve text-image alignment performance and learning efficiency.
摘要：我们通过执行广泛而严格的消融，包括在高达 6 亿张图像的数据集上训练从 0.3B 到 8B 参数的缩放 DiT，实证研究了用于文本到图像生成的各种扩散变换器 (DiT) 的缩放特性。我们发现，与基于交叉注意的 DiT 变体相比，基于纯自注意的 DiT 模型 U-ViT 提供了更简单的设计并且扩展更有效，这允许直接扩展额外条件和其他模态。我们发现 2.3B U-ViT 模型在受控设置下可以获得比 SDXL UNet 和其他 DiT 变体更好的性能。在数据扩展方面，我们研究了增加数据集大小和增强长标题如何提高文本图像对齐性能和学习效率。

Title: Causally Consistent Normalizing Flow

Authors: Qingyang Zhou, Kangjie Lu, Meng Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12401
Pdf URL: https://arxiv.org/pdf/2412.12401
Copy Paste: [[2412.12401]] Causally Consistent Normalizing Flow(https://arxiv.org/abs/2412.12401)
Keywords: generative
Abstract: Causal inconsistency arises when the underlying causal graphs captured by generative models like Normalizing Flows (NFs) are inconsistent with those specified in causal models like Structural Causal Models (SCMs). This inconsistency can cause unwanted issues, including unfairness. Prior work on achieving causal consistency inevitably compromises model expressiveness by disallowing hidden layers. In this work, we introduce a new approach: Causally Consistent Normalizing Flow (CCNF). To the best of our knowledge, CCNF is the first causally consistent generative model that can approximate any distribution with multiple layers. CCNF relies on two novel constructs: a sequential representation of SCMs and partial causal transformations. These constructs allow CCNF to inherently maintain causal consistency without sacrificing expressiveness. CCNF can handle all forms of causal inference tasks, including interventions and counterfactuals. Through experiments, we show that CCNF outperforms current approaches in causal inference. We also empirically validate the practical utility of CCNF by applying it to real-world datasets and show how CCNF addresses challenges like unfairness effectively.
摘要：当诸如 \textit{规范化流} (NF) 之类的生成模型捕获的底层因果图与诸如 \textit{结构因果模型} (SCM) 之类的因果模型中指定的因果图不一致时，就会出现因果不一致。这种不一致会导致不必要的问题，包括不公平问题。先前为实现因果一致性而进行的工作不可避免地通过禁止隐藏层来损害其模型的表达能力。在这项工作中，我们引入了一种新方法：\textbf{C}ausally \textbf{C}onsistent \textbf{N}ormalizing \textbf{F}low (CCNF)。据我们所知，CCNF 是第一个可以用多层近似任何分布的因果一致的生成模型。CCNF 依赖于两个新颖的构造：SCM 的顺序表示和部分因果转换。这些构造使 CCNF 能够在不牺牲表达能力的情况下固有地保持因果一致性。CCNF 可以处理所有形式的因果推理任务，包括干预和反事实。通过实验，我们表明 CCNF 在因果推理方面优于当前方法。我们还通过将 CCNF 应用于现实世界的数据集来实证验证其实用性，并展示 CCNF 如何有效应对不公平等挑战。

Title: Numerical Pruning for Efficient Autoregressive Models

Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12441
Pdf URL: https://arxiv.org/pdf/2412.12441
Copy Paste: [[2412.12441]] Numerical Pruning for Efficient Autoregressive Models(https://arxiv.org/abs/2412.12441)
Keywords: generation
Abstract: Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high computational costs due to their substantial model size. This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning to improve model efficiency while preserving performance for both language and image generation tasks. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. We also propose a compensation algorithm that recovers the pruned model for better performance. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments. Our experiments show that our method achieves state-of-the-art performance with reduced memory usage and faster generation speeds on GPUs.
摘要：Transformer 已经成为深度学习领域的领先架构，除了语言和图像处理之外，它在多个领域都具有通用性和高效性。然而，由于模型规模庞大，其令人印象深刻的性能往往会带来高昂的计算成本。本文重点研究通过结构权重剪枝来压缩仅基于解码器的 Transformer 的自回归模型，以提高模型效率，同时保持语言和图像生成任务的性能。具体来说，我们提出了一种无需训练的剪枝方法，该方法分别使用牛顿法为 Attention 和 MLP 模块计算数值分数。此外，我们还提出了另一种补偿算法来恢复剪枝后的模型以获得更好的性能。为了验证我们方法的有效性，我们提供了理论支持和大量实验。我们的实验表明，我们的方法在减少内存使用量和加快 GPU 生成速度的同时实现了最佳性能。

Title: LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

Authors: Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12444
Pdf URL: https://arxiv.org/pdf/2412.12444
Copy Paste: [[2412.12444]] LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers(https://arxiv.org/abs/2412.12444)
Keywords: generative
Abstract: Diffusion Transformers have emerged as the preeminent models for a wide array of generative tasks, demonstrating superior performance and efficacy across various applications. The promising results come at the cost of slow inference, as each denoising step requires running the whole transformer model with a large number of parameters. In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps. Furthermore, we show that the lower bound of similarity between outputs at consecutive steps is notably high, and this similarity can be linearly approximated using the inputs. To validate these observations, we propose LazyDiT, a lazy learning framework that efficiently leverages cached results from earlier steps to skip redundant computations. Specifically, we incorporate lazy learning layers into the model, which are trained to maximize laziness, enabling dynamic skipping of redundant computations. Experimental results show that LazyDiT outperforms the DDIM sampler across multiple diffusion transformer models at various resolutions. Furthermore, we implement our method on mobile devices, achieving better performance than DDIM with similar latency.
摘要：扩散变压器已成为各种生成任务的卓越模型，在各种应用中都表现出卓越的性能和功效。令人鼓舞的结果是以缓慢的推理为代价的，因为每个去噪步骤都需要使用大量参数运行整个变压器模型。在本文中，我们表明，在每个扩散步骤中执行模型的完整计算是不必要的，因为可以通过延迟重用前几个步骤的结果来跳过一些计算。此外，我们表明，连续步骤中输出之间的相似度下限非常高，并且可以使用输入线性近似该相似度。为了验证我们的演示，我们提出了 \textbf{LazyDiT}，这是一个惰性学习框架，可以有效地利用早期步骤的缓存结果来跳过冗余计算。具体来说，我们将惰性学习层合并到模型中，经过有效训练以最大化惰性，从而能够动态跳过冗余计算。实验结果表明，LazyDiT 在各种分辨率的多个扩散变压器模型中均优于 DDIM 采样器。此外，我们在移动设备上实现了我们的方法，在相似的延迟下实现了比 DDIM 更好的性能。
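Note: the idea of skipping a computation by reusing a cached result, gated by a linear function of the input, can be sketched as below. This is a toy illustration only: the `LazyLayer` class, its hand-picked weights, and the threshold are hypothetical, and nothing here is trained or taken from the authors' implementation.

```python
class LazyLayer:
    """Wraps an expensive function and reuses its cached output when a gate allows it."""

    def __init__(self, fn, weights, bias, threshold=0.5):
        self.fn = fn                # the expensive computation (e.g. one module)
        self.weights = weights      # linear gate weights (hand-picked here, not learned)
        self.bias = bias
        self.threshold = threshold  # assumed cut-off for "similar enough to skip"
        self.cache = None
        self.calls = 0              # how many times fn was actually executed

    def predicted_similarity(self, x):
        # Linear score of the input, standing in for a learned similarity predictor.
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def __call__(self, x):
        if self.cache is not None and self.predicted_similarity(x) >= self.threshold:
            return self.cache       # skip: reuse the previous step's result
        self.calls += 1
        self.cache = self.fn(x)
        return self.cache

layer = LazyLayer(lambda x: [2 * v for v in x], weights=[1.0, 0.0], bias=0.0)
out1 = layer([0.1, 3.0])  # nothing cached yet, so fn runs
out2 = layer([0.9, 3.1])  # gate score 0.9 >= 0.5, cached output is reused
out3 = layer([0.2, 5.0])  # gate score 0.2 < 0.5, fn runs again
print(layer.calls)
```

The trade-off is visible in `out2`: the reused output is stale with respect to the new input, which is acceptable only when consecutive outputs really are similar.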

Title: Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy

Authors: Aditya Ganeshan, Thibault Groueix, Paul Guerrero, Radomír Měch, Matthew Fisher, Daniel Ritchie
Subjects: cs.CV, cs.AI, cs.GR, cs.HC
Abstract URL: https://arxiv.org/abs/2412.12463
Pdf URL: https://arxiv.org/pdf/2412.12463
Copy Paste: [[2412.12463]] Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy(https://arxiv.org/abs/2412.12463)
Keywords: generative
Abstract: Pattern images are everywhere in the digital and physical worlds, and tools to edit them are valuable. But editing pattern images is tricky: desired edits are often programmatic, that is, structure-aware edits that alter the underlying program which generates the pattern. One could attempt to infer this underlying program, but current methods for doing so struggle with complex images and produce unorganized programs that make editing tedious. In this work, we introduce a novel approach to perform programmatic edits on pattern images. By using a pattern analogy (a pair of simple patterns that demonstrates the intended edit) and a learning-based generative model to execute these edits, our method allows users to intuitively edit patterns. To enable this paradigm, we introduce SplitWeave, a domain-specific language that, combined with a framework for sampling synthetic pattern analogies, enables the creation of a large, high-quality synthetic training dataset. We also present TriFuser, a Latent Diffusion Model (LDM) designed to overcome critical issues that arise when naively deploying LDMs to this task. Extensive experiments on real-world, artist-sourced patterns reveal that our method faithfully performs the demonstrated edit while also generalizing to related pattern styles beyond its training distribution.
摘要：图案图像在数字和物理世界中无处不在，编辑它们的工具很有价值。但编辑图案图像很棘手：所需的编辑通常是程序化的：结构感知编辑会改变生成图案的底层程序。人们可以尝试推断这个底层程序，但目前这样做的方法很难处理复杂的图像，而且会产生无组织的程序，使编辑变得乏味。在这项工作中，我们介绍了一种对图案图像执行程序化编辑的新方法。通过使用图案类比（一对简单的图案来演示预期的编辑）和基于学习的生成模型来执行这些编辑，我们的方法允许用户直观地编辑图案。为了实现这种范式，我们引入了 SplitWeave，这是一种领域特定语言，结合用于采样合成图案类比的框架，可以创建大型、高质量的合成训练数据集。我们还介绍了 TriFuser，这是一种潜在扩散模型 (LDM)，旨在克服在天真地将 LDM 部署到此任务时出现的关键问题。对现实世界中艺术家来源的图案进行的大量实验表明，我们的方法忠实地执行了所演示的编辑，同时还推广到其训练分布之外的相关图案样式。

Title: Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues

Authors: Yan Zhang, Gangyan Zeng, Huawen Shen, Daiqing Wu, Yu Zhou, Can Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12502
Pdf URL: https://arxiv.org/pdf/2412.12502
Copy Paste: [[2412.12502]] Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues(https://arxiv.org/abs/2412.12502)
Keywords: generative
Abstract: Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning over textual and visual information in a given video. Inspired by the development of TextVQA in the image domain, existing Video TextVQA approaches leverage a language model (e.g. T5) to process multiple text-rich frames and generate answers auto-regressively. Nevertheless, the spatio-temporal relationships among visual entities (including scene text and objects) are disrupted and models are susceptible to interference from unrelated information, resulting in irrational reasoning and inaccurate answering. To tackle these challenges, we propose the TEA (short for "Track thE Answer") method that better extends the generative TextVQA framework from image to video. TEA recovers the spatio-temporal relationships in a complementary way and incorporates OCR-aware clues to enhance the quality of reasoning questions. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. TEA outperforms existing TextVQA methods, video-language pretraining methods and video large language models by great margins.
摘要：基于视频文本的视觉问答 (Video TextVQA) 是一项实用任务，旨在通过联合推理给定视频中的文本和视觉信息来回答问题。受图像域中 TextVQA 发展的启发，现有的 Video TextVQA 方法利用语言模型 (例如 T5) 来处理富含文本的多帧并自回归生成答案。然而，视觉实体 (包括场景文本和对象) 之间的时空关系将被破坏，模型容易受到不相关信息的干扰，导致不合理的推理和不准确的回答。为了应对这些挑战，我们提出了 TEA (代表 ``\textbf{T}rack th\textbf{E} \textbf{A}nswer'') 方法，该方法更好地将生成式 TextVQA 框架从图像扩展到视频。TEA 以互补的方式恢复时空关系，并结合 OCR 感知线索来提高推理问题的质量。在多个公开的视频 TextVQA 数据集上进行的大量实验验证了我们框架的有效性和泛化能力。TEA 的表现远胜于现有的 TextVQA 方法、视频语言预训练方法和视频大型语言模型。

Title: Invisible Watermarks: Attacks and Robustness

Authors: Dongjun Hwang, Sungwon Woo, Tom Gao, Raymond Luo, Sunghwan Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12511
Pdf URL: https://arxiv.org/pdf/2412.12511
Copy Paste: [[2412.12511]] Invisible Watermarks: Attacks and Robustness(https://arxiv.org/abs/2412.12511)
Keywords: generative
Abstract: As Generative AI continues to become more accessible, the case for robust detection of generated images in order to combat misinformation is stronger than ever. Invisible watermarking methods act as identifiers of generated content, embedding image- and latent-space messages that are robust to many forms of perturbations. The majority of current research investigates full-image attacks against images with a single watermarking method applied. We introduce novel improvements to watermarking robustness as well as to minimizing image-quality degradation during attacks. Firstly, we examine the application of both image-space and latent-space watermarking methods on a single image, where we propose a custom watermark remover network which preserves one of the watermarking modalities while completely removing the other during decoding. Then, we investigate localized blurring attacks (LBA) on watermarked images based on the GradCAM heatmap acquired from the watermark decoder in order to reduce the amount of degradation to the target image. Our evaluation suggests that 1) implementing the watermark remover model to preserve one of the watermark modalities when decoding the other modality slightly improves on the baseline performance, and that 2) LBA degrades the image significantly less compared to uniform blurring of the entire image. Code is available at: this https URL
摘要：随着生成式人工智能变得越来越普及，为了打击错误信息，对生成的图像进行稳健检测的必要性比以往任何时候都要强烈。隐形水印方法充当生成内容的标识符，嵌入对多种形式的扰动都具有鲁棒性的图像和潜在空间消息。当前大多数研究都研究了对应用单一水印方法的图像进行全图像攻击。我们引入了新的水印鲁棒性改进，并最大限度地减少了攻击期间图像质量的下降。首先，我们研究了图像空间和潜在空间水印方法在单个图像上的应用，其中我们提出了一个自定义水印去除器网络，它在解码过程中保留其中一种水印模式，同时完全去除另一种。然后，我们基于从水印解码器获取的 GradCAM 热图研究对水印图像的局部模糊攻击 (LBA)，以减少目标图像的质量下降。我们的评估表明：1）实施水印去除器模型以在解码另一种模态时保留其中一种水印模态，可以略微提高基线性能；2）与整个图像的均匀模糊相比，LBA 对图像的降级要小得多。代码可从以下网址获取：此 https URL

Title: Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling

Authors: Iman Khazrak, Shakhnoza Takhirova, Mostafa M. Rezaee, Mehrdad Yadollahi, Robert C. Green II, Shuteng Niu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12532
Pdf URL: https://arxiv.org/pdf/2412.12532
Copy Paste: [[2412.12532]] Addressing Small and Imbalanced Medical Image Datasets Using Generative Models: A Comparative Study of DDPM and PGGANs with Random and Greedy K Sampling(https://arxiv.org/abs/2412.12532)
Keywords: generative
Abstract: The development of accurate medical image classification models is often constrained by privacy concerns and data scarcity for certain conditions, leading to small and imbalanced datasets. To address these limitations, this study explores the use of generative models, such as Denoising Diffusion Probabilistic Models (DDPM) and Progressive Growing Generative Adversarial Networks (PGGANs), for dataset augmentation. The research introduces a framework to assess the impact of synthetic images generated by DDPM and PGGANs on the performance of four models: a custom CNN, Untrained VGG16, Pretrained VGG16, and Pretrained ResNet50. Experiments were conducted using Random Sampling and Greedy K Sampling to create small, imbalanced datasets. The synthetic images were evaluated using Frechet Inception Distance (FID) and compared to original datasets through classification metrics. The results show that DDPM consistently generated more realistic images with lower FID scores and significantly outperformed PGGANs in improving classification metrics across all models and datasets. Incorporating DDPM-generated images into the original datasets increased accuracy by up to 6%, enhancing model robustness and stability, particularly in imbalanced scenarios. Random Sampling demonstrated superior stability, while Greedy K Sampling offered diversity at the cost of higher FID scores. This study highlights the efficacy of DDPM in augmenting small, imbalanced medical image datasets, improving model performance by balancing the dataset and expanding its size.
摘要：准确的医学图像分类模型的开发通常受到隐私问题和某些条件下的数据稀缺性的制约，导致数据集较小且不平衡。为了解决这些限制，本研究探索了使用生成模型（例如去噪扩散概率模型 (DDPM) 和渐进式增长生成对抗网络 (PGGAN)）进行数据集增强。该研究引入了一个框架来评估 DDPM 和 PGGAN 生成的合成图像对四种模型性能的影响：自定义 CNN、未训练的 VGG16、预训练的 VGG16 和预训练的 ResNet50。使用随机抽样和贪婪 K 抽样进行实验以创建小型、不平衡的数据集。使用 Frechet 初始距离 (FID) 评估合成图像，并通过分类指标将其与原始数据集进行比较。结果表明，DDPM 始终以较低的 FID 分数生成更逼真的图像，并且在改进所有模型和数据集的分类指标方面明显优于 PGGAN。将 DDPM 生成的图像合并到原始数据集中可将准确率提高高达 6%，从而增强模型的稳健性和稳定性，尤其是在不平衡场景中。随机采样表现出卓越的稳定性，而贪婪 K 采样则以更高的 FID 分数为代价提供了多样性。这项研究强调了 DDPM 在增强小型、不平衡的医学图像数据集方面的有效性，通过平衡数据集并扩大其大小来提高模型性能。
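Note: for readers unfamiliar with the metric, the standard definition of the Frechet Inception Distance between real and generated image sets is given below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features for real and generated images, respectively. The abstract does not state which feature extractor settings the study used.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one, which is why the abstract treats lower FID scores as evidence of more realistic images.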

Title: Stiefel Flow Matching for Moment-Constrained Structure Elucidation

Authors: Austin Cheng, Alston Lo, Kin Long Kelvin Lee, Santiago Miret, Alán Aspuru-Guzik
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2412.12540
Pdf URL: https://arxiv.org/pdf/2412.12540
Copy Paste: [[2412.12540]] Stiefel Flow Matching for Moment-Constrained Structure Elucidation(https://arxiv.org/abs/2412.12540)
Keywords: generative
Abstract: Molecular structure elucidation is a fundamental step in understanding chemical phenomena, with applications in identifying molecules in natural products, lab syntheses, forensic samples, and the interstellar medium. We consider the task of predicting a molecule's all-atom 3D structure given only its molecular formula and moments of inertia, motivated by the ability of rotational spectroscopy to measure these moments. While existing generative models can conditionally sample 3D structures with approximately correct moments, this soft conditioning fails to leverage the many digits of precision afforded by experimental rotational spectroscopy. To address this, we first show that the space of $n$-atom point clouds with a fixed set of moments of inertia is embedded in the Stiefel manifold $\mathrm{St}(n, 4)$. We then propose Stiefel Flow Matching as a generative model for elucidating 3D structure under exact moment constraints. Additionally, we learn simpler and shorter flows by finding approximate solutions for equivariant optimal transport on the Stiefel manifold. Empirically, enforcing exact moment constraints allows Stiefel Flow Matching to achieve higher success rates and faster sampling than Euclidean diffusion models, even on high-dimensional manifolds corresponding to large molecules in the GEOM dataset.
摘要：分子结构解析是理解化学现象的基本步骤，可用于识别天然产物、实验室合成、法医样本和星际介质中的分子。我们考虑在仅给定分子式和转动惯量的情况下预测分子的全原子三维结构，其动机是旋转光谱能够测量这些矩。虽然现有的生成模型可以有条件地对具有近似正确矩的三维结构进行采样，但这种软条件无法利用实验旋转光谱所提供的多位精度。为了解决这个问题，我们首先证明具有一组固定转动惯量的 $n$ 原子点云空间嵌入在 Stiefel 流形 $\mathrm{St}(n, 4)$ 中。然后，我们提出 Stiefel 流匹配作为在精确矩约束下解析三维结构的生成模型。此外，我们通过在 Stiefel 流形上寻找等变最优传输的近似解来学习更简单、更短的流动。从经验上讲，强制实施精确矩约束可使 Stiefel 流匹配实现比欧几里德扩散模型更高的成功率和更快的采样速度，即使在 GEOM 数据集中与大分子相对应的高维流形上也是如此。
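Note: the Stiefel manifold $\mathrm{St}(n, k)$ is the set of $n \times k$ real matrices $X$ with orthonormal columns, i.e. $X^\top X = I_k$. The sketch below only illustrates that membership condition for $k = 4$ using Gram-Schmidt orthonormalization. It does not reproduce the paper's embedding of point clouds with fixed moments of inertia, nor the flow matching itself; the example matrix and helper names are made up for illustration.

```python
import math

def gram_schmidt(columns):
    """Orthonormalize a list of k linearly independent column vectors in R^n."""
    basis = []
    for v in columns:
        w = list(v)
        for q in basis:
            # Remove the component of w along the already-accepted direction q.
            proj = sum(a * b for a, b in zip(w, q))
            w = [a - proj * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        basis.append([a / norm for a in w])
    return basis

def on_stiefel(columns, tol=1e-9):
    """True if the columns are orthonormal, i.e. X^T X equals the identity."""
    k = len(columns)
    for i in range(k):
        for j in range(k):
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True

# Four arbitrary columns in R^5 (n = 5, k = 4), chosen to be linearly independent.
raw = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 1.0, 2.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]
projected = gram_schmidt(raw)
print(on_stiefel(raw), on_stiefel(projected))
```

A generative model that stays on this manifold satisfies the orthonormality constraint exactly at every step, which is the contrast with soft conditioning that the abstract draws.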

Title: Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration

Authors: Xinlong Cheng, Tiantian Cao, Guoan Cheng, Bangxuan Huang, Xinghan Tian, Ye Wang, Xiaoyu He, Weixin Li, Tianfan Xue, Xuan Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12550
Pdf URL: https://arxiv.org/pdf/2412.12550
Copy Paste: [[2412.12550]] Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration(https://arxiv.org/abs/2412.12550)
Keywords: restoration
Abstract: In this work, we address the limitations of denoising diffusion models (DDMs) in image restoration tasks, particularly the shape and color distortions that can compromise image quality. While DDMs have demonstrated promising performance in many applications such as text-to-image synthesis, their effectiveness in image restoration is often hindered by shape and color distortions. We observe that these issues arise from inconsistencies between the training and testing data used by DDMs. Based on this observation, we propose a novel training method, named data-consistent training, which allows the DDMs to access images with accumulated errors during training, thereby ensuring that the model learns to correct these errors. Experimental results show that, across five image restoration tasks, our method achieves significant improvements over state-of-the-art methods while effectively minimizing distortions and preserving image fidelity.
摘要：在本文中，我们解决了去噪扩散模型 (DDM) 在图像恢复任务中的局限性，特别是形状和颜色失真可能会损害图像质量。虽然 DDM 在文本到图像合成等许多应用中都表现出色，但它们在图像恢复中的有效性往往受到形状和颜色失真的阻碍。我们观察到这些问题源于 DDM 使用的训练数据和测试数据之间的不一致。根据我们的观察，我们提出了一种名为数据一致性训练的新训练方法，该方法允许 DDM 访问在训练期间累积错误的图像，从而确保模型学会纠正这些错误。实验结果表明，在五个图像恢复任务中，我们的方法比最先进的方法有显着的改进，同时有效地最大限度地减少了失真并保持了图像保真度。

Title: SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps

Authors: Sparsh Pekhale, Rakshith Sathish, Sathisha Basavaraju, Divya Sharma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12552
Pdf URL: https://arxiv.org/pdf/2412.12552
Copy Paste: [[2412.12552]] SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps(https://arxiv.org/abs/2412.12552)
Keywords: generation
Abstract: Land-use and land cover (LULC) analysis is critical in remote sensing, with wide-ranging applications across diverse fields such as agriculture, utilities, and urban planning. However, automating LULC map generation using machine learning is rendered challenging due to noisy labels. Typically, the ground truths (e.g. ESRI LULC, MapBioMass) have noisy labels that hamper the model's ability to learn to accurately classify the pixels. Further, these erroneous labels can significantly distort the performance metrics of a model, leading to misleading evaluations. Traditionally, the ambiguous labels are rectified using unsupervised algorithms. These algorithms struggle not only with scalability but also with generalization across different geographies. To overcome these challenges, we propose a zero-shot approach using the foundation model, Segment Anything Model (SAM), to automatically delineate different land parcels/regions and leverage them to relabel the unsure pixels by using the local label statistics within each detected region. We achieve a significant reduction in label noise and an improvement in the performance of the downstream segmentation model by $\approx 5\%$ when trained with denoised labels.
摘要：土地利用和土地覆盖 (LULC) 分析在遥感中至关重要，广泛应用于农业、公用事业和城市规划等不同领域。然而，由于标签噪声，使用机器学习自动生成 LULC 地图变得具有挑战性。通常，地面实况（例如 ESRI LULC、MapBioMass）具有噪声标签，这会妨碍模型学习准确分类像素的能力。此外，这些错误的标签会严重扭曲模型的性能指标，导致评估误导。传统上，使用无监督算法来纠正模糊标签。这些算法不仅在可扩展性方面存在困难，而且在跨不同地理区域的泛化方面也存在困难。为了克服这些挑战，我们提出了一种零样本方法，使用基础模型 Segment Anything Model (SAM) 自动划定不同的地块/区域，并利用它们通过使用每个检测到的区域内的本地标签统计数据重新标记不确定的像素。当使用去噪标签进行训练时，我们显著降低了标签噪声，并且下游分割模型的性能提高了$\approx 5\%$。
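Note: the relabeling step described in the abstract (use the label statistics inside each detected region to relabel unsure pixels) can be sketched as a per-region majority vote. Majority voting is an assumption about what "local label statistics" means, and the `UNSURE` marker and function name are hypothetical. In practice the region ids would come from SAM masks; here they are hard-coded.

```python
from collections import Counter

UNSURE = -1  # hypothetical marker for pixels whose label is not trusted

def relabel_by_region(labels, regions):
    """labels, regions: flat lists of equal length, one entry per pixel.

    Each unsure pixel receives the most common trusted label in its region.
    Pixels in regions with no trusted label are left unchanged.
    """
    votes = {}
    for lab, reg in zip(labels, regions):
        if lab != UNSURE:
            votes.setdefault(reg, Counter())[lab] += 1
    out = []
    for lab, reg in zip(labels, regions):
        if lab == UNSURE and reg in votes:
            out.append(votes[reg].most_common(1)[0][0])
        else:
            out.append(lab)
    return out

labels = [1, 1, UNSURE, 2, UNSURE, 2, 2]
regions = [0, 0, 0, 1, 1, 1, 1]
cleaned = relabel_by_region(labels, regions)
print(cleaned)
```

Because the vote is computed independently per region, the step needs no training, which is consistent with the zero-shot framing in the abstract.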

Title: ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12571
Pdf URL: https://arxiv.org/pdf/2412.12571
Copy Paste: [[2412.12571]] ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers(https://arxiv.org/abs/2412.12571)
Keywords: generation
Abstract: Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench (arXiv:2412.11767), comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in adapting zero-shot to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at this https URL
摘要：最近的研究 arXiv:2410.15027 arXiv:2410.23775 强调了预训练扩散变压器 (DiT) 固有的上下文生成功能，使它们能够无缝适应各种视觉任务，而无需或只需进行很少的架构修改。这些功能是通过将跨多个输入和目标图像的自注意力标记与分组和屏蔽生成管道相结合来解锁的。在此基础上，我们提出了 ChatDiT，这是一个零样本、通用且交互式的视觉生成框架，它利用原始形式的预训练扩散变压器，无需额外的调整、适配器或修改。用户可以与 ChatDiT 交互以创建交错的文本图像文章、多页图画书、编辑图像、设计 IP 衍生品或开发角色设计设置，所有这些都通过一个或多个对话回合中的自由形式的自然语言完成。 ChatDiT 的核心是采用多智能体系统，该系统由三个关键组件组成：解释用户上传的图像和指令的指令解析代理、设计单步或多步生成操作的策略规划代理，以及使用上下文扩散变换器工具包执行这些操作的执行代理。我们在 IDEA-Bench arXiv:2412.11767 上对 ChatDiT 进行了全面评估，其中包括 100 个真实世界的设计任务和 275 个案例，这些案例具有不同的指令和不同数量的输入和目标图像。尽管 ChatDiT 简单且无需训练，但它超越了所有竞争对手，包括那些专门设计和训练大量多任务数据集的竞争对手。我们进一步确定了预训练 DiT 在零样本适应任务方面的主要局限性。我们在此 https URL 发布所有代码、代理、结果和中间输出，以促进进一步研究

Title: A Simple and Efficient Baseline for Zero-Shot Generative Classification

Authors: Zipeng Qi, Buhua Liu, Shiyan Zhang, Bao Li, Zhiqiang Xu, Haoyi Xiong, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12594
Pdf URL: https://arxiv.org/pdf/2412.12594
Copy Paste: [[2412.12594]] A Simple and Efficient Baseline for Zero-Shot Generative Classification(https://arxiv.org/abs/2412.12594)
Keywords: generative
Abstract: Large diffusion models have become mainstream generative models in both academic studies and industrial AIGC applications. Recently, a number of works have further explored how to employ the power of large diffusion models as zero-shot classifiers. While recent zero-shot diffusion-based classifiers have advanced performance on benchmark datasets, they still suffer badly from extremely slow classification speed (e.g., ~1000 seconds to classify a single image on ImageNet). This extremely slow classification speed largely rules out practical applications of existing zero-shot diffusion-based classifiers. In this paper, we propose an embarrassingly simple and efficient zero-shot Gaussian Diffusion Classifier (GDC) built on pretrained text-to-image diffusion models and DINOv2. The proposed GDC not only surpasses previous zero-shot diffusion-based classifiers by over 10 points on ImageNet (from 61.40% to 71.44%), but also classifies a single ImageNet image more than 30000 times faster (from 1000 to 0.03 seconds). Additionally, it provides a probabilistic interpretation of the results. Our extensive experiments further demonstrate that GDC can achieve highly competitive zero-shot classification performance over various datasets and can promisingly self-improve with stronger diffusion models. To the best of our knowledge, the proposed GDC is the first zero-shot diffusion-based classifier that exhibits both competitive accuracy and practical efficiency.

Title: OpenViewer: Openness-Aware Multi-View Learning

Authors: Shide Du, Zihan Fang, Yanchao Tan, Changwei Wang, Shiping Wang, Wenzhong Guo
Subjects: cs.CV, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2412.12596
Pdf URL: https://arxiv.org/pdf/2412.12596
Copy Paste: [[2412.12596]] OpenViewer: Openness-Aware Multi-View Learning(https://arxiv.org/abs/2412.12596)
Keywords: generation
Abstract: Multi-view learning methods leverage multiple data sources to enhance perception by mining correlations across views, typically relying on predefined categories. However, deploying these models in real-world scenarios presents two primary openness challenges: 1) Lack of Interpretability: the integration mechanisms of multi-view data in existing black-box models remain poorly explained; 2) Insufficient Generalization: most models are not adapted to multi-view scenarios involving unknown categories. To address these challenges, we propose OpenViewer, an openness-aware multi-view learning framework with theoretical support. This framework begins with a Pseudo-Unknown Sample Generation Mechanism to efficiently simulate open multi-view environments and adapt in advance to potential unknown samples. Subsequently, we introduce an Expression-Enhanced Deep Unfolding Network to intuitively promote interpretability by systematically constructing functional prior-mapping modules and effectively providing a more transparent integration mechanism for multi-view data. Additionally, we establish a Perception-Augmented Open-Set Training Regime to significantly enhance generalization by precisely boosting confidences for known categories and carefully suppressing inappropriate confidences for unknown ones. Experimental results demonstrate that OpenViewer effectively addresses openness challenges while ensuring recognition performance for both known and unknown samples. The code is released at this https URL.

Title: RDPI: A Refine Diffusion Probability Generation Method for Spatiotemporal Data Imputation

Authors: Zijin Liu, Xiang Zhao, You Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12642
Pdf URL: https://arxiv.org/pdf/2412.12642
Copy Paste: [[2412.12642]] RDPI: A Refine Diffusion Probability Generation Method for Spatiotemporal Data Imputation(https://arxiv.org/abs/2412.12642)
Keywords: generation, quality assessment
Abstract: Spatiotemporal data imputation plays a crucial role in various fields such as traffic flow monitoring, air quality assessment, and climate prediction. However, spatiotemporal data collected by sensors often suffer from temporal incompleteness, and the sparse and uneven distribution of sensors leads to missing data in the spatial dimension. Among existing methods, autoregressive approaches are prone to error accumulation, while simple conditional diffusion models fail to adequately capture the spatiotemporal relationships between observed and missing data. To address these issues, we propose a novel two-stage Refined Diffusion Probability Imputation (RDPI) framework based on an initial network and a conditional diffusion model. In the initial stage, deterministic imputation methods are used to generate preliminary estimates of the missing data. In the refinement stage, residuals are treated as the diffusion target, and observed values are innovatively incorporated into the forward process. This results in a conditional diffusion model better suited for spatiotemporal data imputation, bridging the gap between the preliminary estimates and the true values. Experiments on multiple datasets demonstrate that RDPI not only achieves state-of-the-art imputation accuracy but also significantly reduces sampling computational costs.

Title: A Two-Fold Patch Selection Approach for Improved 360-Degree Image Quality Assessment

Authors: Abderrezzaq Sendjasni, Seif-Eddine Benkabou, Mohamed-Chaker Larabi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.12667
Pdf URL: https://arxiv.org/pdf/2412.12667
Copy Paste: [[2412.12667]] A Two-Fold Patch Selection Approach for Improved 360-Degree Image Quality Assessment(https://arxiv.org/abs/2412.12667)
Keywords: quality assessment
Abstract: This article presents a novel approach to improving the accuracy of 360-degree perceptual image quality assessment (IQA) through a two-fold patch selection process. Our methodology combines visual patch selection with embedding similarity-based refinement. The first stage focuses on selecting patches from 360-degree images using three distinct sampling methods to ensure comprehensive coverage of visual content for IQA. The second stage, which is the core of our approach, employs an embedding similarity-based selection process to filter and prioritize the most informative patches based on their embedding similarity distances. This dual selection mechanism ensures that the training data is both relevant and informative, enhancing the model's learning efficiency. Extensive experiments and statistical analyses using three distance metrics across three benchmark datasets validate the effectiveness of our selection algorithm. The results highlight its potential to deliver robust and accurate 360-degree IQA, with performance gains of up to 4.5% in accuracy and monotonicity of quality score prediction, while using only 40% to 50% of the training patches. These improvements are consistent across various configurations and evaluation metrics, demonstrating the strength of the proposed method. The code for the selection process is available at: this https URL.

Title: ShiftedBronzes: Benchmarking and Analysis of Domain Fine-Grained Classification in Open-World Settings

Authors: Rixin Zhou, Honglin Pang, Qian Zhang, Ruihua Qi, Xi Yang, Chuntao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12683
Pdf URL: https://arxiv.org/pdf/2412.12683
Copy Paste: [[2412.12683]] ShiftedBronzes: Benchmarking and Analysis of Domain Fine-Grained Classification in Open-World Settings(https://arxiv.org/abs/2412.12683)
Keywords: generation
Abstract: In real-world applications across specialized domains, addressing complex out-of-distribution (OOD) challenges is a common and significant concern. In this study, we concentrate on the task of fine-grained bronze ware dating, a critical aspect in the study of ancient Chinese history, and develop a benchmark dataset named ShiftedBronzes. By extensively expanding the bronze Ding dataset, ShiftedBronzes incorporates two types of bronze ware data and seven types of OOD data, which exhibit distribution shifts commonly encountered in bronze ware dating scenarios. We conduct benchmarking experiments on ShiftedBronzes and five commonly used general OOD datasets, employing a variety of widely adopted post-hoc, pre-trained Vision Large Model (VLM)-based and generation-based OOD detection methods. Through analysis of the experimental results, we validate previous conclusions regarding post-hoc, VLM-based, and generation-based methods, while also highlighting their distinct behaviors on specialized datasets. These findings underscore the unique challenges of applying general OOD detection methods to domain-specific tasks such as bronze ware dating. We hope that the ShiftedBronzes benchmark provides valuable insights into both the field of bronze ware dating and the development of OOD detection methods. The dataset and associated code will be available later.

Title: Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

Authors: Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony Q. S. Quek, Seong-Lyun Kim
Subjects: cs.LG, cs.DC, cs.IT, cs.NI, eess.SP
Abstract URL: https://arxiv.org/abs/2412.12687
Pdf URL: https://arxiv.org/pdf/2412.12687
Copy Paste: [[2412.12687]] Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models(https://arxiv.org/abs/2412.12687)
Keywords: generation
Abstract: This paper studies a hybrid language model (HLM) architecture that integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network. The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM. While this approach ensures alignment between the vocabulary distributions of the SLM and LLM, it suffers from low token throughput due to uplink transmission and the computation costs of running both language models. To address this, we propose a novel HLM structure coined Uncertainty-aware HLM (U-HLM), wherein the SLM locally measures its output uncertainty, and skips both uplink transmissions and LLM operations for tokens that are likely to be accepted. This opportunistic skipping is enabled by our empirical finding of a linear correlation between the SLM's uncertainty and the LLM's rejection probability. We analytically derive the uncertainty threshold and evaluate its expected risk of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM computation by 45.93%, while achieving up to 97.54% of the LLM's inference accuracy and 2.54x faster token throughput than HLM without skipping.
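The skipping rule in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the uncertainty measure (one minus the SLM's top token probability), the threshold value, and all function names are assumptions chosen for illustration, whereas the paper derives its own threshold analytically.

```python
import numpy as np

def slm_uncertainty(probs):
    """Hypothetical uncertainty measure: 1 minus the top token probability."""
    return 1.0 - float(np.max(probs))

def skip_uplink(probs, threshold):
    """True when the token is accepted locally (no uplink, no LLM call)."""
    return slm_uncertainty(probs) <= threshold

def skip_rate(token_probs, threshold):
    """Fraction of tokens for which uplink and LLM verification are skipped."""
    skipped = sum(skip_uplink(p, threshold) for p in token_probs)
    return skipped / len(token_probs)

# Two toy vocabulary distributions over four tokens (made-up numbers).
confident = np.array([0.97, 0.01, 0.01, 0.01])
unsure = np.array([0.40, 0.30, 0.20, 0.10])

rate = skip_rate([confident, unsure, confident, unsure], threshold=0.1)
print(f"skipped {rate:.0%} of tokens")
```

Tokens with low uncertainty stay on the device, while uncertain ones fall back to the regular upload-and-verify path, which is the trade-off between throughput and rejection risk that the abstract describes.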

Title: PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model

Authors: Yuqing Wang, Zhongling Huang, Shuxin Yang, Hao Tang, Xiaolan Qiu, Junwei Han, Dingwen Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12737
Pdf URL: https://arxiv.org/pdf/2412.12737
Copy Paste: [[2412.12737]] PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model(https://arxiv.org/abs/2412.12737)
Keywords: generation
Abstract: PolSAR data presents unique challenges due to its rich and complex characteristics. Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used. However, these formats often face issues related to usability, interpretability, and data integrity. Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively. To address these issues, we propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy. PolSAM introduces Microwave Vision Data (MVD), a lightweight and interpretable data representation derived from polarimetric decomposition and semantic correlations. We propose two key components: the Feature-Level Fusion Prompt (FFP), which fuses visual tokens from pseudo-colored SAR images and MVD to address modality incompatibility in the frozen SAM encoder, and the Semantic-Level Fusion Prompt (SFP), which refines sparse and dense segmentation prompts using semantic information. Experimental results on the PhySAR-Seg datasets demonstrate that PolSAM significantly outperforms existing SAM-based and multimodal fusion models, improving segmentation accuracy, reducing data storage, and accelerating inference. The source code and datasets will be made publicly available at this https URL.

Title: Progressive Monitoring of Generative Model Training Evolution

Authors: Vidya Prasad, Anna Vilanova, Nicola Pezzotti
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.12755
Pdf URL: https://arxiv.org/pdf/2412.12755
Copy Paste: [[2412.12755]] Progressive Monitoring of Generative Model Training Evolution(https://arxiv.org/abs/2412.12755)
Keywords: generative
Abstract: While deep generative models (DGMs) have gained popularity, their susceptibility to biases and other inefficiencies that lead to undesirable outcomes remains an issue. With their growing complexity, there is a critical need for early detection of issues to achieve desired results and optimize resources. Hence, we introduce a progressive analysis framework to monitor the training process of DGMs. Our method utilizes dimensionality reduction techniques to facilitate the inspection of latent representations, the generated and real distributions, and their evolution across training iterations. This monitoring allows us to pause and fix the training method if the representations or distributions progress undesirably. This approach allows for the analysis of a model's training dynamics and the timely identification of biases and failures, minimizing computational loads. We demonstrate how our method supports identifying and mitigating biases early in training a Generative Adversarial Network (GAN) and improving the quality of the generated data distribution.

Title: Guided and Variance-Corrected Fusion with One-shot Style Alignment for Large-Content Image Generation

Authors: Shoukun Sun, Min Xian, Tiankai Yao, Fei Xu, Luca Capriotti
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12771
Pdf URL: https://arxiv.org/pdf/2412.12771
Copy Paste: [[2412.12771]] Guided and Variance-Corrected Fusion with One-shot Style Alignment for Large-Content Image Generation(https://arxiv.org/abs/2412.12771)
Keywords: generation
Abstract: Producing large images using small diffusion models is gaining increasing popularity, as the cost of training large models could be prohibitive. A common approach involves jointly generating a series of overlapped image patches and obtaining large images by merging adjacent patches. However, results from existing methods often exhibit obvious artifacts, e.g., seams and inconsistent objects and styles. To address these issues, we propose Guided Fusion (GF), which mitigates the negative impact from distant image regions by applying a weighted average to the overlapping regions. Moreover, we propose Variance-Corrected Fusion (VCF), which corrects data variance after averaging, generating more accurate fusion for the Denoising Diffusion Probabilistic Model. Furthermore, we propose a one-shot Style Alignment (SA), which generates a coherent style for large images by adjusting the initial input noise without adding extra computational burden. Extensive experiments demonstrate that the proposed fusion methods improve the quality of the generated image significantly. As a plug-and-play module, the proposed method can be widely applied to enhance other fusion-based methods for large image generation.
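Why averaging needs a variance correction can be seen from a short numerical sketch. It rests on an assumption that is not spelled out in the abstract: the overlapping patches carry independent, zero-mean, unit-variance noise. Under that assumption, a weighted average with weights w_i summing to one has variance sum(w_i^2), which is below one, so dividing by sqrt(sum(w_i^2)) restores unit variance. The function name `fuse` and the weights are illustrative, and the paper's actual correction may differ.

```python
import numpy as np

def fuse(samples, weights, variance_corrected=True):
    """Weighted average of overlapping noisy patches along axis 0.

    Assumes independent zero-mean, unit-variance noise in each patch.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalise weights to sum to 1
    fused = np.tensordot(w, samples, axes=1)     # sum_i w_i * samples[i]
    if variance_corrected:
        fused = fused / np.sqrt(np.sum(w ** 2))  # undo the variance shrinkage
    return fused

rng = np.random.default_rng(0)
samples = rng.standard_normal((2, 200_000))      # two overlapping "patches"

plain = fuse(samples, [0.7, 0.3], variance_corrected=False)
corrected = fuse(samples, [0.7, 0.3])

print(plain.var(), corrected.var())
```

With weights 0.7 and 0.3 the plain average has variance close to 0.7^2 + 0.3^2 = 0.58, while the rescaled version stays close to 1, the noise level a DDPM sampler expects at each step.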

Title: Rethinking Diffusion-Based Image Generators for Fundus Fluorescein Angiography Synthesis on Limited Data

Authors: Chengzhou Yu (South China University of Technology), Huihui Fang (Pazhou Laboratory), Hongqiu Wang (The Hong Kong University of Science and Technology (Guangzhou)), Ting Deng (South China University of Technology), Qing Du (South China University of Technology), Yanwu Xu (South China University of Technology), Weihua Yang (Shenzhen Eye Hospital)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12778
Pdf URL: https://arxiv.org/pdf/2412.12778
Copy Paste: [[2412.12778]] Rethinking Diffusion-Based Image Generators for Fundus Fluorescein Angiography Synthesis on Limited Data(https://arxiv.org/abs/2412.12778)
Keywords: generative
Abstract: Fundus imaging is a critical tool in ophthalmology, with different imaging modalities offering unique advantages. For instance, fundus fluorescein angiography (FFA) can accurately identify eye diseases. However, traditional invasive FFA involves the injection of sodium fluorescein, which can cause discomfort and risks. Generating corresponding FFA images from non-invasive fundus images holds significant practical value but also presents challenges. First, limited datasets constrain the performance and effectiveness of models. Second, previous studies have primarily focused on generating FFA for single diseases or single modalities, often resulting in poor performance for patients with various ophthalmic conditions. To address these issues, we propose a novel latent diffusion model-based framework, Diffusion, which introduces a fine-tuning protocol to overcome the challenge of limited medical data and unleash the generative capabilities of diffusion models. Furthermore, we designed a new approach to tackle the challenges of generating across different modalities and disease types. On limited datasets, our framework achieves state-of-the-art results compared to existing methods, offering significant potential to enhance ophthalmic diagnostics and patient care. Our code will be released soon to support further research in this field.

Title: RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning

Authors: Kanghoon Yoon, Kibum Kim, Jaehyung Jeon, Yeonjun In, Donghyun Kim, Chanyoung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12788
Pdf URL: https://arxiv.org/pdf/2412.12788
Copy Paste: [[2412.12788]] RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning(https://arxiv.org/abs/2412.12788)
Keywords: generation
Abstract: Scene Graph Generation (SGG) research has suffered from two fundamental challenges: the long-tailed predicate distribution and semantic ambiguity between predicates. These challenges lead to a bias towards head predicates in SGG models, favoring dominant general predicates while overlooking fine-grained predicates. In this paper, we address the challenges of SGG by framing it as a multi-label classification problem with partial annotation, where relevant labels of fine-grained predicates are missing. Under the new frame, we propose Retrieval-Augmented Scene Graph Generation (RA-SGG), which identifies potential instances to be multi-labeled and enriches the single-label with multi-labels that are semantically similar to the original label by retrieving relevant samples from our established memory bank. Based on augmented relations (i.e., discovered multi-labels), we apply multi-prototype learning to train our SGG model. Several comprehensive experiments have demonstrated that RA-SGG outperforms state-of-the-art baselines by up to 3.6% on VG and 5.9% on GQA, particularly in terms of F@K, showing that RA-SGG effectively alleviates the issue of biased prediction caused by the long-tailed distribution and semantic ambiguity of predicates.

Title: Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

Authors: Shiping Ge, Qiang Chen, Zhiwei Jiang, Yafeng Yin, Liu Qin, Ziyao Chen, Qing Gu
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2412.12791
Pdf URL: https://arxiv.org/pdf/2412.12791
Copy Paste: [[2412.12791]] Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning(https://arxiv.org/abs/2412.12791)
Keywords: generation
Abstract: Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a great challenge in accurately locating the temporal location of events, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm by complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module generates differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that captions generated from positively and negatively masked videos are complementary, thereby forming a complete video description. In this way, even under weak supervision, the event location and event caption can be aligned implicitly. Extensive experiments on the public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.
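The complementary-masking idea can be made concrete with a toy sketch. Everything below is an assumption for illustration rather than the paper's mask generation module: the mask is a product of two sigmoids parameterised by a hypothetical event centre and width on a normalised timeline, and the frame features are random numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_event_mask(num_frames, center, width, sharpness=20.0):
    """Smooth (differentiable) mask that is near 1 inside the event window."""
    t = np.linspace(0.0, 1.0, num_frames)                    # normalised time
    rise = sigmoid(sharpness * (t - (center - width / 2)))   # window start
    fall = sigmoid(sharpness * ((center + width / 2) - t))   # window end
    return rise * fall

frames = np.random.default_rng(0).standard_normal((16, 4))  # toy frame features

positive = soft_event_mask(16, center=0.5, width=0.4)  # keeps the event
negative = 1.0 - positive                              # keeps everything else

inside = frames * positive[:, None]
outside = frames * negative[:, None]
```

Because the two masks sum to one at every frame, the positively and negatively masked videos together contain the full input, which is what allows captions generated from the two views to be complementary.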

Title: Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera

Authors: Zhengdi Yu, Stefanos Zafeiriou, Tolga Birdal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12861
Pdf URL: https://arxiv.org/pdf/2412.12861
Copy Paste: [[2412.12861]] Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera(https://arxiv.org/abs/2412.12861)
Keywords: generative
Abstract: We propose Dyn-HaMR, to the best of our knowledge, the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild. Reconstructing accurate 3D hand meshes from monocular videos is a crucial task for understanding human behaviour, with significant applications in augmented and virtual reality (AR/VR). However, existing methods for monocular hand reconstruction typically rely on a weak perspective camera model, which simulates hand motion within a limited camera frustum. As a result, these approaches struggle to recover the full 3D global trajectory and often produce noisy or incorrect depth estimations, particularly when the video is captured by dynamic or moving cameras, which is common in egocentric scenarios. Our Dyn-HaMR consists of a multi-stage, multi-objective optimization pipeline, that factors in (i) simultaneous localization and mapping (SLAM) to robustly estimate relative camera motion, (ii) an interacting-hand prior for generative infilling and to refine the interaction dynamics, ensuring plausible recovery under (self-)occlusions, and (iii) hierarchical initialization through a combination of state-of-the-art hand tracking methods. Through extensive evaluations on both in-the-wild and indoor datasets, we show that our approach significantly outperforms state-of-the-art methods in terms of 4D global mesh recovery. This establishes a new benchmark for hand motion reconstruction from monocular video with moving cameras. Our project page is at this https URL.

Title: ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction

Authors: Zhongjie Duan, Qianyi Zhao, Cen Chen, Daoyuan Chen, Wenmeng Zhou, Yaliang Li, Yingda Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12888
Pdf URL: https://arxiv.org/pdf/2412.12888
Copy Paste: [[2412.12888]] ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction(https://arxiv.org/abs/2412.12888)
Keywords: generation, generative
Abstract: The emergence of diffusion models has significantly advanced image synthesis. Recent studies of model interaction and self-corrective reasoning approaches in large language models offer new insights for enhancing text-to-image models. Inspired by these studies, we propose a novel method called ArtAug for enhancing text-to-image models in this paper. To the best of our knowledge, ArtAug is the first method that improves image synthesis models via model interactions with understanding models. In the interactions, we leverage human preferences implicitly learned by image understanding models to provide fine-grained suggestions for image synthesis models. The interactions can modify the image content to make it aesthetically pleasing, such as adjusting exposure, changing shooting angles, and adding atmospheric effects. The enhancements brought by the interaction are iteratively fused into the synthesis model itself through an additional enhancement module. This enables the synthesis model to directly produce aesthetically pleasing images without any extra computational cost. In the experiments, we train the ArtAug enhancement module on existing text-to-image models. Various evaluation metrics consistently demonstrate that ArtAug enhances the generative capabilities of text-to-image models without incurring additional computational costs. The source code and models will be released publicly.

Title: An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions

Authors: Shreeyash Gowaikar, Srinivasan Iyengar, Sameer Segal, Shivkumar Kalyanaraman
Subjects: cs.LG, cs.CE, cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2412.12898
Pdf URL: https://arxiv.org/pdf/2412.12898
Copy Paste: [[2412.12898]] An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions(https://arxiv.org/abs/2412.12898)
Keywords: generation, generative
Abstract: Piping and Instrumentation Diagrams (P&IDs) are foundational to the design, construction, and operation of workflows in the engineering and process industries. However, their manual creation is often labor-intensive, error-prone, and lacks robust mechanisms for error detection and correction. While recent advancements in Generative AI, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs), have demonstrated significant potential across various domains, their application in automating the generation of engineering workflows remains underexplored. In this work, we introduce a novel copilot for automating the generation of P&IDs from natural language descriptions. Leveraging a multi-step agentic workflow, our copilot provides a structured and iterative approach to diagram creation directly from natural language prompts. We demonstrate the feasibility of the generation process by evaluating the soundness and completeness of the workflow, and show improved results compared to vanilla zero-shot and few-shot generation approaches.

Title: Unsupervised Region-Based Image Editing of Denoising Diffusion Models

Authors: Zixiang Li, Yue Song, Renshuai Tao, Xiaohong Jia, Yao Zhao, Wei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12912
Pdf URL: https://arxiv.org/pdf/2412.12912
Copy Paste: [[2412.12912]] Unsupervised Region-Based Image Editing of Denoising Diffusion Models(https://arxiv.org/abs/2412.12912)
Keywords: generation
Abstract: Although diffusion models have achieved remarkable success in the field of image generation, their latent space remains under-explored. Current methods for identifying semantics within latent space often rely on external supervision, such as textual information and segmentation masks. In this paper, we propose a method to identify semantic attributes in the latent space of pre-trained diffusion models without any further training. By projecting the Jacobian of the targeted semantic region into a low-dimensional subspace which is orthogonal to the non-masked regions, our approach facilitates precise semantic discovery and control over local masked areas, eliminating the need for annotations. We conducted extensive experiments across multiple datasets and various architectures of diffusion models, achieving state-of-the-art performance. In particular, for some specific face attributes, the performance of our proposed method even surpasses that of supervised approaches, demonstrating its superior ability in editing local image properties.

Title: Graph Spring Neural ODEs for Link Sign Prediction

Authors: Andrin Rehmann, Alexandre Bovet
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2412.12916
Pdf URL: https://arxiv.org/pdf/2412.12916
Copy Paste: [[2412.12916]] Graph Spring Neural ODEs for Link Sign Prediction(https://arxiv.org/abs/2412.12916)
Keywords: generation
Abstract: Signed graphs allow for encoding positive and negative relations between nodes and are used to model various online activities. Node representation learning for signed graphs is a well-studied task with important applications such as sign prediction. While the size of datasets is ever-increasing, recent methods often sacrifice scalability for accuracy. We propose a novel message-passing layer architecture called Graph Spring Network (GSN) modeled after spring forces. We combine it with a Graph Neural Ordinary Differential Equations (ODEs) formalism to optimize the system dynamics in embedding space to solve a downstream prediction task. Once the dynamics is learned, embedding generation for novel datasets is done by solving the ODEs in time using a numerical integration scheme. Our GSN layer leverages the fast-to-compute edge vector directions and learnable scalar functions that only depend on nodes' distances in latent space to compute the nodes' positions. Conversely, Graph Convolution and Graph Attention Network layers rely on learnable vector functions that require the full positions of input nodes in latent space. We propose a specific implementation called Spring-Neural-Network (SPR-NN) using a set of small neural networks mimicking attracting and repulsing spring forces that we train for link sign prediction. Experiments show that our method achieves accuracy close to the state-of-the-art methods with node generation time speedup factors of up to 28,000 on large graphs.
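The spring analogy in the abstract can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the authors' GSN or SPR-NN: the learnable scalar functions are replaced by a fixed Hooke's-law rule, the rest lengths `rest_pos` and `rest_neg` and the threshold in `predict_sign` are hypothetical values, and the ODE is integrated with plain explicit Euler steps.

```python
import math

def spring_step(pos, edges, dt=0.1, rest_pos=0.5, rest_neg=2.0):
    """One explicit-Euler step of first-order spring dynamics on a signed graph.

    pos:   dict node -> (x, y) embedding
    edges: list of (u, v, sign) with sign = +1 or -1
    rest_pos / rest_neg are hypothetical rest lengths, not values from the paper.
    """
    force = {n: [0.0, 0.0] for n in pos}
    for u, v, sign in edges:
        dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
        dist = math.hypot(dx, dy) or 1e-9      # avoid division by zero
        rest = rest_pos if sign > 0 else rest_neg
        # Scalar function of the distance only; Hooke's law stands in for
        # the learnable scalar functions mentioned in the abstract.
        magnitude = dist - rest
        fx, fy = magnitude * dx / dist, magnitude * dy / dist
        force[u][0] += fx
        force[u][1] += fy                      # u is pulled toward v if stretched
        force[v][0] -= fx
        force[v][1] -= fy                      # equal and opposite reaction on v
    return {n: (pos[n][0] + dt * force[n][0], pos[n][1] + dt * force[n][1])
            for n in pos}

def integrate(pos, edges, steps=500, dt=0.1):
    """Solve the dynamics forward in time with repeated Euler steps."""
    for _ in range(steps):
        pos = spring_step(pos, edges, dt=dt)
    return pos

def distance(pos, u, v):
    return math.hypot(pos[v][0] - pos[u][0], pos[v][1] - pos[u][1])

def predict_sign(pos, u, v, threshold=1.25):
    """Predict a positive link when two nodes end up close together."""
    return 1 if distance(pos, u, v) < threshold else -1
```

In this toy setting, positively linked nodes settle near `rest_pos` apart and negatively linked nodes near `rest_neg` apart, so a distance threshold between the two rest lengths separates the signs.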

Title: CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models

Authors: Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, Libo Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12932
Pdf URL: https://arxiv.org/pdf/2412.12932
Copy Paste: [[2412.12932]] CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models(https://arxiv.org/abs/2412.12932)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.

Title: Synthetic Data Generation for Anomaly Detection on Table Grapes

Authors: Ionut Marian Motoi, Valerio Belli, Alberto Carpineto, Daniele Nardi, Thomas Alessandro Ciarfuglia
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.12949
Pdf URL: https://arxiv.org/pdf/2412.12949
Copy Paste: [[2412.12949]] Synthetic Data Generation for Anomaly Detection on Table Grapes(https://arxiv.org/abs/2412.12949)
Keywords: generation
Abstract: Early detection of illnesses and pest infestations in fruit cultivation is critical for maintaining yield quality and plant health. Computer vision and robotics are increasingly employed for the automatic detection of such issues, particularly using data-driven solutions. However, the rarity of these problems makes acquiring and processing the necessary data to train such algorithms a significant obstacle. One solution to this scarcity is the generation of synthetic high-quality anomalous samples. While numerous methods exist for this task, most require highly trained individuals for setup. This work addresses the challenge of generating synthetic anomalies in an automatic fashion that requires only an initial collection of normal and anomalous samples from the user - a task that is straightforward for farmers. We demonstrate the approach in the context of table grape cultivation. Specifically, based on the observation that normal berries present relatively smooth surfaces, while defects result in more complex textures, we introduce a Dual-Canny Edge Detection (DCED) filter. This filter emphasizes the additional texture indicative of diseases, pest infestations, or other defects. Using segmentation masks provided by the Segment Anything Model, we then select and seamlessly blend anomalous berries onto normal ones. We show that the proposed dataset augmentation technique improves the accuracy of an anomaly classifier for table grapes and that the approach can be generalized to other fruit types.

Title: ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting

Authors: Guillaume Couairon, Renu Singh, Anastase Charantonis, Christian Lessig, Claire Monteleoni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.12971
Pdf URL: https://arxiv.org/pdf/2412.12971
Copy Paste: [[2412.12971]] ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting(https://arxiv.org/abs/2412.12971)
Keywords: generative
Abstract: Weather forecasting plays a vital role in today's society, from agriculture and logistics to predicting the output of renewable energies, and preparing for extreme weather events. Deep learning weather forecasting models trained with the next state prediction objective on ERA5 have shown great success compared to numerical global circulation models. However, for a wide range of applications, being able to provide representative samples from the distribution of possible future weather states is critical. In this paper, we propose a methodology to leverage deterministic weather models in the design of probabilistic weather models, leading to improved performance and reduced computing costs. We first introduce ArchesWeather, a transformer-based deterministic model that improves upon Pangu-Weather by removing overrestrictive inductive priors. We then design a probabilistic weather model called ArchesWeatherGen based on flow matching, a modern variant of diffusion models, that is trained to project ArchesWeather's predictions to the distribution of ERA5 weather states. ArchesWeatherGen is a true stochastic emulator of ERA5 and surpasses IFS ENS and NeuralGCM on all WeatherBench headline variables (except for NeuralGCM's geopotential). Our work also aims to democratize the use of deterministic and generative machine learning models in weather forecasting research, with academic computing resources. All models are trained at 1.5° resolution, with a training budget of ~9 V100 days for ArchesWeather and ~45 V100 days for ArchesWeatherGen. For inference, ArchesWeatherGen generates 15-day weather trajectories at a rate of 1 minute per ensemble member on an A100 GPU card. To make our work fully reproducible, our code and models are open source, including the complete pipeline for data preparation, training, and evaluation, at this https URL.
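For readers unfamiliar with flow matching, the core idea can be sketched in a few lines. This is an illustration only, assuming the commonly used straight-line path between a source sample `x0` and a data sample `x1`; the abstract does not say which path, conditioning, or solver ArchesWeatherGen uses, and the function names here are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Build one training example for a linear-path flow matching objective.

    x0: source sample (e.g. noise), x1: data sample, t: time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0,
    which a network v(x_t, t) would be regressed onto during training.
    """
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity

def euler_sample(velocity_fn, x0, steps):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At inference time the learned velocity field replaces `velocity_fn`; fewer integration steps trade accuracy for speed.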

Title: Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance

Authors: Wenhao Sun, Benlei Cui, Jingqun Tang, Xue-Mei Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12974
Pdf URL: https://arxiv.org/pdf/2412.12974
Copy Paste: [[2412.12974]] Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance(https://arxiv.org/abs/2412.12974)
Keywords: generation, generative
Abstract: Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and the incapacity to repaint foreground object areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method to empower pre-trained diffusion models for stable and effective object removal. Firstly, in light of the observation that the self-attention maps influence the structure and shape details of the generated images, we propose Attention Activation and Suppression (ASS), which re-engineers the self-attention mechanism within the pre-trained diffusion models based on the given mask, thereby prioritizing the background over the foreground object during the reverse generation process. Moreover, we introduce Self-Attention Redirection Guidance (SARG), which utilizes the self-attention redirected by ASS to guide the generation process, effectively removing foreground objects within the mask while simultaneously generating content that is both plausible and coherent. Experiments demonstrate the stability and effectiveness of Attentive Eraser in object removal across a variety of pre-trained diffusion models, outperforming even training-based methods. Furthermore, Attentive Eraser can be implemented in various diffusion model architectures and checkpoints, enabling excellent scalability. Code is available at this https URL.

Title: Future Aspects in Human Action Recognition: Exploring Emerging Techniques and Ethical Influences

Authors: Antonios Gasteratos, Stavros N. Moutsis, Konstantinos A. Tsintotas, Yiannis Aloimonos
Subjects: cs.CV, cs.ET, cs.RO
Abstract URL: https://arxiv.org/abs/2412.12990
Pdf URL: https://arxiv.org/pdf/2412.12990
Copy Paste: [[2412.12990]] Future Aspects in Human Action Recognition: Exploring Emerging Techniques and Ethical Influences(https://arxiv.org/abs/2412.12990)
Keywords: generation
Abstract: Visual-based human action recognition can be found in various application fields, e.g., surveillance systems, sports analytics, medical assistive technologies, or human-robot interaction frameworks, and it concerns the identification and classification of individuals' activities within a video. Since actions typically occur over a sequence of consecutive images, the task is particularly challenging because it requires temporal analysis, which introduces an extra layer of complexity. Although multiple approaches try to handle temporal analysis, difficulties remain because of their computational cost and lack of adaptability. Therefore, different types of vision data, containing transition information between consecutive images, provided by next-generation hardware sensors will guide the robotics community in tackling the problem of human action recognition. On the other hand, while there is a plethora of still-image datasets that researchers can adopt to train new artificial intelligence models, video datasets representing human activities are limited, e.g., small, unbalanced, or collected without control from multiple sources. To this end, generating new and realistic synthetic videos is possible since labeling is performed throughout the data creation process, while reinforcement learning techniques can permit the avoidance of considerable dataset dependence. At the same time, human factors' involvement raises ethical issues for the research community, as doubts and concerns about new technologies already exist.

Title: A New Adversarial Perspective for LiDAR-based 3D Object Detection

Authors: Shijun Zheng, Weiquan Liu, Yu Guo, Yu Zang, Siqi Shen, Cheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13017
Pdf URL: https://arxiv.org/pdf/2412.13017
Copy Paste: [[2412.13017]] A New Adversarial Perspective for LiDAR-based 3D Object Detection(https://arxiv.org/abs/2412.13017)
Keywords: generation, generative
Abstract: Autonomous vehicles (AVs) rely on LiDAR sensors for environmental perception and decision-making in driving scenarios. However, ensuring the safety and reliability of AVs in complex environments remains a pressing challenge. To address this issue, we introduce a real-world dataset (ROLiD) comprising LiDAR-scanned point clouds of two random objects: water mist and smoke. In this paper, we introduce a novel adversarial perspective by proposing an attack framework that utilizes water mist and smoke to simulate environmental interference. Specifically, we propose a point cloud sequence generation method using a motion and content decomposition generative adversarial network named PCS-GAN to simulate the distribution of random objects. Furthermore, leveraging the simulated LiDAR scanning characteristics implemented with Range Image, we examine the effects of introducing random object perturbations at various positions on the target vehicle. Extensive experiments demonstrate that adversarial perturbations based on random objects effectively deceive vehicle detection and reduce the recognition rate of 3D object detection models.

Title: Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Authors: Weiguo Pian, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.13050
Pdf URL: https://arxiv.org/pdf/2412.13050
Copy Paste: [[2412.13050]] Modality-Inconsistent Continual Learning of Multimodal Large Language Models(https://arxiv.org/abs/2412.13050)
Keywords: generation
Abstract: In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our proposed MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.

Title: VidTok: A Versatile and Open-Source Video Tokenizer

Authors: Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.13061
Pdf URL: https://arxiv.org/pdf/2412.13061
Copy Paste: [[2412.13061]] VidTok: A Versatile and Open-Source Video Tokenizer(https://arxiv.org/abs/2412.13061)
Keywords: generation
Abstract: Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy in pixel-level representations. Consequently, there is a growing demand for high-performance, open-source video tokenizers as video-centric research gains prominence. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches: 1) the model architecture, such as convolutional layers and up/downsampling modules; 2) the integration of Finite Scalar Quantization (FSQ) into discrete video tokenization, to address the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ); 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By integrating these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.
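The quantization idea behind FSQ can be sketched briefly. This is a minimal illustration of the general technique, not VidTok's implementation: each latent channel is squashed to a bounded range and rounded to one of a few integer levels, so the codebook is implicit and nothing has to be learned. The level counts are hypothetical, only odd level counts are handled, and the straight-through gradient used in training is omitted.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent channel to one of levels[i] integer values.

    z:      list of floats, one per channel
    levels: list of odd ints, the number of quantization levels per channel
            (hypothetical values; even counts need an extra offset, omitted here)
    Returns integer codes in [-(L-1)/2, (L-1)/2] for each channel.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2.0
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(round(bounded))        # snap to the nearest integer level
    return codes

def fsq_index(codes, levels):
    """Map per-channel codes to a single token id (mixed-radix encoding)."""
    index, stride = 0, 1
    for code, num_levels in zip(codes, levels):
        index += (code + (num_levels - 1) // 2) * stride
        stride *= num_levels
    return index

def codebook_size(levels):
    """The implicit codebook has one entry per combination of levels."""
    return math.prod(levels)
```

With levels `[5, 5, 3]` the implicit codebook has 75 entries. Because every combination of levels is a valid code, this style of quantizer has no learned codebook entries that can go unused, which is the usual motivation for preferring it over VQ.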

Title: Prompt Augmentation for Self-supervised Text-guided Image Manipulation

Authors: Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13081
Pdf URL: https://arxiv.org/pdf/2412.13081
Copy Paste: [[2412.13081]] Prompt Augmentation for Self-supervised Text-guided Image Manipulation(https://arxiv.org/abs/2412.13081)
Keywords: generation
Abstract: Text-guided image editing finds applications in various creative and practical fields. While recent studies in image generation have advanced the field, they often struggle with the dual challenges of coherent image transformation and context preservation. In response, our work introduces prompt augmentation, a method amplifying a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. Specifically, we use the augmented prompts to delineate the intended manipulation area. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. Acknowledging the continuous nature of image manipulations, we further refine our approach by incorporating the similarity concept, creating a Soft Contrastive Loss. The new losses are incorporated into the diffusion model, demonstrating improved or competitive image editing results on public datasets and generated images over state-of-the-art approaches.

Title: Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation

Authors: Huaijin Pi, Ruoxi Guo, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.13111
Pdf URL: https://arxiv.org/pdf/2412.13111
Copy Paste: [[2412.13111]] Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation(https://arxiv.org/abs/2412.13111)
Keywords: generation
Abstract: Text-driven human motion synthesis is capturing significant attention for its ability to effortlessly generate intricate movements from abstract text cues, showcasing its potential for revolutionizing motion design not only in film narratives but also in virtual reality experiences and computer game development. Existing methods often rely on 3D motion capture data, which require special setups resulting in higher costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore leveraging 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-motion pairs. To enhance this model to synthesize 3D motion, we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Experiments on the HumanML3D dataset and novel text prompts demonstrate that our method efficiently utilizes 2D data, supporting realistic 3D human motion generation and broadening the range of motion types it supports. Our code will be made publicly available at this https URL.

Title: F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration

Authors: Lu Liu, Huiyu Duan, Qiang Hu, Liu Yang, Chunlei Cai, Tianxiao Ye, Huayu Liu, Xiaoyun Zhang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13155
Pdf URL: https://arxiv.org/pdf/2412.13155
Copy Paste: [[2412.13155]] F-Bench: Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation, Customization, and Restoration(https://arxiv.org/abs/2412.13155)
Keywords: restoration, generation, generative, quality assessment
Abstract: Artificial intelligence generative models exhibit remarkable capabilities in content creation, particularly in face image generation, customization, and restoration. However, current AI-generated faces (AIGFs) often fall short of human preferences due to unique distortions, unrealistic details, and unexpected identity shifts, underscoring the need for a comprehensive quality evaluation framework for AIGFs. To address this need, we introduce FaceQ, a large-scale, comprehensive database of AI-generated Face images with fine-grained Quality annotations reflecting human preferences. The FaceQ database comprises 12,255 images generated by 29 models across three tasks: (1) face generation, (2) face customization, and (3) face restoration. It includes 32,742 mean opinion scores (MOSs) from 180 annotators, assessed across multiple dimensions: quality, authenticity, identity (ID) fidelity, and text-image correspondence. Using the FaceQ database, we establish F-Bench, a benchmark for comparing and evaluating face generation, customization, and restoration models, highlighting strengths and weaknesses across various prompts and evaluation dimensions. Additionally, we assess the performance of existing image quality assessment (IQA), face quality assessment (FQA), AI-generated content image quality assessment (AIGCIQA), and preference evaluation metrics, showing that these standard metrics are relatively ineffective in evaluating authenticity, ID fidelity, and text-image correspondence. The FaceQ database will be publicly available upon publication.

Title: Move-in-2D: 2D-Conditioned Human Motion Generation

Authors: Hsin-Ping Huang, Yang Zhou, Jui-Hsien Wang, Difan Liu, Feng Liu, Ming-Hsuan Yang, Zhan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13185
Pdf URL: https://arxiv.org/pdf/2412.13185
Copy Paste: [[2412.13185]] Move-in-2D: 2D-Conditioned Human Motion Generation(https://arxiv.org/abs/2412.13185)
Keywords: generation
Abstract: Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often use existing motion extracted from other videos, which restricts applications to specific motion types and global scene matching. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.

Title: StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Authors: Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, Sida Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13188
Pdf URL: https://arxiv.org/pdf/2412.13188
Copy Paste: [[2412.13188]] StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models(https://arxiv.org/abs/2412.13188)
Keywords: generative
Abstract: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis, while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes, enlarges the regions in which view synthesis renders satisfactorily, and outperforms existing methods.

Title: MotionBridge: Dynamic Video Inbetweening with Flexible Controls

Authors: Maham Tanveer, Yang Zhou, Simon Niklaus, Ali Mahdavi Amiri, Hao Zhang, Krishna Kumar Singh, Nanxuan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13190
Pdf URL: https://arxiv.org/pdf/2412.13190
Copy Paste: [[2412.13190]] MotionBridge: Dynamic Video Inbetweening with Flexible Controls(https://arxiv.org/abs/2412.13190)
Keywords: generation
Abstract: By generating plausible and smooth transitions between two image frames, video inbetweening is an essential tool for video editing and long video synthesis. Traditional methods lack the capability to generate complex large motions. While recent video generation techniques are powerful in creating high-quality results, they often lack fine control over the details of intermediate frames, which can lead to results that do not align with the creator's intent. We introduce MotionBridge, a unified video inbetweening framework that allows flexible controls, including trajectory strokes, keyframes, masks, guide pixels, and text. However, learning such multi-modal controls in a unified framework is a challenging task. We thus design two generators to extract the control signal faithfully and encode features through dual-branch embedders to resolve ambiguities. We further introduce a curriculum training strategy to smoothly learn various controls. Extensive qualitative and quantitative experiments have demonstrated that such multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.

Title: CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Authors: Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, Xinguo Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.13195
Pdf URL: https://arxiv.org/pdf/2412.13195
Copy Paste: [[2412.13195]] CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models(https://arxiv.org/abs/2412.13195)
Keywords: generation
Abstract: Text-to-image diffusion models excel at generating photorealistic images, but commonly struggle to render accurate spatial relationships described in text prompts. We identify two core issues underlying this common failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances spatial understanding of any T2I diffusion model. CoMPaSS resolves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, effectively compensating for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models covering both UNet- and MMDiT-based architectures demonstrate the effectiveness of CoMPaSS, which sets a new state of the art with substantial relative gains across well-known benchmarks on spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at this https URL.